Tech Blog

Jay's Technical blog

WP8 Speech Recognizers

15 April 2013
Jay Kimble

[WARNING! This is an archived post and as such there may be things broken/missing here.. you have been warned.]

If you don’t know, I have been deep into WP8’s Speech SDK for the last 6 months, so much so that I am starting to help others along the way. I just got a question about how to get the confidence rating on a recognized word.

If you don’t know what that means, let me explain. Speech recognition is not an exact science. There are complicated algorithms that analyze your speech to determine what words were spoken, and the confidence rating is the recognizer’s own estimate of how likely it is to have gotten them right. One of the cool things with WP8’s SDK is that you can provide a list of words or phrases you are looking for, and this makes things a little more accurate (mainly because you are limiting the number of combinations to look for).

WP8 has 2 objects that you can use to recognize speech: SpeechRecognizer and SpeechRecognizerUI.

SpeechRecognizer gives you something a little more low level. It won't show the “pretty” UI that is shown while listening to the user speak, it won’t play a beep to indicate that speech is being recorded for recognition, and it doesn’t give the user any feedback whatsoever. It does let you supply a list of words or phrases you are looking for, and it can also do the big check that listens for any word. The SpeechRecognitionResult it returns gives you the word (or words) it thinks the user said. It also gives you a list of alternates (via the GetAlternates() method) that might be what the user said, along with a confidence rating that tells you how accurate it estimates the recognition was (each of the alternates gets its own rating as well). It will also let you do your own recording and will process the speech from that recording, but it accepts a much smaller clip than what the actual MS mechanisms do. Recognizing against any word, rather than against a list, allows for an even shorter amount of data.
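To make the confidence question concrete, here is a minimal sketch of the low-level path. It assumes a WP8 app with the speech recognition and microphone capabilities enabled. The class name, method name, grammar key, and phrases are placeholders of mine. The result type is written as SpeechRecognitionResult and the rating is read from TextConfidence, which is how I recall the SDK naming them, so check both against the SDK docs before relying on this.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Windows.Phone.Speech.Recognition;

public static class ConfidenceSketch
{
    // Hypothetical helper: listens once against a short phrase list and logs
    // the recognized text, its confidence, and any alternates.
    public static async Task ListenForColorAsync()
    {
        var recognizer = new SpeechRecognizer();

        // Restrict recognition to a short list of phrases.
        // The key "colors" and the phrases are placeholders.
        recognizer.Grammars.AddGrammarFromList(
            "colors", new[] { "red", "green", "blue" });

        // No UI and no beep here: the app has to tell the user when to speak.
        SpeechRecognitionResult result = await recognizer.RecognizeAsync();

        // TextConfidence is a SpeechRecognitionConfidence value
        // (Rejected, Low, Medium, or High).
        Debug.WriteLine("Heard '{0}' ({1})", result.Text, result.TextConfidence);

        // Ask for up to three alternates; each carries its own confidence.
        foreach (SpeechRecognitionResult alt in result.GetAlternates(3))
        {
            Debug.WriteLine("  or '{0}' ({1})", alt.Text, alt.TextConfidence);
        }
    }
}
```

Because nothing is shown to the user on this path, a typical use would be to check the confidence yourself and re-prompt when it comes back Rejected or Low.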

SpeechRecognizerUI does a lot of work for you. It puts up the UI, plays the beep to indicate to the user to speak, handles the case where a word isn't quite recognized, and generally lets the user know what is going on throughout the process. It returns a SpeechRecognitionUIResult object which contains a SpeechRecognitionResult (in the RecognitionResult property), but the confidence is usually high (I have yet to see any alternates come through). I think this is mainly because the UI object does the extra work of clarifying with the user when it estimates that the accuracy of the recognition isn’t quite that high.
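The UI path looks much the same in code. This is only a sketch under the same assumptions as before (a WP8 app with the speech capabilities enabled). The class name, method name, prompt text, and phrase list are placeholders of mine, and the type and property names (SpeechRecognitionUIResult, RecognitionResult, ResultStatus) are written as I recall them from the SDK.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Windows.Phone.Speech.Recognition;

public static class RecognizerUiSketch
{
    // Hypothetical helper: shows the built-in speech UI and returns the
    // recognized phrase, or null if the user cancelled or nothing was heard.
    public static async Task<string> AskForColorAsync()
    {
        var recognizerUi = new SpeechRecognizerUI();

        // Text shown on the built-in listening screen (placeholder strings).
        recognizerUi.Settings.ListenText = "Pick a color";
        recognizerUi.Settings.ExampleText = "red, green, blue";

        // The UI object wraps a SpeechRecognizer, so the phrase list is
        // registered on the inner recognizer.
        recognizerUi.Recognizer.Grammars.AddGrammarFromList(
            "colors", new[] { "red", "green", "blue" });

        SpeechRecognitionUIResult uiResult =
            await recognizerUi.RecognizeWithUIAsync();

        // Anything other than Succeeded means there is no usable result.
        if (uiResult.ResultStatus != SpeechRecognitionUIStatus.Succeeded)
        {
            return null;
        }

        SpeechRecognitionResult result = uiResult.RecognitionResult;
        Debug.WriteLine("Heard '{0}' ({1})", result.Text, result.TextConfidence);
        return result.Text;
    }
}
```

The inner result is the same type the low-level recognizer returns, so the confidence and alternates can be read the same way, even if the alternates list tends to come back empty here.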

From my work, I have found that I tend to use SpeechRecognizerUI more often than the lower-level mechanism, mainly because the added UI and indicators create a very good experience for the user.

[I find it difficult not to attribute human qualities to the recognizer. It was tough not to use words like “guess” during this post. It really is amazing how accurate the recognizer is, and when it is wrong, it’s usually pretty understandable why.]