alphabet - alphabet and alpha-digit recognizer
package require Alphabet
alphabet initialize recogvar {directory NULL} {type ALPHA} {pau NULL}
{btheap NULL} {infovar NULL}
alphabet pipe recogvar w
alphabet result recogvar {nbest 3}
alphabet nuke recogvar
alphabet reset recogvar
alphabet implements the CSLU alphabet and alpha-digit recognition engine. Recognition is a three-stage process. First, a "conventional" frame-based recognizer with an alpha, digit, or alpha-digit vocabulary is run on the speech. The letters and digits found are then reclassified with a whole-word classifier (a larger neural network), which produces an output value for each of the 26 letters and 10 digits, plus one for NL (not a letter), which is trained on miscellaneous sounds in the alpha/digit corpus. Finally, these letter and digit scores are used to find the top-scoring names in a directory, which has a tree structure in which entries share common prefixes. Because of the tree structure, it is possible to search hundreds of thousands of names efficiently (given sufficient memory).
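To illustrate why shared prefixes keep a large directory compact, here is a minimal sketch in plain Tcl. It is not the toolkit's internal data structure: the procedure name build_prefix_tree, the array-keyed-by-prefix representation, and the sample names are all assumptions made for illustration.

```tcl
# Insert each name into a prefix tree stored as an array keyed by prefix.
# Each distinct prefix is one tree node, so names that begin with the
# same letters reuse the same nodes.
proc build_prefix_tree {names treevar} {
    upvar 1 $treevar tree
    foreach name $names {
        set prefix ""
        foreach ch [split $name ""] {
            append prefix $ch
            set tree($prefix) 1
        }
    }
    return [array size tree]
}

# Hypothetical directory entries.
set names {smith smyth smithson jones}
set nodes [build_prefix_tree $names tree]

# Total letters a flat list of the same names would have to store.
set chars 0
foreach name $names {
    incr chars [string length $name]
}
puts "$chars letters stored in $nodes tree nodes"
```

In this sample the three names beginning with "sm" share their leading nodes, so the tree holds fewer nodes than the flat list holds letters. A score applied to a shared prefix node then covers every name below it.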
The first pass takes the most time because there are many more frames than letters. It can be pipelined like the other CSLU frame-based recognizers using the alphabet pipe function (which can also be called once for the entire utterance, of course).
The initialize function call creates an instance of the alphabet/alpha-digit recognizer. Multiple recognition engines may thus be created through successive function calls to alphabet initialize. The type of the recognizer is decided by the optional parameters type and pau.
The type parameter indicates whether the grammar contains letters of the alphabet only (type set to ALPHA), letters of the alphabet and digits (type set to ALPHADIGIT), or digits only (type set to DIGIT). The same recognition engine is used in all three cases, but the first-pass grammar is constrained according to the specified type, which leads to better segmentation and better performance. The scores matrix returned (see below) is also limited according to the specified type.
The pau parameter, which defaults to NULL (i.e., no pause), indicates whether the grammar expects fluently spoken letters or digits or forced pauses between them. If you know the users will pause, then setting pau (making it anything but {}) will help performance.
The infovar variable allows access to a couple of tunable parameters. The defaults for these are:
The alphabet result function call will do the second and third stages, calling the alpha/digit neural network for each of the letters or digits found in the first pass and then updating the directory search tree (if any) with the scores. If there was a directory, the top N scoring names are returned along with their confidence. Here is the structure of the returned list:
0: {{name1 conf1} {name2 conf2} .. {nameN confN}}
1: {{raw-let-1 letconf1} {raw-let-2 letconf2} .. {raw-let-M letconfM}}
2: letter segmentation
3: phoneme segmentation
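As an illustration of the list layout above, the following sketch unpacks such a list with plain Tcl list commands. The sample value is made-up data standing in for what alphabet result returns; the names, confidences, and placeholder segmentation strings are assumptions, not toolkit output.

```tcl
# Hypothetical sample shaped like the list described above.
# A real value would come from: set res [alphabet result recogvar 3]
set res {
    {{smith 0.62} {smyth 0.41} {smit 0.18}}
    {{s 0.81} {m 0.70} {i 0.35} {t 0.66} {h 0.72}}
    {letter-segmentation-placeholder}
    {phoneme-segmentation-placeholder}
}

# Element 0: candidate names with their confidences.
set names [lindex $res 0]
# Element 1: raw letters found by the recognizer, with their confidences.
set letters [lindex $res 1]

# Build a readable "name (confidence)" entry for each candidate.
set report {}
foreach pair $names {
    set name [lindex $pair 0]
    set conf [lindex $pair 1]
    lappend report "$name ($conf)"
}

# Concatenate the raw letters into the string as it was spelled.
set spelled ""
foreach pair $letters {
    append spelled [lindex $pair 0]
}

puts "candidates: [join $report {, }]"
puts "raw letters: $spelled"
```

Without a directory there are no names to report, so a caller would presumably rely on element 1 alone; the exact content of element 0 in that case is not stated here.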
The confidence for a letter is the relative likelihood of a true instance of the letter getting that score (the output of the nnet) or worse compared with the likelihood of some extraneous speech getting that score or better, as estimated on OGI speech corpora. A score of .5 means that score is equally likely to be in vocabulary as out (assuming the prior probabilities are equal).
The confidence for a name is computed from the corresponding letter confidences by taking their geometric mean. Based on some early experience, using a threshold of .3 for rejection seems to strike a good balance, but this will depend on many factors.
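The geometric mean and the rejection threshold described above can be sketched in plain Tcl as follows. This is an illustration, not the toolkit's code: the procedure name geometric_mean and the letter confidences are hypothetical, and the handling of a zero confidence is an assumption.

```tcl
# Geometric mean of a list of confidences, computed in the log domain
# so that long names do not underflow.
proc geometric_mean {confs} {
    set n [llength $confs]
    if {$n == 0} {
        error "need at least one confidence"
    }
    set logsum 0.0
    foreach c $confs {
        if {$c <= 0.0} {
            # Assumption: a zero factor forces the product, and so the
            # mean, to zero.
            return 0.0
        }
        set logsum [expr {$logsum + log($c)}]
    }
    return [expr {exp($logsum / $n)}]
}

# Hypothetical letter confidences for a five-letter name.
set letconfs {0.81 0.70 0.35 0.66 0.72}
set nameconf [geometric_mean $letconfs]

# 0.3 is the rejection threshold suggested in the text above.
set threshold 0.3
if {$nameconf < $threshold} {
    puts [format "reject (confidence %.2f)" $nameconf]
} else {
    puts [format "accept (confidence %.2f)" $nameconf]
}
```

Because the mean is geometric, a single very low letter confidence pulls the name confidence down more sharply than an arithmetic average would.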
The raw scores are no longer in the return list, but are available in an arrayF structure:
set res [alphabet result abc]
set scores $abc(scores)
puts [lindex [lindex [mx puts $scores.(0:0,2:2)] 0] 0]
The above code prints the score for zero in the third position for ALPHADIGIT or DIGIT recognition, and prints the score for "a" in ALPHA recognition.
The alphabet nuke function destroys (nukes) all memory associated with a recognizer and its search engine.
Directories are simply sets of alpha/digit strings. The directory has to be in a certain format before it can be used. The command line program "precompile" in the CSLUDIR bin is used to turn a list of alpha/digit strings into a usable (tree-structured) format:
precompile list > list.cn
package require Alphabet 1.0
alphabet initialize recog names_50k.cn
set w [wave read firstname.wav]
alphabet pipe recog $w
set res [alphabet result recog 2]
set names [lindex $res 0]
set topname [lindex $names 0]
set secondname [lindex $names 1]
puts "the top name was [lindex $topname 0]"
puts "the second was [lindex $secondname 0]"
alphabet nuke recog
alphabet result returns the spelled name retrieved (if the directory option was set), the word-level recognition alignment, and the phoneme-level alignment.
Johan Schalkwyk
Mark Fanty
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology