A Hierarchical Framework for Speech Recognition and Understanding

James Glass, Stephanie Seneff, and Victor Zue

Laboratory for Computer Science
Massachusetts Institute of Technology

CONTACT INFORMATION

MIT Laboratory for Computer Science
545 Technology Square, Room 601
Cambridge, MA 02139
Phone: (617) 253-1640 (Glass), -0451 (Seneff), -8513 (Zue)
FAX: (617) 258-8642
Email: glass@mit.edu, seneff@mit.edu, zue@mit.edu

WWW PAGE

http://www.sls.lcs.mit.edu

PROGRAM AREA

Speech and Natural Language Understanding.

KEYWORDS

Speech recognition, speech understanding, lexical representation, sub-lexical modelling, morpho-phonological modelling.

PROJECT SUMMARY

The majority of current computer speech recognition systems model the speech signal with homogeneous observation frames, represent words by a string of phonemes, and rely heavily on statistical word-based language models to decode the underlying word sequence. This project, which began in March 1997, aims to investigate an alternative approach that incorporates many more levels of linguistic information into a parsimonious hierarchical framework for speech recognition and understanding. This approach will provide new perspectives on incorporating constraints from the distinctive feature, phonetic, phonological, syllabic, morphological, lexical, syntactic, and semantic levels into a probabilistic framework for speech recognition and understanding. Structure sharing of sub-word levels across words will allow for the generalization of phonological effects across similar environments and increased flexibility for dynamic vocabularies and language models. Structure sharing should also produce a more efficient search with a smaller number of parameters. The proposed hierarchical framework also has the potential of serving as a recognition kernel, with the speech signal as input and a set of morpho-phonological units as output. This kernel, whose internals would be vocabulary and task independent, would have a finite inventory of units for a given language. To ensure that the proposed framework is language independent, its utility will also be investigated for languages other than English.
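
The structure-sharing idea can be illustrated with a minimal sketch. It is not the project's implementation: the class `SharedInventory`, the morph-like segmentation of the example words, and the ARPAbet-style phoneme labels are all assumptions made for illustration. The point is only that a sub-word unit stored once can be reused by every word that contains it, so the unit inventory grows more slowly than the vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class SharedInventory:
    """Hypothetical store that keeps each sub-word unit once,
    however many words use it."""
    units: dict = field(default_factory=dict)    # phoneme tuple -> unit id
    lexicon: dict = field(default_factory=dict)  # word -> list of unit ids

    def intern(self, phonemes):
        # Reuse the existing unit if this phoneme sequence was seen before.
        key = tuple(phonemes)
        if key not in self.units:
            self.units[key] = len(self.units)
        return self.units[key]

    def add_word(self, word, segments):
        # A word is just a sequence of references to shared units.
        self.lexicon[word] = [self.intern(s) for s in segments]


inv = SharedInventory()
# Illustrative segmentations only (assumed, not taken from the project).
inv.add_word("speaking", [["s", "p", "iy", "k"], ["ih", "ng"]])
inv.add_word("speaker",  [["s", "p", "iy", "k"], ["er"]])
inv.add_word("peaking",  [["p", "iy", "k"], ["ih", "ng"]])

unshared = sum(len(ids) for ids in inv.lexicon.values())
shared = len(inv.units)
print(unshared)  # 6 unit slots if every word kept its own copies
print(shared)    # 4 distinct units when structure is shared
```

Adding a new word whose segments already exist in the inventory creates no new units, which is one way to picture the flexibility for dynamic vocabularies mentioned above.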

PROJECT REFERENCES

J. Glass, J. Chang, and M. McCandless, ``A Probabilistic Framework for Feature-Based Speech Recognition,'' Proc. ICSLP (Int'l Conference on Spoken Language Processing), 2277-2280, Philadelphia, October, 1996.

S. Seneff, R. Lau, and H. Meng, ``ANGIE: A New Framework for Speech Analysis Based on Morpho-Phonological Modelling,'' Proc. ICSLP, 110-113, Philadelphia, October, 1996.

AREA BACKGROUND

Verbal communication with computers has long been recognized as a desirable, perhaps even necessary, ingredient of an intuitive human-computer interface for ordinary citizens, since it is by far the most natural, flexible, efficient and economical way for humans to communicate. Great strides have been made by the research community over the last two decades. However, many important unsolved problems still prevent the technology from reaching its full potential. To find the solutions to these problems, one may need to step back from the dominant technology (e.g., hidden Markov modelling) and explore new approaches capable of making better use of the rich sources of constraints that exist in the human communication process.

Regardless of the specific recognition approach, present-day speech recognition systems, including our own, do not fully utilize the constraints that exist in the speech communication process. As an example, consider the representation of words in the lexicon. Most systems nowadays represent words in terms of subword units such as phonemes or phones, which are modelled according to the context in which they appear. Such a mapping from (context-dependent) phonemes directly to words ignores the intermediate linguistic levels that exert influence on the acoustic realization of words. For example, the duration of the /p/ release differs significantly for the two words ``peak'' and ``speak,'' due to their syllable structures, but the same acoustic differences for ``display'' and ``misplace'' can only be identified through morphological analysis. Similarly, the homorganic rule stating that nasal-stop combinations must have the same place of articulation operates when they are both in the coda position, as in ``think,'' but not when the stop is in the affix position, as in ``dreamt.'' Examples such as these can be found at many intermediate levels between the acoustic signal and words in the lexicon. Proper utilization of these constraints has the potential of greatly increasing the word accuracy while decreasing the reliance on word-level statistical language models.
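
The homorganic example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function name `violates_homorganic_rule`, the position labels, the ARPAbet-style phoneme symbols, and the reduced place-of-articulation table are all hypothetical and are not drawn from the project's systems. The sketch shows why a flat phoneme string is not enough: the rule can only be applied correctly when each segment carries its position (coda versus affix) in the sub-word hierarchy.

```python
# Place of articulation for a small, illustrative subset of phonemes.
PLACE = {
    "m": "labial", "p": "labial", "b": "labial",
    "n": "alveolar", "t": "alveolar", "d": "alveolar",
    "ng": "velar", "k": "velar", "g": "velar",
}
NASALS = {"m", "n", "ng"}
STOPS = {"p", "b", "t", "d", "k", "g"}


def violates_homorganic_rule(segments):
    """Return True if an adjacent nasal-stop pair, both in coda position,
    differs in place of articulation.

    `segments` is a list of (phoneme, position) pairs, where position is
    one of "onset", "nucleus", "coda", or "affix" (assumed labels).
    """
    for (p1, pos1), (p2, pos2) in zip(segments, segments[1:]):
        if p1 in NASALS and p2 in STOPS and pos1 == pos2 == "coda":
            if PLACE[p1] != PLACE[p2]:
                return True
    return False


think = [("th", "onset"), ("ih", "nucleus"), ("ng", "coda"), ("k", "coda")]
dreamt = [("d", "onset"), ("r", "onset"), ("eh", "nucleus"),
          ("m", "coda"), ("t", "affix")]

# Discarding the morphological level: treat the affix /t/ as plain coda.
flat_dreamt = [(p, "coda" if pos == "affix" else pos) for p, pos in dreamt]

print(violates_homorganic_rule(think))        # False: /ng k/ are both velar
print(violates_homorganic_rule(dreamt))       # False: /t/ is an affix
print(violates_homorganic_rule(flat_dreamt))  # True: rule misapplied
```

With the affix label removed, the constraint wrongly rejects ``dreamt,'' which is the kind of error that a representation with intermediate linguistic levels is meant to avoid.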

AREA REFERENCES

C.-H. Lee, F. Soong, and K. Paliwal (eds.), Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers, 1996.

J. Perkell and D. Klatt (eds.), Invariance and Variability in Speech Processes, Lawrence Erlbaum Associates, 1986.

L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.

U. Frauenfelder and L. Tyler (eds.), Spoken Word Recognition, MIT Press, 1987.

A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, Morgan Kaufmann, 1990.

RELATED PROGRAM AREAS

1. Virtual Environments.

3. Other Communication Modalities.

4. Adaptive Human Interfaces.

6. Intelligent Interactive Systems for Persons with Disabilities.