A Hierarchical Framework for Speech Recognition and Understanding
James Glass, Stephanie Seneff, and Victor Zue
Laboratory for Computer Science
Massachusetts Institute of Technology
CONTACT INFORMATION
MIT Laboratory for Computer Science
545 Technology Square, Room 601
Cambridge, MA 02139
Phone: (617) 253-1640 (Glass), -0451 (Seneff), -8513 (Zue)
FAX: (617) 258-8642
Email: glass@mit.edu,
seneff@mit.edu,
zue@mit.edu
WWW PAGE
http://www.sls.lcs.mit.edu
PROGRAM AREA
Speech and Natural Language Understanding.
KEYWORDS
Speech recognition, speech understanding, lexical representation,
sub-lexical modelling, morpho-phonological modelling.
PROJECT SUMMARY
The majority of current computer speech recognition systems model the
speech signal with homogeneous observation frames, represent words by
a string of phonemes, and rely heavily on statistical word-based
language models to decode the underlying word sequence. This project,
which began in March 1997, aims to investigate an alternative approach
that incorporates many more levels of linguistic information into a
parsimonious hierarchical framework for speech recognition and
understanding. This approach will provide new perspectives on
incorporating constraints from the distinctive feature, phonetic,
phonological, syllabic, morphological, lexical, syntactic, and
semantic levels into a probabilistic framework for speech recognition
and understanding. Structure sharing of sub-word levels across words
will allow for the generalization of phonological effects across
similar environments and increased flexibility for dynamic
vocabularies and language models. Structure sharing should also
produce a more efficient search with a smaller number of parameters.
The proposed hierarchical framework also has the potential of serving
as a recognition kernel, with the speech signal as input and a set of
morpho-phonological units as output. This kernel would have a finite
inventory of units for a given language, whose internals would be
vocabulary and task independent. To ensure that the proposed
framework is language independent, its utility will also be
investigated for languages other than English.
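As a rough illustration of the structure-sharing idea, the following sketch uses a toy lexicon in which words are decomposed into sub-word units. The unit labels (such as "dis+" and "play"), the lexicon itself, and the function names are hypothetical and chosen only for illustration; they are not the unit inventory of the proposed framework. The sketch shows that a shared inventory can be smaller than the total number of unit tokens across words, and that a new word built from existing units needs no new units.

```python
# Toy sketch of structure sharing across words.
# ASSUMPTION: the unit labels and this lexicon are invented for illustration.
LEXICON = {
    "display": ("dis+", "play"),
    "displace": ("dis+", "place"),
    "misplace": ("mis+", "place"),
    "replay": ("re+", "play"),
}


def unit_inventory(lexicon):
    """Return the set of distinct sub-word units used by the lexicon."""
    return {unit for units in lexicon.values() for unit in units}


def new_units_needed(lexicon, units):
    """Return the units of a candidate word that the inventory lacks."""
    return set(units) - unit_inventory(lexicon)


inventory = unit_inventory(LEXICON)
total_tokens = sum(len(units) for units in LEXICON.values())

print(len(inventory), "shared units cover", total_tokens, "unit tokens")
# A hypothetical new word "misplay" reuses existing units only.
print(new_units_needed(LEXICON, ("mis+", "play")))
```

In this toy setting, any parameters attached to a unit would be estimated once and reused by every word containing it, which is the intuition behind the smaller parameter count and the easier vocabulary extension described above.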
PROJECT REFERENCES
J. Glass, J. Chang, and M. McCandless, ``A Probabilistic Framework for
Feature-Based Speech Recognition,'' Proc. ICSLP (Int'l Conference on
Spoken Language Processing), 2277-2280, Philadelphia, October, 1996.
S. Seneff, R. Lau, and H. Meng, ``ANGIE: A New Framework for Speech
Analysis Based on Morpho-Phonological Modelling,'' Proc. ICSLP,
110-113, Philadelphia, October, 1996.
AREA BACKGROUND
Verbal communication with computers has long been recognized as a
desirable, perhaps even necessary, ingredient of an intuitive
human-computer interface for ordinary citizens, since it is by far the
most natural, flexible, efficient, and economical way for humans to
communicate. Great strides have been made by the research community
over the last two decades. However, there are many important unsolved
problems that will prevent the technology from reaching its full
potential. To find the solutions to these problems, one may need to
step back from the dominant technology (e.g., hidden Markov modelling)
and explore new approaches capable of making better use of the rich
sources of constraints that exist in the human communication process.
Regardless of the specific recognition approach, present-day speech
recognition systems, including our own, do not fully utilize the
constraints that exist in the speech communication process. As an
example, consider the representation of words in the lexicon. Most
systems nowadays represent words in terms of subword units such as
phonemes or phones, which are modelled according to the context in
which they appear. Such a mapping from (context-dependent) phonemes
directly to words ignores the intermediate linguistic levels that
exert influence on the acoustic realization of words. For example,
the duration of the /p/ release differs significantly for the two
words ``peak'' and ``speak,'' due to their syllable structures, but the
same acoustic differences for ``display'' and ``misplace'' can only be
identified through morphological analysis. Similarly, the homorganic
rule stating that nasal-stop combinations must have the same place of
articulation operates when they are both in the coda position, as in
``think'', but not when the stop is in the affix position, as in
``dreamt.'' Examples such as these can be found at many intermediate
levels between the acoustic signal and words in the lexicon. Proper
utilization of these constraints has the potential of greatly
increasing the word accuracy while decreasing the reliance on
word-level statistical language models.
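The homorganic example above can be sketched as a position-sensitive check. This is a minimal illustration, not the project's method: the phone labels, the place-of-articulation table, and the "onset" / "nucleus" / "coda" / "affix" position tags are all assumptions introduced here. The point is only that the constraint cannot be stated over a flat phoneme string; it needs the sub-word position of each segment.

```python
# Sketch of a homorganic nasal-stop check that depends on sub-word position.
# ASSUMPTION: phone labels, place table, and position tags are illustrative.
PLACE = {
    "m": "labial", "p": "labial", "b": "labial",
    "n": "alveolar", "t": "alveolar", "d": "alveolar",
    "ng": "velar", "k": "velar", "g": "velar",
}
NASALS = {"m", "n", "ng"}
STOPS = {"p", "b", "t", "d", "k", "g"}


def violates_homorganic_rule(segments):
    """Return True if a nasal and the stop right after it are both in the
    coda but differ in place of articulation.

    `segments` is a list of (phone, position) pairs for one word.
    """
    for (ph1, pos1), (ph2, pos2) in zip(segments, segments[1:]):
        is_nasal_stop = ph1 in NASALS and ph2 in STOPS
        if is_nasal_stop and pos1 == "coda" and pos2 == "coda":
            if PLACE[ph1] != PLACE[ph2]:
                return True
    return False


# "think": velar nasal + velar stop, both in the coda (rule satisfied).
think = [("th", "onset"), ("ih", "nucleus"), ("ng", "coda"), ("k", "coda")]
# "dreamt": labial nasal + alveolar stop, but the stop is an affix,
# so the rule does not apply.
dreamt = [("d", "onset"), ("r", "onset"), ("eh", "nucleus"),
          ("m", "coda"), ("t", "affix")]
# Hypothetical ill-formed sequence: labial nasal + velar stop in the coda.
ill_formed = [("th", "onset"), ("ih", "nucleus"), ("m", "coda"), ("k", "coda")]

for name, word in [("think", think), ("dreamt", dreamt),
                   ("ill_formed", ill_formed)]:
    print(name, violates_homorganic_rule(word))
```

A recognizer that only sees the phoneme string for "dreamt" would have to either forbid the /m t/ sequence or allow mismatched nasal-stop pairs everywhere; tagging the stop as an affix lets the constraint stay strict inside the coda.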
AREA REFERENCES
C.-H. Lee, F. Soong, and K. Paliwal (eds.), Automatic Speech and
Speaker Recognition: Advanced Topics, Kluwer Academic Publishers,
1996.
J. Perkell and D. Klatt (eds.), Invariance and Variability in Speech
Processes, Lawrence Erlbaum Associates, 1986.
L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition,
Prentice Hall, 1993.
U. Frauenfelder and L. Tyler (eds.), Spoken Word Recognition, MIT Press, 1987.
A. Waibel and K.-F. Lee (eds.), Readings in Speech Recognition, Morgan
Kaufmann, 1990.
RELATED PROGRAM AREAS
1. Virtual Environments.
3. Other Communication Modalities.
4. Adaptive Human Interfaces.
6. Intelligent Interactive Systems for Persons with Disabilities.