Speech Technology and Research Laboratory
SRI International
To use natural language processing (NLP) successfully with speech recognition, the recognizer needs to provide more than just a stream of words. For spontaneous speech in particular, events such as turn and utterance boundaries, disfluencies, and discourse elements need to be identified.
The primary goal of this project is to augment standard speech recognition models to enable recognizers to output sequences annotated for such events. Inspection reveals that these phenomena can be represented as non-overt events occurring between words. For example, turns and utterances can be delimited by events occurring at inter-word junctures. Similarly, disfluencies and discourse elements can be demarcated by events surrounding the words involved. Because such phenomena 1) are not overt in the word sequence; and 2) can be delimited using inter-word event representations, we refer to them collectively as Hidden Word-level Events or simply HWEs.
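The inter-word representation described above can be sketched in a few lines of Python. This is only an illustration: the event labels (`<utt>`, `<disf>`, `</disf>`), the function name `render`, and the convention that juncture i is the gap before word i are assumptions made here, not part of the project's design.

```python
from typing import Dict, List


def render(words: List[str], events: Dict[int, str]) -> str:
    """Interleave hidden events with words.

    Juncture j is the gap before words[j]; juncture len(words) follows
    the last word. `events` maps juncture indices to hypothetical labels.
    """
    for j in events:
        if not 0 <= j <= len(words):
            raise ValueError(f"juncture {j} is outside 0..{len(words)}")
    out: List[str] = []
    for j in range(len(words) + 1):
        if j in events:
            out.append(events[j])   # hidden event at this juncture
        if j < len(words):
            out.append(words[j])    # overt word following the juncture
    return " ".join(out)


# The plain word stream a standard recognizer would output.
words = ["okay", "show", "me", "uh", "flights"]
# Hypothetical hidden events: two utterance boundaries and one filled pause.
events = {1: "<utt>", 3: "<disf>", 4: "</disf>", 5: "<utt>"}
print(render(words, events))
```

Removing every label from the rendered string recovers the original word stream, which is the sense in which the events are hidden.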
The approach is to develop more comprehensive speech models that will allow the automatic recognition and classification of HWEs to occur in tandem with standard word recognition. HWE recognition will be based on a combination of acoustic and language models, extending the standard components found in current systems. New models will capture the specific prosodic characteristics of HWEs, such as intonation and duration patterns. Information from prosodic features will be combined with statistical language models that describe the distribution of HWEs in relation to words, parts-of-speech, and other syntactic and lexical units.
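One simple way to picture the combination of prosodic and language-model information is a weighted log-linear combination at a single inter-word juncture. The sketch below is an assumption made for illustration, not the models the project will develop: the function name `combine`, the `weight` parameter, and the toy probabilities are all hypothetical.

```python
import math
from typing import Dict


def combine(lm: Dict[str, float], prosody: Dict[str, float],
            weight: float = 1.0) -> Dict[str, float]:
    """Posterior over events at one juncture.

    lm[e]      : language-model probability of event e given nearby words
    prosody[e] : prosodic-model probability of event e given pitch/duration
    weight     : hypothetical exponent scaling the prosodic model
    All probabilities must be strictly positive.
    """
    scores = {e: math.log(lm[e]) + weight * math.log(prosody[e]) for e in lm}
    top = max(scores.values())
    # Normalize in a numerically stable way.
    norm = sum(math.exp(s - top) for s in scores.values())
    return {e: math.exp(s - top) / norm for e, s in scores.items()}


# Toy numbers: the words alone favor "no event", the prosody favors a boundary.
lm = {"none": 0.8, "boundary": 0.2}
prosody = {"none": 0.1, "boundary": 0.9}
posterior = combine(lm, prosody)
print(max(posterior, key=posterior.get))
```

With these toy numbers the prosodic evidence outweighs the language model, so the combined decision is a boundary even though the word context alone would not predict one.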
The integrated modeling of HWEs distinguishes the proposed work from past efforts, which have been based on post-processing of word-level information. The integrated approach promises to yield better results than post-processing techniques because it draws on acoustic (prosodic) information in addition to the recognized words. Furthermore, acoustic information may be more reliable than word-level information, which is subject to speech recognition errors.
A second goal of our research is to improve word recognition itself, since the combined word/HWE models are expected to be superior to those used by standard recognizers. For example, with the help of an HWE representation, a recognizer would be able to discount hypotheses for which information from prosodic features is inconsistent with the hypothesized word/HWE sequence.
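The idea of discounting prosodically inconsistent hypotheses can be illustrated by rescoring an N-best list. This is a minimal sketch under stated assumptions: the names `rescore` and `prosody_logprob`, the `weight` parameter, and the scores are hypothetical, and an integrated recognizer would not necessarily work on N-best lists at all.

```python
from typing import Callable, List, Tuple

# A hypothesis is a word/HWE string plus the recognizer's log score.
Hypothesis = Tuple[str, float]


def rescore(nbest: List[Hypothesis],
            prosody_logprob: Callable[[str], float],
            weight: float = 1.0) -> List[Hypothesis]:
    """Add a weighted prosodic log score to each hypothesis and re-rank."""
    rescored = [(text, score + weight * prosody_logprob(text))
                for text, score in nbest]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Toy N-best list: the recognizer slightly prefers the first hypothesis.
nbest = [("i see you know", -10.0), ("i see <utt> you know", -10.5)]
# Hypothetical prosodic scores: a long pause after "see" supports a boundary.
toy_prosody = {"i see you know": -3.0, "i see <utt> you know": -0.5}
best_text, best_score = rescore(nbest, toy_prosody.__getitem__)[0]
print(best_text)
```

Here the hypothesis containing the utterance boundary overtakes the recognizer's original first choice once the prosodic score is added.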
In addition to its use in the development of speech recognition algorithms, automatic HWE labeling has considerable practical potential for speech research in general. The techniques to be developed will enable automatic labeling of large corpora of spontaneous speech, reducing the need for human annotators and benefiting other areas of speech and language research.
E. Shriberg, R. Bates, and A. Stolcke (1997), A Prosody-Only Decision-Tree Model for Disfluency Detection. To appear in Proc. EUROSPEECH, Rhodes, Greece.
A. Stolcke & E. Shriberg (1996), Automatic linguistic segmentation of conversational speech. Proc. Intl. Conf. on Spoken Language Processing, 1005-1008, Philadelphia, PA.
E. Shriberg & A. Stolcke (1996), Word predictability after filled pauses: A corpus-based study. Proc. Intl. Conf. on Spoken Language Processing, 1868-1871, Philadelphia, PA.
A. Stolcke & E. Shriberg (1996), Statistical language modeling for speech disfluencies. Proc. Intl. Conf. on Acoustics, Speech and Signal Processing, 405-409, Atlanta, GA.
Current speech recognition technology is focused on transcribing spoken input into a sequence of words. Natural language processing (NLP), on the other hand, is concerned with the parsing, understanding, and indexing of transcribed utterances and larger linguistic units. In a fully automatic spoken language system, the output of the recognizer typically serves as input to the NLP component. At present, however, a significant gap remains between these two technologies, particularly for the processing of spontaneous speech.
Most NLP techniques have been developed for spoken input resembling read or highly constrained speech. When applied to spontaneous speech, such techniques encounter at least two main difficulties. First, spontaneous speech contains many surface phenomena relating to non-propositional aspects of the input. These include disfluencies (hesitations, repairs, and restarts), discourse markers ("well"), and other elements. Second, in spontaneous speech there is no overtly marked punctuation available for segmenting the input into meaningful units such as utterances. For optimal NLP performance, these types of phenomena should be annotated in the input; current speech recognizers, however, produce only a raw sequence of words.
P. A. Heeman and J. Allen (1994), Detecting and correcting speech repairs. Proc. 32nd Annual Meeting of the Association for Computational Linguistics, 295-302. Las Cruces, NM.
D. J. Litman and R. J. Passonneau (1995), Combining multiple knowledge sources for discourse segmentation. Proc. 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA.
C. H. Nakatani and J. Hirschberg (1994), A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America, 95(3), 1603-1616.
D. O'Shaughnessy (1994), Correcting complex false starts in spontaneous speech. Proc. ICASSP, Vol. I, 349-352. Adelaide, Australia.
D. D. Palmer and M. A. Hearst (1994), Adaptive Sentence Boundary Disambiguation. Proc. Conference on Applied Natural Language Processing. Stuttgart, Germany.
P. Price (1996), Spoken Language Understanding, in R. A. Cole (ed.), Survey of the State of the Art in Human Language Technology, Center for Spoken Language Understanding, Oregon Graduate Institute.
H. Sacks, E. A. Schegloff, and G. Jefferson (1974), A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696-735.