Computer Science Department
University of Rochester
This project focuses on techniques that provide additional information to a natural-language parser beyond the speech recognition output so that it can effectively handle the following problems in spontaneous dialogue: utterance segmentation (i.e., identifying utterance units); self-repair (i.e., identifying and realizing speech repairs); correction of speech recognition errors; and robust surface speech act interpretation from surface structure, lexical cues and intonation.
Our approach will combine traditional speech recognition techniques, simple models of prosodic feature detection, stochastic language models, and traditional parsing techniques. Initial work at Rochester has already shown very promising results in several of these areas in isolation. But to obtain the best performance we will need to integrate the component processes together in a final decision process that considers all factors simultaneously. We will do this by using stochastic techniques to predict likely recognition errors, speech repairs, utterance unit boundaries and prosodic features and then pass this information on to the parser which determines the best interpretation.
The results will be demonstrated in a series of empirical tests in two different settings. First we will run tests on the TRAINS-93 corpus of human-human dialogues. This corpus consists of eight hours of dialogue between two people interacting to solve simple freight scheduling problems. Second, we will incorporate the results into a full end-to-end dialogue system and measure how they improve system performance in a series of evaluation experiments. The system will be a successor to the TRAINS-95 system, which supports robust end-to-end human-computer dialogue for solving simple planning tasks.
Significant progress has been made in two general areas related to modeling spontaneous speech. One concerns the identification of utterance boundaries and other phenomena in spoken dialogue, while the other concerns improving the speech recognition output by modeling the recognition errors that occur in the specific application.
Given speech input, a spoken language understanding system has to reconstruct the speaker's intended utterances: both segmenting the speaker's turn into utterance units and determining the intended words in each utterance. Even assuming perfect word recognition, the latter problem is complicated by speech repairs, which occur where the speaker changes (or repeats) something he or she just said. The words that are replaced or repeated should no longer be part of the intended utterance. These two problems are also strongly intertwined with a third problem: identifying the discourse markers. Lexical items that can function as discourse markers, such as "well" and "okay", are ambiguous as to whether they are introducing an utterance unit, signaling a speech repair, or are simply part of the content of an utterance. It is my thesis that these spontaneous speech phenomena must be resolved very early on in speech processing and in fact need to be incorporated in the language models used by speech recognizers.
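To make the effect of a speech repair concrete, the following minimal sketch removes the replaced words and any editing terms from a word sequence. The annotation format (index triples) and the function name are assumptions made for illustration only; they are not the representation used in the project.

```python
def intended_utterance(words, repairs):
    """Drop the replaced words and editing terms of each annotated repair.

    repairs: list of (start, interruption_point, resume) index triples, where
    words[start:interruption_point] is the replaced text and
    words[interruption_point:resume] holds any editing terms such as "um".
    This triple format is hypothetical.
    """
    removed = set()
    for start, interruption_point, resume in repairs:
        # Everything from the start of the replaced text up to the point
        # where fluent speech resumes is excluded from the intended utterance.
        removed.update(range(start, resume))
    return [w for i, w in enumerate(words) if i not in removed]


words = "take the engine um the boxcar to avon".split()
# "the engine" is replaced by "the boxcar"; "um" is an editing term.
cleaned = intended_utterance(words, [(1, 3, 4)])
print(" ".join(cleaned))
```

The sketch assumes the repair has already been detected; the hard part described above is hypothesizing such spans from the speech itself.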
In the past year, we have made significant progress towards modeling these phenomena in a statistical language model. We have recast our model for detecting and correcting speech repairs so that if a repair is hypothesized, this hypothesis can be used to predict the words that follow the interruption point of the repair.
The difficulty in recasting the detection and correction of speech repairs in the framework of a statistical language model is that we need to estimate probability distributions not just for the occurrence of a word given the previous context, but also for part-of-speech tags, speech repairs, utterance boundaries, editing terms, and speech repair corrections. For estimating the probability of a word given the previous words (as with standard word-based language models), there are well-established techniques for defining equivalence classes of contexts to deal with sparseness of data, such as mixing a bigram model with a trigram model. However, the contexts for estimating the probability distributions in our model are much more involved, and it is not clear how to hand-craft a set of equivalence classes of the contexts, nor how to define more general classes that can be mixed together. Our solution to this problem was to use decision trees, which use information-theoretic techniques to decide how to partition the contexts into equivalence classes.
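As a rough illustration of how an information-theoretic criterion can choose a partition of the contexts, the sketch below picks the yes/no question whose split leaves the lowest weighted entropy over the outcomes. The question names, the context fields, and the toy samples are all assumptions; the actual tree-growing algorithm used in the project is not described here and may differ.

```python
import math
from collections import Counter


def entropy(outcomes):
    """Shannon entropy (in bits) of the empirical distribution over outcomes."""
    total = len(outcomes)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(outcomes).values())


def best_question(samples, questions):
    """Pick the yes/no question whose split leaves the least outcome entropy.

    samples: list of (context, outcome) pairs.
    questions: dict mapping a question name to a predicate over contexts.
    """
    def split_entropy(predicate):
        yes = [out for ctx, out in samples if predicate(ctx)]
        no = [out for ctx, out in samples if not predicate(ctx)]
        # Weighted average entropy of the two resulting equivalence classes.
        return sum(len(part) / len(samples) * entropy(part)
                   for part in (yes, no) if part)

    return min(questions, key=lambda name: split_entropy(questions[name]))


# Toy data (hypothetical): predict the next POS tag from the previous word.
samples = [
    ({"prev_pos": "DT", "prev_word": "the"}, "NN"),
    ({"prev_pos": "DT", "prev_word": "a"}, "NN"),
    ({"prev_pos": "VB", "prev_word": "take"}, "DT"),
    ({"prev_pos": "VB", "prev_word": "move"}, "DT"),
]
questions = {
    "prev_pos_is_DT": lambda c: c["prev_pos"] == "DT",
    "prev_word_is_the": lambda c: c["prev_word"] == "the",
}
chosen = best_question(samples, questions)
print(chosen)
```

Applying the same selection recursively to each side of the chosen split yields a tree whose leaves are the equivalence classes of contexts.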
To use word and POS identities as part of the context from which the decision tree estimates a probability distribution, we grow binary classification trees that cluster the words into hierarchical partitions. The decision tree can then ask questions about which partition a particular word or POS tag is in. Previous approaches to building classification trees (or clusters) for words use on the order of millions of words to build the clusters. However, the TRAINS corpus of spoken dialogues has fewer than 100,000 words. To grow classification trees from a corpus of this limited size, we make use of the POS tags and view word identities as a further refinement of the POS tags. Hence, we grow a classification tree for the POS tags and a word classification tree for each POS tag. Not only are these trees easier to grow, but they also avoid needless fragmentation by the decision tree algorithm, since the word information is consistent with the POS information. With our improved modeling techniques, we are now able to show that modeling the occurrence of speech repairs, boundary tones, and discourse markers improves word perplexity by around 10%. Given a word transcript of a dialogue (thus assuming perfect speech recognition), we also improve POS tagging by 10%, detect 70% of all utterance boundaries with a precision of 70%, detect 76% of all speech repairs with a precision of 85%, and correct 65% of all speech repairs with a precision of 72%.
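The detection figures above pair a detection rate (recall) with a precision. A minimal sketch of how the two measures relate is given below, assuming that hypothesized and reference events are represented as sets of word positions; that representation and the toy numbers are illustrative and are not the project's data.

```python
def recall_precision(hypothesized, reference):
    """Return (recall, precision) for a set of hypothesized event positions.

    recall: fraction of reference events that were detected.
    precision: fraction of hypothesized events that are correct.
    """
    hyp, ref = set(hypothesized), set(reference)
    correct = len(hyp & ref)
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(hyp) if hyp else 0.0
    return recall, precision


# Hypothetical utterance-boundary positions in a short transcript.
reference = {3, 7, 12, 18, 25}
hypothesized = {3, 7, 12, 20}
recall, precision = recall_precision(hypothesized, reference)
print(f"recall={recall:.2f} precision={precision:.2f}")
```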
Spontaneous speech is significantly harder to recognize than read or controlled speech. This problem is further amplified by the fact that we cannot possibly train specialized language and acoustic models for every speech recognition application. We are working on techniques to address this problem by correcting errors in a post-processing phase. We use a statistical process based on a noisy channel model. This also allows us to use a speech recognizer in new domains and under new conditions; the error-correcting post-processor adapts via supervised training.
The Speech Post-Processor (Speech-PP) uses a language model and a channel model to correct errors committed by a continuous speech recognizer. Correction results using a simple channel model describing one-for-one confusions were promising. The channel model was then augmented with an account of word fertility (one-for-two and two-for-one confusions). This allowed Speech-PP to make some improvements in correcting SR errors. Overall, this technique improves word accuracy rates by 5 to 10%.
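The combination of a language model and a one-for-one channel model can be sketched as a Viterbi search for the word sequence that maximizes P(words) * P(observed | words). This is only an illustration of the noisy channel idea: the function name, the bigram form of the language model, the probability floor, and every probability below are made-up assumptions, and the sketch does not handle the fertility (one-for-two and two-for-one) cases.

```python
import math


def correct(observed, channel, bigram, floor=1e-6):
    """Viterbi search for the most probable intended word sequence.

    channel[o] maps an observed word o to {intended word: P(o | intended)};
    all channel probabilities must be greater than zero.
    bigram[(prev, w)] is P(w | prev); "<s>" marks the start of the utterance.
    Unseen bigrams receive the small probability `floor`.
    """
    # best[w] = (log score of the best path ending in w, that path)
    best = {"<s>": (0.0, [])}
    for o in observed:
        # An observed word with no channel entry is assumed to be correct.
        candidates = channel.get(o, {o: 1.0})
        new_best = {}
        for w, p_chan in candidates.items():
            scored = [
                (score + math.log(bigram.get((prev, w), floor))
                 + math.log(p_chan), path)
                for prev, (score, path) in best.items()
            ]
            score, path = max(scored, key=lambda item: item[0])
            new_best[w] = (score, path + [w])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Made-up probabilities: the recognizer sometimes outputs "two" for "to".
channel = {"two": {"to": 0.3, "two": 0.7}}
bigram = {
    ("<s>", "go"): 0.2,
    ("go", "to"): 0.5,
    ("go", "two"): 0.01,
    ("to", "avon"): 0.3,
    ("two", "avon"): 0.001,
}
fixed = correct(["go", "two", "avon"], channel, bigram)
print(" ".join(fixed))
```

In this toy case the language model outweighs the channel model's preference for leaving "two" unchanged, so the corrected sequence reads "go to avon".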
We also augmented Speech-PP to forward SR hypotheses to the TRAINS parser, so as to allow the TRAINS parser to make final correction decisions using probabilistic evidence. We will evaluate this arrangement in the coming months. In addition to Speech-PP, other modules in TRAINS were updated to use frame numbers as indexes. This laid the groundwork for future handling of word lattices throughout the language components of the system.
James Allen, Brad Miller, Eric Ringger, and Teresa Sikorski. "A Robust System for Natural Spoken Dialogue." In Proceedings of the 1996 Meeting of the Association for Computational Linguistics (ACL). Santa Cruz, CA. June 1996.
Heeman, P. A., Loken-Kim, K., and Allen, J. F. (1996). Combining detection and correction of speech repairs. In Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP-96), pages 358--361, Philadelphia. Also appears in International Symposium on Spoken Dialogue, 1996, pages 133--136.
George Ferguson, James Allen, Brad Miller, and Eric Ringger. "The Design and Implementation of the TRAINS-96 System: A Prototype Mixed-Initiative Planning Assistant." TRAINS Technical Note 96-5. UR Computer Science Department. October 1996.
Eric Ringger and James Allen. "Robust Error Correction of Continuous Speech Recognition." In Proceedings of the 1997 ESCA-NATO Workshop on Robust Speech Recognition for Unknown Channels. Pont-a-Mousson, France. April 1997. To appear.
Eric Ringger and James Allen. "A Fertility Channel Model for Post-Correction of Continuous Speech Recognition." In Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP). Philadelphia, PA. October 1996.
Eric Ringger and James Allen. "Error Correction via a Post-Processor for Continuous Speech Recognition." In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Atlanta, GA. May 1996.
Traum, D. R., Schubert, L. K., Poesio, M., Martin, N. G., Light, M., Hwang, C. H., Heeman, P., Ferguson, G., and Allen, J. (1996). Knowledge representation in the TRAINS-93 conversation system. International Journal of Expert Systems, 9(1):173--223.
One of the most promising communication modes for human-machine interaction is spoken natural language. For many tasks, it will be the most intuitive and efficient means of communication. In addition, it makes the machine accessible to users who do not have a sophisticated understanding of computers. Even in applications where other forms of communication (e.g., map displays, charts, menus) play a useful role, spoken language can significantly enhance the usability and efficiency of the interface by providing an intuitive organizational structure to the human-machine dialogue.
Recent progress in the speech recognition field suggests great promise for practical spoken language interfaces within the next decade. But this technology does not yet transfer effectively to many applications, mainly because of the lack of effective methods for accurately recognizing spontaneous speech and an inability to handle phenomena that are common in spontaneous dialogue, such as disfluencies, repairs and utterances that appear ungrammatical from the perspective of a grammar of written English.
James F. Allen, 1995. Natural Language Understanding, Second Edition, Benjamin/Cummings.
Alex Waibel and Kai-Fu Lee, 1990. Readings in Speech Recognition, Morgan Kaufmann.
Virtual Environments
Other Communication Modalities
Adaptive Human Interfaces
Intelligent Interactive Systems for Persons with Disabilities