Spoken language understanding involves two primary component technologies (each covered elsewhere in this volume): speech recognition (SR), and natural language (NL) understanding. The integration of speech and natural language has great advantages: To NL, SR can bring prosodic information (information important for syntax and semantics but not well represented in text); NL can bring to SR additional knowledge sources (e.g., syntax and semantics). For both, integration affords the possibility of many more applications than could otherwise be envisioned, and the acquisition of new techniques and knowledge bases not previously represented. The integration of these technologies presents technical challenges, and challenges related to the quite different cultures, techniques and beliefs of the people representing the component technologies.
In large part, NL research has grown from symbolic systems approaches in computer science and linguistics departments. The desire to model language understanding is often motivated by a desire to understand cognitive processes, and therefore the underlying theories tend to be from linguistics and psychology. Practical applications have been less important than increasing intuitions about human processes. Therefore, coverage of phenomena of theoretical interest (usually the more rare phenomena) has traditionally been more important than broad coverage.
Speech recognition research, on the other hand, has largely been practiced in engineering departments. The desire to model speech is often motivated by a desire to produce practical applications. Techniques motivated by knowledge of human processes have therefore been less important than techniques that can be automatically developed or tuned, and broad coverage of a representative sample is more important than coverage of any particular phenomenon.
There are certainly technical challenges to the integration of SR and NL. However, progress toward meeting these challenges has been slowed by the differences outlined above. Collaboration can be inhibited by differences in motivation, interests, theoretical underpinnings, techniques, tools, and criteria for success. However, both groups have much to gain from collaboration. For the SR engineers, human language understanding provides an existence proof, and needs to be taken into account, since most applications involve interaction with at least one human. For the AI NL researchers, statistical and other engineering techniques can be important tools for their inquiries.
A survey of the papers on SR and NL in the last 5 to 10 years indicates that there is growing interest in the use of engineering techniques in NL investigations. Although the use of linguistic knowledge and techniques in engineering seems to have lagged, there are signs of growth as engineers tackle the more abstract linguistic units. These units are more rare, and therefore more difficult to model by standard, data-hungry engineering techniques.
Evaluation of spoken language understanding systems (see chapter 13) is required to estimate the state of the art objectively. However, evaluation itself has been one of the challenges of spoken language understanding. Work on spoken language understanding in Europe, Japan and the U.S. is surveyed briefly below, and evaluation is discussed in the following section.
Several sites in Canada, Europe and Japan have been researching spoken language understanding systems, including INRS in Canada, LIMSI in France, KTH in Sweden, the Center for Language Technology in Denmark, SRI International and DRA in the UK, and Toshiba in Japan. The five-year ESPRIT SUNDIAL project, which concluded in August 1993, involved several sites and the development of prototypes for train timetable queries in German and Italian and flight queries in English and French. All these systems are described in articles in [Eur93]. The special issue of Speech Communication on Spoken Dialogue [SF94] also includes several system descriptions, including those from NTT, MIT, Toshiba, and Canon.
In the ARPA program, the air travel planning domain has been chosen to support evaluation of spoken language systems [Pal91,Pal92,PDF+92,PFFG90,PFFG93,PFF+94,PFF+95]. Vocabularies for these systems are usually about 2000 words. The speech and language are spontaneous, though fairly planned (since people are typically talking to a machine rather than to a person, and often use a push-to-talk button). The speech recognition utterance error rates in the December 1994 benchmarks were about 13% to 25%. The utterance understanding error rates range from 6% to 41%, although about 25% of the utterances are considered unevaluable in the testing paradigm, so the two sets of figures are not computed over the same set of utterances [Pal91,Pal92,PDF+92,PFFG90,PFFG93,PFF+94,PFF+95]. It may be that for limited domains, these error rates are compatible with many potential applications. Since the rate of conversational repairs in human-human dialogue can often be in the ranges observed for these systems, the limiting factor in applications may be not the error rates so much as the ability of the system to manage and recover from errors.
The benchmarks for spoken language understanding involve spontaneous speech input, usually collected with a real system and sometimes with a human in the loop. The systems are scored in terms of the correctness of the response from the common database of information, which includes flight and fare information. Performing this evaluation automatically requires human annotation to select the correct answer, to define the minimal and maximal answers accepted, and to decide whether the query is ambiguous and/or answerable. The following sites participated in the most recent benchmarks for spoken language understanding: AT&T Bell Laboratories, Bolt Beranek and Newman, Carnegie Mellon University, Massachusetts Institute of Technology, MITRE, SRI International, and Unisys. Descriptions of these systems appear in [ARP95b].
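The role of the minimal and maximal answers in the scoring described above can be illustrated with a minimal sketch. It assumes that a response counts as correct when it contains everything in the minimal answer and nothing outside the maximal answer. That reading, the set-of-strings representation, and the names score_response, minimal and maximal are assumptions made here for illustration; the comparator actually used in the benchmarks is not described in this section.

```python
def score_response(response, minimal, maximal):
    """Classify one system response against annotated reference answers.

    Assumption: answers are modeled as sets of database values, and a
    response is correct when minimal <= response <= maximal (set inclusion).
    """
    if response is None:
        # The system declined to answer the query.
        return "no_answer"
    response = set(response)
    if set(minimal) <= response <= set(maximal):
        return "correct"
    return "incorrect"


# Hypothetical annotation for a query such as "flights from Boston to Denver".
minimal = {"UA101", "DL202"}                     # flight identifiers are required
maximal = {"UA101", "DL202", "08:00", "09:30"}   # departure times are tolerated

print(score_response({"UA101", "DL202", "08:00"}, minimal, maximal))
print(score_response({"UA101"}, minimal, maximal))
print(score_response(None, minimal, maximal))
```

Under these assumptions the first response is accepted, the second is rejected because a required value is missing, and the third is recorded as no answer. Queries annotated as ambiguous or unanswerable would be excluded before this step.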
There is a need to reduce the costs of evaluation, and to improve the quality of evaluations. One limitation of the current methodology is that the evaluated systems must be rather passive since the procedure does not generally allow for responses that are not a database response. This means that the benchmarks do not assess an important component of any real system: its ability to guide the user and to provide useful information in the face of limitations of the user or of the system itself. This aspect of the evaluation also forces the elimination of a significant portion of the data (about 25% in the most recent benchmark). Details on evaluation mechanisms are included in chapter 13. Despite the imperfections of these benchmarks, the sharing of ideas and the motivational aspects of the common benchmarks have yielded a great deal of technology transfer and communication.
The integration of SR and NL in applications is faced with many of the same challenges that each of the components faces: accuracy, robustness, portability, speed, and size, for example. However, the integration also gives rise to some new challenges, including: integration strategies, coordination of understanding components with system outputs, the effective use in NL of a new source of information from SR (prosody, in particular), and the handling of spontaneous speech effects (since people do not speak the way they write). Each of these areas will be described briefly below.
Several mechanisms for communication among components have been explored. There is much evidence that human speech understanding involves the integration of a great variety of knowledge sources, including knowledge of the world or context, knowledge of the speaker and/or topic, lexical frequency, previous uses of a word or a semantically related topic, facial expressions, and prosody, in addition to the acoustic attributes of the words. In SR, tighter integration of components has consistently led to improved performance, and tight integration of SR and NL has been a rather consistent goal. However, as grammatical coverage increases, standard NL techniques can become computationally difficult. Further, with increased coverage, NL tends to provide less constraint for SR.
The simplest approach to integration is to concatenate an existing speech recognition system and an existing NL system. However, this is suboptimal for several reasons. First, it is a very fragile interface, and any errors made by the speech recognition system are propagated to the NL system. Second, the speech system does not then have a chance to take advantage of the more detailed syntactic, semantic and other higher level knowledge sources in deciding what the words are. It is well known that people rely heavily on these sources in deciding what someone has said.
Perhaps the most important reason for the suboptimality of a simple concatenation is the fact that the writing mode differs greatly from the speaking mode. In the written form, people can create more complex sentences than in the spoken form because they have more time to think and plan. Readers have more time than do listeners to think and review, and they have visual cues to help ascertain the structure. Further, most instances of written text are not created in an interactive mode. Therefore, written communications tend to be more verbose than spoken communications. In non-interactive communications, the writer (or speaker in a non-interactive monologue) tries to foresee what questions a reader (or listener) may have. In an interactive dialogue, a speaker can usually rely on the other participant to ask questions when clarification is necessary, and therefore it is possible to be less verbose.
Another important difference between the written and spoken mode is that the spoken mode is strictly linear. A writer can pause for days or months before continuing a thought, can correct typos, can rearrange grammatical constructions and revise the organization of the material presented without leaving a trace in the result the reader sees. In spoken language interactions, every pause, restart, revision and hesitation has a consequence available to the listener. These effects are outlined further in the section below on spontaneous speech.
The differences between speaking and writing are compounded by the fact that most NL work has focussed on the written form. Where spoken language has been considered, except for rare examples such as [Hin83], the work has largely been based on intuitions about the spoken language that would have occurred if not for the noise of spontaneous speech effects. As indicated in the overview, coverage of interesting linguistic phenomena has been a more important goal than testing coverage on occurring samples, written or spoken. More attention has been paid to correct analyses of complete sentences than to methods for recovery of interpretations when parses are incomplete (with the exception of some robust parsing techniques, which still require a great deal more effort before they can be relied on in spoken language understanding systems; see section 3.7).
Because of the differences between speaking and writing, statistical models based on written materials will not match spoken language very well. Because NL analyses have been predominantly based on complete parsing of grammatically correct sentences (based on intuitions of grammaticality of written text), traditional NL analyses often do very poorly when faced with transcribed spontaneous speech. Further, very little work has considered spontaneous speech effects. In sum, simple concatenation of existing modules does not tend to work very well.
To combat the mismatch between existing SR and NL modules, two trends have been observed. The first is an increased use of semantic (as opposed to syntactic) grammars (see section 3.6). Such grammars rely on finding an interpretation without requiring grammatical input (where grammatical may be interpreted either in terms of traditional text-book grammaticality, or in terms of a particular grammar constructed for the task). Because semantic grammars focus on meaning in terms of the particular application, they can be more robust to grammatical deviations. The second observed trend is the n-best interface. In the face of cultural and technical difficulties related to a tight integration, n-best integration has become popular. In this approach, the connection between SR and NL can be strictly serial: one component performs its computation and sends its result to another component, whose result is in turn sent to yet another module. The inherent fragility of the strictly serial approach is mitigated by the fact that SR sends NL not just the best hypothesis from speech recognition, but the n best (where n may be on the order of 10 to 100 sentence hypotheses). The NL component can then score hypotheses for grammaticality and/or use other knowledge sources to determine the best-scoring hypothesis. Frequently, the more costly knowledge sources are saved for this rescoring. More generally, there may be several passes, forming a progressive search in which the search space is gradually narrowed and more knowledge sources are brought to bear. This approach is computationally tractable, and accommodates great modularity of design. The (D)ARPA, ESCA Eurospeech and ICSLP proceedings over the past several years contain several examples of the n-best approach and ways of bringing higher level knowledge sources to bear in SR [DAR90,DAR91,DAR92,ARP93,ARP94,ARP95a,Eur89,Eur91,Eur93,ICS90,ICS92,ICS94].
In addition, the special issue of Speech Communication on Spoken Dialogue [SF94] contains several contributions investigating the integration of SR and NL.
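The n-best interface described above can be sketched in a few lines. In this sketch the recognizer output is assumed to be a list of (sentence, log score) pairs, and the NL knowledge source is a deliberately crude stand-in that counts how many application slots a hypothesis fills, loosely in the spirit of a semantic grammar. The names toy_nl_score, rescore_nbest and nl_weight, the slot inventory, and the scores are all hypothetical; no particular system from the cited proceedings is being reproduced.

```python
def toy_nl_score(sentence):
    """Hypothetical NL knowledge source: count the application slots filled.

    A real system would use a parser or semantic grammar; this stand-in
    only checks for cue words associated with each slot.
    """
    words = set(sentence.lower().split())
    slot_cues = {
        "origin": {"from"},
        "destination": {"to"},
        "object": {"flight", "flights", "fare", "fares"},
    }
    return float(sum(1 for cues in slot_cues.values() if words & cues))


def rescore_nbest(nbest, nl_score, nl_weight=1.0):
    """Re-rank (sentence, sr_log_score) pairs by SR score plus weighted NL score."""
    return sorted(
        nbest,
        key=lambda hyp: hyp[1] + nl_weight * nl_score(hyp[0]),
        reverse=True,
    )


# Hypothetical 3-best list; higher (less negative) SR scores are better.
nbest = [
    ("show me flights from boston two denver", -10.0),
    ("show me flights from boston to denver", -10.5),
    ("show me flights from boston too denver", -11.0),
]

best_sentence, _ = rescore_nbest(nbest, toy_nl_score)[0]
print(best_sentence)
```

With these made-up scores, the second hypothesis overtakes the recognizer's first choice because it fills the destination slot. The connection stays strictly serial: the recognizer never sees the NL score, which is the modularity the n-best approach trades against tighter integration.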
With few exceptions, current research in spoken language systems has focused on the input side; i.e., the understanding of spoken input. However, many if not most potential applications involve a collaboration between the human and the computer. In many cases, spoken language output is an appropriate means of communication that may or may not be taken advantage of. Telephone-based applications are particularly important, since their use in spoken language understanding systems can make access to crucial data as convenient as the nearest phone, and since voice is the natural and (except for the as yet rare video-phones) usually the only modality available. Spoken outputs are also crucial in speech translation. The use of spoken output technologies, covered in more detail in chapter 5, is an important challenge to spoken language systems. In particular, we need reliable techniques to:
Since people tend to be very cooperative in conversation, a system should not output structures it is not capable of understanding. By coordinating inputs and outputs the system can guide the user toward usage better adapted to the particular system. Not doing so can be very frustrating for the user.
Prosody can be defined as the suprasegmental information in speech; that is, information that cannot be localized to a specific sound segment, or information that does not change the segmental identity of speech segments. For example, patterns of variation in fundamental frequency, duration, amplitude or intensity, pauses, and speaking rate have been shown to carry information about such prosodic elements as lexical stress, phrase breaks, and declarative or interrogative sentence form. Prosody consists of a phonological aspect (characterized by discrete, abstract units) and a phonetic aspect (characterized by continuously varying acoustic correlates).
Prosodic information is a source of information not available in text-based systems, except insofar as punctuation may indicate some prosodic information. Prosody can provide information about syntactic structure, it can convey discourse information, and it can also relay information about emotion and attitude. Surveys of how this can be done appear in [PO95,SF94,ESC93].
Functionally, in languages of the world, prosody is used to indicate segmentation and saliency. The segmentation (or grouping) function of prosody may be related more to syntax (with some relation to semantics), while the saliency or prominence function may play a larger role in semantics than in syntax. To make maximum use of the potential of prosody will require tight integration, since the acoustic evidence needs to inform abstract units in syntax, semantics, discourse, and pragmatics.
The same acoustic attributes that indicate much of the prosodic structure (pitch and duration patterns) are also very common in aspects of spontaneous speech that seem to be more related to the speech planning process than to the structure of the utterance. For example, an extra long syllable followed by a pause can indicate either a large boundary that may be correlated with a syntactic boundary, or that the speaker is trying to plan the next part of the utterance. Similarly, a prominent syllable may mean that the syllable is new or important information, or that it replaces something previously said in error.
Disfluencies (e.g., um, repeated words, and repairs or false starts) are common in normal speech. It is possible that these phenomena can be isolated, e.g., by means of a posited edit signal, by joint modeling of intonation and duration, and/or by models that take into account syntactic patterns. However, speech disfluencies are only beginning to be modeled in spoken language systems. Two recent Ph.D. theses survey this topic [Lic94,Shr94].
Disfluencies in human-human conversation are quite frequent, and a normal part of human communication. Their distribution is not random, and in fact may be a part of the communication itself. Disfluencies tend to be less frequent in human-computer interactions than in human-human interactions. However, the reduction in occurrences of disfluencies may be due to the fact that people are as yet not comfortable talking to computers. They may also be less frequent because there is more of an opportunity for the speaker to plan, and less of a potential for interruption. As people become increasingly comfortable with human-computer interactions and concentrate more on the task at hand than on monitoring their speech, disfluencies can be expected to increase. Speech disfluencies are a challenge to the integration of SR and NL since the evidence for disfluencies is distributed throughout all linguistic levels, from phonetic to at least the syntactic and semantic levels.
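To make the simplest of the disfluency types mentioned above concrete, the following sketch removes filled pauses and immediately repeated words from a transcribed word sequence. It is a purely lexical toy: the filled-pause inventory and the function name strip_simple_disfluencies are assumptions, and it uses none of the prosodic or syntactic evidence discussed in the text.

```python
FILLED_PAUSES = {"um", "uh", "er"}  # assumed inventory, not taken from the text


def strip_simple_disfluencies(words):
    """Drop filled pauses and immediate word repetitions from a word list.

    Repairs and false starts ("to boston no to denver") are left untouched,
    and legitimate repetitions ("that that") are wrongly collapsed.
    """
    cleaned = []
    for word in words:
        token = word.lower()
        if token in FILLED_PAUSES:
            continue  # skip the filled pause
        if cleaned and cleaned[-1].lower() == token:
            continue  # skip a repetition of the previously kept word
        cleaned.append(word)
    return cleaned


utterance = "show me um the the flights to uh to boston".split()
print(" ".join(strip_simple_disfluencies(utterance)))
```

The limitations noted in the docstring illustrate why evidence from several linguistic levels is needed: the word string alone does not distinguish a repetition that is a disfluency from one that is intended.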
Although there have been significant recent gains in spoken language understanding, current technology is far from human-like: only systems in limited domains can be envisioned in the near term, and the portability of existing techniques is still rather limited. Application areas that appear to be a good match to technology on the near horizon include those that are naturally limited, for example database access (probably the most popular task across languages). With the rise in cellular phone use, and as rapid access to information becomes an increasingly important economic factor, telephone access to data and telephone transactions will no doubt rise dramatically. Mergers of telecommunications companies with video and computing companies will also no doubt add to the potential for automatic speech understanding.
While such short-term application possibilities are exciting, if we can successfully meet the challenges outlined in previous sections, we can envision an information revolution on a par with the development of writing systems. Spoken language is still the means of communication used first and foremost by humans, and only a small percentage of human communication is written. Automatic spoken language understanding can add to the many benefits of spoken language many of the advantages normally associated only with text: random access, sorting, and access at different times and places. Making this vision a reality will require significant advances in the integration of SR and NL, and, in particular, the ability to better model prosody and disfluencies.