Chapter 1: Spoken Language Input
& Ron Cole
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Spoken language interfaces to computers are a topic that has lured and fascinated engineers and speech scientists alike for over five decades. For many, the ability to converse freely with a machine represents the ultimate challenge to our understanding of the production and perception processes involved in human speech communication. In addition to being a provocative topic, spoken language interfaces are fast becoming a necessity. In the near future, interactive networks will provide easy access to a wealth of information and services that will fundamentally affect how people work, play and conduct their daily affairs. Today, such networks are limited to people who can read and have access to computers---a relatively small part of the population even in the most developed countries. Advances in human language technology are needed for the average citizen to communicate with networks using natural communication skills and everyday devices, such as telephones and televisions. Without fundamental advances in user-centered interfaces, a large portion of society will be prevented from participating in the age of information, resulting in further stratification of society and a tragic loss of human potential.
The first chapter in this survey deals with spoken language input technologies. A speech interface, in a user's own language, is ideal because it is the most natural, flexible, efficient, and economical form of human communication. The following sections summarize spoken input technologies that will facilitate such an interface.
Spoken input to computers embodies many different technologies and
applications, as shown in Figure 1.1. In some cases, as
shown at the bottom of the figure, one is interested not in the underlying
linguistic content, but in the identity of the speaker or the language
being spoken. Speaker recognition can involve identifying a
specific speaker out of a known population, which has forensic
implications, or verifying the claimed identity of a user,
thus enabling controlled access to locales (e.g., a computer room)
and services (e.g., voice banking). Speaker recognition
technologies are addressed in section
.
Language identification also has important applications, and techniques
applied to this area are summarized
in section 8.7.
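The difference between the two speaker recognition tasks can be illustrated with a minimal sketch. It assumes that each speaker is represented by a fixed-length feature vector and that vectors are compared by cosine similarity; the names `identify` and `verify`, the toy profiles, and the threshold value are all hypothetical and are not taken from the systems surveyed in this chapter.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def identify(sample, enrolled):
    """Identification: pick the best-matching speaker from a known population."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))

def verify(sample, claimed, enrolled, threshold=0.9):
    """Verification: accept or reject one claimed identity (threshold is an assumption)."""
    return cosine(sample, enrolled[claimed]) >= threshold

# Hypothetical voice profiles (toy 3-dimensional vectors, not real measurements).
enrolled = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
}
sample = [0.9, 0.1, 0.0]

print(identify(sample, enrolled))       # closest speaker in the known population
print(verify(sample, "bob", enrolled))  # is the claimed identity accepted?
```

Identification always returns some member of the known population, whereas verification makes a yes/no decision about a single claim, which is why it needs a threshold.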
When one thinks about speaking to computers, the first image is usually
speech recognition, the conversion of an acoustic signal to a stream of
words. After many years of research, speech recognition technology is
beginning to pass the threshold of practicality. The last decade
has witnessed dramatic improvement in speech recognition technology, to
the extent that high performance algorithms and systems are becoming
available. In some cases, the transition from laboratory demonstration
to commercial deployment has already begun. Speech input capabilities
are emerging that can provide functions like voice dialing
(e.g., "Call home"), call routing
(e.g., "I would like to make a collect call"),
simple data entry (e.g., entering a credit card number), and preparation
of structured documents
(e.g., a radiology report). The basic issues
of speech recognition, together with a summary of the state-of-the-art,
are described in section
. As these authors point out,
speech recognition involves several component technologies. First,
the digitized signal must be transformed into a set of
measurements. This signal representation issue is elaborated in
section
. Section
discusses techniques that enable the system to achieve robustness in the
presence of transducer and environmental variations, and techniques for
adapting to these variations. Next, the various speech sounds must be
modeled appropriately. The most widespread technique for acoustic
modeling is called hidden Markov modeling (HMM), and is the subject of
section
. The search for the final answer involves the
use of language constraints, which are covered in
section
.
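As a rough illustration of how a hidden Markov model ties measurements to a sequence of speech sounds, the sketch below runs the Viterbi algorithm on a toy model. It makes several simplifying assumptions: the measurements are already quantized into discrete symbols (`x`, `y`), the two states `a` and `b` stand in for sound units, all probabilities are invented, and no language constraints are applied during the search. It is a teaching example, not the method of any particular recognizer.

```python
import math

def to_log(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence and its log probability."""
    # best[s] holds (log probability of the best path ending in s, that path).
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Choose the best predecessor state for s.
            score, path = max(
                ((best[p][0] + log_trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (score + log_emit[s][obs], path + [s])
        best = step
    score, path = max(best.values(), key=lambda item: item[0])
    return path, score

# Toy model: every number below is an illustrative assumption.
states = ["a", "b"]
log_start = to_log({"a": 0.9, "b": 0.1})
log_trans = {"a": to_log({"a": 0.6, "b": 0.4}), "b": to_log({"a": 0.1, "b": 0.9})}
log_emit = {"a": to_log({"x": 0.8, "y": 0.2}), "b": to_log({"x": 0.1, "y": 0.9})}

path, score = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

Working in log probabilities avoids numerical underflow on long utterances. In a full recognizer, a language model would contribute additional scores to the same search.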
Figure 1.1: Technologies for spoken language interfaces.
Speech recognition is a very challenging problem in its own right, with a
well defined set of applications. However, many tasks that lend
themselves to spoken input---making travel arrangements or selecting a
movie---are in fact exercises in interactive problem solving. The
solution is often built up incrementally, with both the user and the
computer playing active roles in the "conversation." Therefore, several
language-based input and output technologies must be developed and
integrated to reach this goal. The remainder of Figure 1.1 shows the
major components of a typical conversational system. The spoken input is first
processed through the speech recognition component. The natural language
component, working in concert with the recognizer, produces a meaning
representation. The final section of this chapter, on spoken language
understanding technology (section
),
discusses the integration of speech recognition
and natural language processing techniques.
For information retrieval applications illustrated in this figure, the meaning representation can be used to retrieve the appropriate information in the form of text, tables and graphics. If the information in the utterance is insufficient or ambiguous, the system may choose to query the user for clarification. Natural language generation and speech synthesis, covered in chapters 4 and 5, respectively, can be used to produce spoken responses that may serve to clarify the tabular information. Throughout the process, discourse information is maintained and fed back to the speech recognition and language understanding components, so that sentences can be properly understood in context.
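The interplay of meaning representation, clarification, and discourse context described above can be sketched as a small loop. The sketch assumes that the natural language component has already reduced each utterance to a dictionary of slots; the slot names, the travel-query frame, and the functions `update_context` and `respond` are hypothetical, chosen only to echo the travel-arrangement example mentioned earlier.

```python
# Hypothetical frame for a travel query; the slot names are assumptions.
REQUIRED_SLOTS = ("origin", "destination", "date")

def update_context(context, meaning):
    """Merge the meaning of a new utterance into the running discourse context."""
    merged = dict(context)
    merged.update({slot: value for slot, value in meaning.items() if value is not None})
    return merged

def respond(context):
    """Retrieve information if the query is complete; otherwise ask the user."""
    missing = [slot for slot in REQUIRED_SLOTS if slot not in context]
    if missing:
        return "clarify", f"Please tell me the {missing[0]}."
    return "retrieve", (
        f"Flights from {context['origin']} to {context['destination']} "
        f"on {context['date']}"
    )

# Turn 1: the utterance leaves out the date, so the system asks for it.
context = update_context({}, {"origin": "Boston", "destination": "Denver"})
first = respond(context)

# Turn 2: a bare "Friday" is understood in context because earlier slots persist.
context = update_context(context, {"date": "Friday"})
second = respond(context)

print(first)
print(second)
```

Keeping the context across turns is what lets a fragmentary reply such as "Friday" complete the earlier request rather than being treated as a new, underspecified query.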