Sharon Oviatt
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
To date, the development of spoken language systems
has been primarily a technology-driven phenomenon. As speech recognition has improved, progress has traditionally been documented
in the reduction of word error rates [PFF+94]. However, reporting word error rate fails to
express the frustration typically experienced by users who
cannot complete a task with current speech technology
[RW93]. Although the successful design of
interfaces is essential to supporting usable spoken language systems, research on human-computer spoken interaction currently represents a gap in our scientific
knowledge. Moreover, this gap is widely recognized as having
generated a bottleneck in our ability to deploy robust speech technology in actual field settings.
Among other challenges, interfaces will be needed that can guide
users' spontaneous speech to coincide with system
capabilities, since spontaneous speech is known to be particularly
variable along a number of linguistic dimensions [CHA+95]. Interface techniques for
successfully constraining spoken input have been studied most
extensively by the telecommunications industry as it strives to
automate operator services [KD91,Spi91].
Such work has emphasized the need for realistic and situated user testing, often in field
settings, and has shown that dramatic variation can occur in the
successful elicitation of target speech depending on the type of
system prompt.
Other research has demonstrated that the principle of linguistic convergence, or the tendency of people's speech patterns to gravitate toward those of their interactive partner, can be employed to guide wordiness, lexical choice, and grammatical structure during human-computer spoken interactions, without imposing any explicit constraints on user behavior [ZF91]. In addition, research has shown that difficult sources of variability in human speech (e.g., disfluencies, syntactic ambiguity) can be reduced two- to eightfold by altering interface parameters [Ovi95,OCW94]. Such work demonstrates the potential impact that interface design can have on managing spoken input, although interface techniques have been underexploited for this purpose. In all of these areas, research has typically involved proactive performance assessment using simulation techniques, which is the preferred method for evaluating systems that are still in the planning stages.
Many basic issues need to be addressed before technology can fully
exploit the natural advantages of speech, including the
speed, ease, spontaneity, and
expressive power that people experience when using it during
human-human communication. For example, research is needed
to evaluate different types of natural spoken dialogue,
spontaneous speech characteristics and their management, and
dimensions of human-computer interactivity that influence
spoken communication. With respect to the latter, research is
especially needed on optimal delivery of system confirmation feedback, error patterns and their resolution, flexible
regulation of conversational control, and management of
users' inflated expectations of the interactional
coverage of spoken language systems. In addition, the
functional role that ultimately is most suitable for speech
technology needs to be evaluated further. Finally, assessment is
needed of the potential usability advantages of multimodal systems incorporating speech over unimodal speech systems, with
respect to breadth of utility, ease of error handling, learnability, flexibility, and overall
robustness [CO94,CHA+95]. To support
all of these research agendas, tools will be needed for building and
adapting high-quality, semiautomatic simulations. Such an
infrastructure can be used to evaluate the critical performance
tradeoffs that designers will encounter as they strive to design more
usable spoken language systems.