¹ CAIP Center, Rutgers University
² Bell Laboratories - Lucent Technologies
³ General Dynamics Electric Boat Company
Natural spoken language is a preferred means for communication among humans. Because of advances in automatic speech recognition, in speech synthesis, and in the computational economies of microelectronics, natural spoken language is emerging as an effective means for human/machine communication. But, as yet, the fundamental understanding supports application only in narrow, specific tasks. This limitation stems in part from a lack of detailed knowledge about how speech signals are generated and how they can be described quantitatively. It also stems from inadequate computational models of language. In both instances, research progress has been hampered by the unavailability of the high-performance computing needed to analyze the basic physics of speech generation. As a result, speech recognizers are fragile in performance and unduly susceptible to interference, and speech synthesizers produce a signal quality that falls far short of human naturalness. New understanding that can diminish these limitations can now be gained through emerging computing capabilities and through the accumulated knowledge of the speech production mechanism.
This research characterizes the generation of speech signals in terms of (a) an articulatory description of the vocal system, and (b) a fluid-dynamic solution to the generation, propagation, and radiation of audible sound produced by the acoustic system. Included is a computation of the speech signal from first principles, using the Navier-Stokes description of fluid flow. Preliminary research has demonstrated the feasibility of the approach. Computational methods are within reach for realistically characterizing the non-linearities involved in voiced-sound generation by the vocal cords, voiceless-fricative generation from turbulent flow at constrictions, and resonance and radiation effects conditioned by sound travel in a non-uniform, lossy, yielding-wall conduit (i.e., the human vocal tract).
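For reference, one common statement of the Navier-Stokes description is the compressible form for a Newtonian fluid with constant viscosity, written here as mass and momentum conservation. The source does not say which form or which simplifications the project uses, so the symbols below (density \rho, velocity \mathbf{u}, pressure p, shear viscosity \mu, bulk viscosity \zeta) are assumptions for illustration. An energy equation or an equation of state would still be needed to close the system.

```latex
\begin{align}
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) &= 0, \\
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} \right)
  &= -\nabla p + \mu \nabla^{2} \mathbf{u}
     + \left( \zeta + \tfrac{1}{3}\mu \right) \nabla (\nabla \cdot \mathbf{u}).
\end{align}
```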
First indications are that replication of the basic physics of sound generation can lead to synthesized speech of improved naturalness, and hence to the possibility of machine voices of higher quality. Also, preliminary indications suggest that articulatory description of speech information can aid robustness to variability, and may lead to speech recognizers that are less susceptible to interference.
This research initially formulates software for computing sound pressures and velocities in a two-dimensional vocal tract, via Navier-Stokes solutions on a dense time-space grid. The articulatory shape is prescribed and, in initial studies, does not vary with time. The results permit characterizing the acoustic interaction between the sound sources and the resonator system, and permit identifying the dominant parameters that condition turbulent flow and chaotic pressure generation. The research will subsequently address three-dimensional, time-varying articulatory shapes. Throughout, interactive auditory assessment of synthesized signals is made in quantitative listening tests. Parsimonious modeling of articulatory shape and vocal dynamics is sought as a new parameterization of speech signals.
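To give a feel for what stepping pressure and velocity forward on a time-space grid involves, here is a deliberately simplified sketch. It is not the project's solver and not a Navier-Stokes computation: it advances the linearized, lossless, one-dimensional acoustic equations in a uniform tube with a staggered-grid leapfrog scheme. The function name `simulate_tube`, the tube length, the grid size, and the air constants (CGS units) are all assumptions chosen for illustration.

```python
import math

def simulate_tube(length_cm=17.5, n_cells=35, c=35000.0, rho=0.00114,
                  courant=0.5, n_steps=4000):
    """Leapfrog update of linear 1D acoustics in a uniform, lossless tube.

    Hypothetical illustration only: closed (glottis) end at x = 0,
    ideal open (lip) end at the far side. Units are CGS (cm, s, g).
    """
    dx = length_cm / n_cells           # grid spacing
    dt = courant * dx / c              # time step from the Courant number
    p = [0.0] * n_cells                # pressure at cell centers
    u = [0.0] * (n_cells + 1)          # particle velocity at cell faces
    lip_velocity = []
    for step in range(n_steps):
        # Glottis end: a velocity impulse on the first step, rigid wall afterwards.
        u[0] = 1.0 if step == 0 else 0.0
        # Momentum equation on interior faces: du/dt = -(1/rho) dp/dx.
        for i in range(1, n_cells):
            u[i] -= dt / (rho * dx) * (p[i] - p[i - 1])
        # Lip end: ideal open end, pressure just outside the tube is zero.
        u[n_cells] -= dt / (rho * dx) * (0.0 - p[n_cells - 1])
        # Continuity equation in every cell: dp/dt = -rho c^2 du/dx.
        for i in range(n_cells):
            p[i] -= rho * c * c * dt / dx * (u[i + 1] - u[i])
        lip_velocity.append(u[n_cells])
    return lip_velocity, dt

if __name__ == "__main__":
    signal, dt = simulate_tube()
    print(len(signal), "samples at time step", dt, "s")
```

A uniform tube closed at one end and open at the other has its first resonance near c/(4L), about 500 Hz for the assumed 17.5 cm length and 350 m/s sound speed, so the impulse response returned here should ring at roughly that frequency. The nonlinear, viscous, and turbulent effects that motivate the Navier-Stokes formulation are exactly what this linear sketch leaves out.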
The expected result is speech synthesis of quality and naturalness surpassing that yet achieved in synthetic machine voices, and a new, potentially robust parameterization of speech for application in speech recognition and low-bit-rate coding.
Chennoukh, S., Sinder, D., Richard, G., Flanagan, J., Voice Mimic System Using Articulatory Codebook for Estimation of Vocal Tract Shape, Proceedings Eurospeech-97, Patras, Greece, September 1997 (in press).
Chennoukh, S., Sinder, D., Richard, G., Flanagan, J., Methods for Acoustic-to-Articulatory Mapping and Voice Mimic Systems, J. Acoust. Soc. Am. v.101, no.5, pt 2, 3179(A), May 1997.
Richard, G., Goirand, M., Sinder, D., Flanagan, J., Simulation and Visualization of Articulatory Trajectories Estimated from Speech Signals, Proceedings International Symposium on Simulation, Visualization and Auralization for Acoustic Research and Education, pp. 431-438, Tokyo, Japan, April 1997.
Sinder, D., Krane, M., Chennoukh, S., Richard, G., Flanagan, J., Levinson, S., Slimon, S., Davis, D., Fluid Dynamic Studies of Speech Production, J. Acoust. Soc. Am. v.101, no.5, pt 2, 3179(A), May 1997.
Sinder, D., Richard, G., Duncan, H., Flanagan, J., Slimon, S., Davis, D., Krane, M., Levinson, S., Flow Visualization in Stylized Vocal Tracts, Proceedings of the International Symposium on Simulation, Visualization and Auralization for Acoustic Research and Education, pp. 439-444, Tokyo, Japan, April 2-4, 1997.
Slimon, S., Davis, D., Levinson, S., Krane, M., Richard, G., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Low Mach Number Flow Through a Constricted, Stylized Vocal Tract, Proceedings American Institute of Aeronautics and Astronautics Conference (AIAA-96), Penn State Univ., PA., May 1996.
Sinder, D., Richard, G., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., Slimon, S., A Fluid Flow Approach to Speech Generation, Proceedings of First ESCA Tutorial and Research Workshop on Speech Production Modeling: From Control Strategies to Acoustics, European Speech Communication Association, Autrans, France, May 1996.
Levinson, S., Flanagan, J., Davis, D., Krane, M., Richard, G., Slimon, S., Kubli, R., Sinder, D., Coker, C., Studying the Effects of Fluid Dynamics on Speech Production, Proceedings International Symposium on Simulation, Visualization and Auralization for Acoustic Research and Education, pp. 31-44, Tokyo, Japan, April 1997.
Flanagan, J., Computer Modeling of Acoustic Systems, International Symposium on Simulation, Visualization and Auralization for Acoustic Research and Education, pp. 225-232, Tokyo, Japan, April 1997.
Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S., Vocal tract simulations based on fluid dynamic analysis, J. Acoust. Soc. Am., vol. 97, no. 2, pt. 2, May 1995.
Richard, G., Liu, M., Sinder, D., Duncan, H., Lin, Q., Flanagan, J., Levinson, S., Davis, D., and Slimon, S., Numerical simulations of fluid flow in the vocal tract, Proceedings EUROSPEECH-95, pp. 1297-1300, Madrid, Spain, 1995.
Flanagan, J., Human Communication by Speech, Encyclopedia of Acoustics, Chapter 124, vol.4, pp.1557-1564, John Wiley & Sons, Inc. (1997).
As computers and their capabilities grow in sophistication, the design of the user interface expands in importance. An important issue is to match the interface to the information processing capacities of the human user. A preferred means for information exchange for humans is natural speech, heightening the incentive for fluent conversation between human and machine. But neither speech synthesis (whereby the machine generates and speaks intelligent messages in a high-quality, natural-sounding voice) nor speech recognition (whereby the machine hears and understands fluent commands) has advanced beyond supporting narrow task-specific functions. In part, this is because of a lack of fundamental understanding of the physics of how speech is generated, and in part because of inadequate computational models of language. In both instances, research progress has been thwarted heretofore by the unavailability of high-performance computing needed to characterize and explore the problems.
Recent exploratory work suggests that, with adequate computation, improved generation of speech can be achieved from detailed computations of the fluid-dynamic description of vocal flow, along with a time-varying articulatory description of the system boundary conditions. That is, a formulation of speech generation based upon the Navier-Stokes description of fluid dynamics covers the conditions of both voiced-sound generation from the vibrating vocal cords and voiceless-fricative generation from turbulent flow. The computations on a dense time-space grid, though enormous, are within reach of now-emerging high-performance computing capabilities.
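A rough back-of-the-envelope estimate shows why a dense time-space grid makes the computation so large. The sketch below assumes an explicit scheme on a uniform two-dimensional grid whose time step is tied to the grid spacing through an acoustic Courant number. Every number in it (tract length and height, 0.1 mm spacing, 350 m/s sound speed, Courant number 0.5, 0.1 s of signal) and the function name `grid_cost` are hypothetical values for illustration, not figures from the project.

```python
import math

def grid_cost(length_m=0.17, height_m=0.02, dx_m=1e-4, c=350.0,
              courant=0.5, duration_s=0.1):
    """Estimate grid size and update count for an explicit 2D simulation.

    All default values are illustrative assumptions.
    """
    nx = round(length_m / dx_m)              # cells along the tract
    ny = round(height_m / dx_m)              # cells across the tract
    dt = courant * dx_m / c                  # time step allowed by the Courant number
    n_steps = math.ceil(duration_s / dt)     # steps needed to cover the duration
    return {
        "nx": nx,
        "ny": ny,
        "dt": dt,
        "n_steps": n_steps,
        "updates": nx * ny * n_steps,        # total cell updates
    }

if __name__ == "__main__":
    for name, value in grid_cost().items():
        print(name, value)
```

With these assumed values the count is on the order of 10^11 cell updates for a tenth of a second of signal, before accounting for the arithmetic per update or for a third spatial dimension.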
We therefore are pursuing a multidisciplinary research approach to the fundamental understanding of speech generation -- one that combines the expertise of fluid dynamics, physical acoustics, linguistics, and signal processing, along with advances in high-performance computing.
Flanagan, J. L. (1972). Speech Analysis, Synthesis and Perception, Springer-Verlag, New York.
Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975). Synthesis of Speech from a Dynamic Model of the Vocal Cords and Vocal Tract, Bell Syst. Tech. J. 54, 485-506.
Thomas, T. J. (1986). A Finite Element Model of Fluid Flow in the Vocal Tract, Computer Speech and Language, 1, 131-151.
Baer, T., Gore, J. C., Gracco, L. C., and Nye, P. W. (1991). Laryngeal Ultrasonography in Infants and Children: Pathological Findings, Pediatr. Radiol., 21, 164-167.
Hardin, J. C. (1992). Overview of Computational Aeroacoustics in Aerodynamics, Lecture Notes, AIAA Professional Study Series Seminar, Reno, NV.
Stevens, K. N. (1971). Airflow and Turbulence Noise for Fricative and Stop Consonants: Static Considerations, J. Acoust. Soc. Am. 50, 1180-1192.
Adaptive Human Interfaces.
Usability and User-Centered Design.
Intelligent Interactive Systems for Persons with Disabilities.
Donald G. Childers (University of Florida): Interactive Model of the Vocal Folds.
James Glass, Stephanie Seneff and Victor Zue (MIT): A Hierarchical Framework for Speech Recognition and Understanding.