Modeling Speech Production: Formant and Articulatory Synthesis

Donald G. Childers

University of Florida
Dept. of Electrical and Computer Engineering

CONTACT INFORMATION

Dept. of Electrical and Computer Engineering
P. O. Box 116130
429 Engr. Bldg. 33
University of Florida
Gainesville, FL 32611-6130

Email: childers@drwho.ee.ufl.edu OR childers@ece.ufl.edu
Tel: 352-392-2633
Fax: 352-392-0044

WWW PAGE

http://www.eel.ufl.edu/~childers/

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Speech synthesis, speech analysis, speech quality, vocal folds.

PROJECT SUMMARY

The project modeled laryngeal function and vocal tract characteristics using parameters extracted from the acoustic signal. This model has proved to be significant for synthesizing speech with various aspects of vocal quality. One purpose of the project was to develop an interactive model of phonatory and resonance characteristics of speech production. With such a model the researcher is able to test hypotheses about vocal quality by "varying" acoustical, anatomical, and physiological parameters of the model. The various features of the model may be validated by evaluating speech that is synthesized using the characteristics of the articulatory model. The advantages of this approach are that it is non-invasive and provides a dynamic model of glottal and vocal tract characteristics, and furthermore, the model may be evaluated and validated using speech synthesis techniques.

Illustrative results include the ability to relate aspects of vocal fold mass and length, aperiodicity of vocal fold motion, and glottal area to vocal quality factors. A unique application of the results is a possible training aid for the hearing impaired. Another potential long-range application would be a new speech coding procedure based on segmenting speech into phonetically related intervals that are synchronized with articulatory movement. This could provide a low-bandwidth, high-quality speech coding scheme.

Summary of Accomplishments

A major accomplishment of the project is the completion of a glottal excited linear prediction (GELP) speech synthesizer that uses a 6th-order polynomial model of the glottal excitation. This model is represented in a GELP codebook. The synthesizer is easy to use and provides nearly perfect reproduction of the original speech. This work has been published; see the list of publications.

We modeled source-tract interaction and various voice types, including modal, vocal fry, and breathy. The voice-type study was statistical and determined relevant model waveform parameters for synthesizing these three voice types and for synthesizing source-tract interaction. The source-tract interaction and voice modeling work has been published as two papers; see the list of publications. Our work on analysis has resulted in 1) a new adaptive WRLS-VFF procedure for locating the instant of glottal closure and 2) an application of our speech synthesis techniques to voice conversion. These results have been published as two papers; see the list of publications.

The articulatory synthesizer is perhaps the primary accomplishment of this project. It is an acoustic model of speech production that includes subglottal coupling, source-tract interaction, glottal impedance, the vocal tract, the nasal tract with two sinus cavities, and acoustic radiation. The major features of this synthesizer have been implemented in an interactive, user-friendly environment that allows the creation and display of synthetic speech. The data displays include the speech waveform and spectrum, the articulatory gestures for speech synthesis, the transfer function of the vocal tract, the pressure and volume-velocity waveforms at selected points in the vocal tract, excitation waveforms, and a noise source for fricatives, plosives, and affricates.

Our implementation of the articulatory synthesizer derives the configuration of the vocal tract by analyzing a previously recorded speech signal. From the given signal we measure the formant frequencies on a frame-by-frame basis. These formants serve as targets, from which we determine the articulatory parameters using a simulated annealing algorithm that minimizes the error distance between the measured formant frequencies and the formant frequencies derived from the articulatory model. The excitation generation phase models the excitation and noise sources for the synthesizer. The synthesis phase describes the articulatory-to-acoustic transformation to generate speech. Various examples of the articulatory speech synthesis system, including the analysis, excitation, and synthesis phases, have been reported in the annual progress reports.

Our articulatory model represents the lower part of the pharynx, the hyoid region, and the tongue-tip-to-jaw region. The vocal tract is represented by up to 60 sections at a sampling frequency of up to 60 kHz. The sagittal grid lines are oriented according to the position of the articulators to provide more reliable estimates of the vocal tract cross-sectional areas. The vocal tract is approximated by a set of non-uniform, lossy, soft-walled, straight tubes with 60 concatenated elemental sections, which may be circular or elliptical. A transmission-line circuit model of the vocal system, which includes the vocal tract, the nasal tract with sinus cavities, the glottal impedance, the subglottal tract, an excitation source, and a turbulence noise source, is included in the synthesis model. Viscous and thermal losses and yielding wall vibration losses are included in the model.
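
The concatenated-tube idea can be illustrated with a much simpler sketch than the model described above. The Python code below is not the project's synthesizer: it assumes lossless, rigid-walled sections, an ideal open termination at the lips, and no nasal or subglottal branches, and the constants C_SOUND and RHO and the function names chain_d and resonances are illustrative choices. It cascades one chain (ABCD) matrix per section and locates the resonances as the frequencies where the D element of the product vanishes.

```python
import math

C_SOUND = 35000.0   # assumed speed of sound in warm, moist air, cm/s
RHO = 1.14e-3       # assumed air density, g/cm^3


def chain_d(areas_cm2, section_len_cm, freq_hz):
    """D element of the cascaded chain (ABCD) matrix from glottis to lips."""
    k = 2.0 * math.pi * freq_hz / C_SOUND
    cos_kl = math.cos(k * section_len_cm)
    sin_kl = math.sin(k * section_len_cm)
    a, b, c, d = 1.0, 0j, 0j, 1.0  # start from the identity matrix
    for area in areas_cm2:
        z0 = RHO * C_SOUND / area  # characteristic impedance of this section
        sb = 1j * z0 * sin_kl
        sc = 1j * sin_kl / z0
        # Right-multiply the running product by this section's matrix
        # [[cos_kl, sb], [sc, cos_kl]].
        a, b, c, d = (a * cos_kl + b * sc, a * sb + b * cos_kl,
                      c * cos_kl + d * sc, c * sb + d * cos_kl)
    return d.real  # D is purely real for a lossless tube


def resonances(areas_cm2, section_len_cm, f_max_hz=5000.0, step_hz=10.0):
    """Frequencies where D = 0, i.e. poles of U_lips / U_glottis when P_lips = 0."""
    found = []
    f_lo = step_hz / 2.0
    d_lo = chain_d(areas_cm2, section_len_cm, f_lo)
    while f_lo + step_hz <= f_max_hz:
        f_hi = f_lo + step_hz
        d_hi = chain_d(areas_cm2, section_len_cm, f_hi)
        if d_lo * d_hi < 0.0:
            # A sign change brackets a root; refine it by bisection.
            lo, hi, d_left = f_lo, f_hi, d_lo
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                d_mid = chain_d(areas_cm2, section_len_cm, mid)
                if d_left * d_mid <= 0.0:
                    hi = mid
                else:
                    lo, d_left = mid, d_mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found
```

For a uniform tube, for example resonances([3.0] * 60, 17.5 / 60), the result should agree with the quarter-wavelength formula F_n = (2n - 1) c / (4 L), which gives roughly 500, 1500, 2500, 3500, and 4500 Hz for the assumed constants. Losses and yielding walls, which the project model includes, would shift and broaden these resonances.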

The non-interactive excitation source includes both jitter and shimmer models along with the LF waveform model. For the interactive excitation source, we developed a new model that consists of a unified glottal excitation model, a subglottal model, and a glottal area model. The subglottal system was modeled by three cascaded RLC Foster circuits. For the turbulence noise source, we adopted a parallel flow source model. The turbulence noise source can be located 1) at the center of, 2) immediately downstream from, 3) upstream from, or 4) spatially distributed along the constriction region. Various examples of this articulatory model have been reported in the last annual progress report.
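
As a rough illustration of what jitter and shimmer do to an excitation signal, the following Python sketch perturbs the period (jitter) and the peak amplitude (shimmer) of each glottal cycle. It is a simplified stand-in, not the project's excitation model: the uniform random perturbations, the raised-cosine pulse used in place of the LF waveform, the open_quotient parameter, and the function names are all assumptions made for this example.

```python
import math
import random


def perturbed_cycles(f0_hz, n_cycles, jitter_pct, shimmer_pct, seed=0):
    """Return (period_s, amplitude) pairs with cycle-to-cycle perturbation."""
    rng = random.Random(seed)
    nominal_period = 1.0 / f0_hz
    cycles = []
    for _ in range(n_cycles):
        # Jitter: random deviation of the pitch period, in percent.
        period = nominal_period * (1.0 + rng.uniform(-1.0, 1.0) * jitter_pct / 100.0)
        # Shimmer: random deviation of the pulse amplitude, in percent.
        amplitude = 1.0 + rng.uniform(-1.0, 1.0) * shimmer_pct / 100.0
        cycles.append((period, amplitude))
    return cycles


def pulse_train(cycles, fs_hz, open_quotient=0.6):
    """Render each cycle as a raised-cosine pulse followed by a closed phase."""
    samples = []
    for period, amplitude in cycles:
        n = max(1, round(period * fs_hz))          # samples in this cycle
        n_open = max(1, round(open_quotient * n))  # samples in the open phase
        for i in range(n):
            if i < n_open:
                pulse = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n_open))
                samples.append(amplitude * pulse)
            else:
                samples.append(0.0)
    return samples
```

A call such as pulse_train(perturbed_cycles(100.0, 50, 1.0, 5.0), 20000.0) yields about half a second of a 100 Hz pulse train with 1% jitter and 5% shimmer. Setting both percentages to zero gives a strictly periodic train for comparison.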

A simulated annealing algorithm was implemented to solve the speech inverse filtering problem. The constraints in the present work are provided by the articulatory-to-acoustic transformation function and the boundary conditions for the articulatory parameters. The articulatory vector defines the set of parameters to be optimized. The cost function is a percentage of the weighted least-absolute-value error distance. We minimize the distance between the first four formant frequencies of the model and the four formant frequencies measured from the target frame of the speech signal. A 1% error criterion was determined to be sufficient to generate natural vocal tract shapes and yet be practical computationally. Once the optimum articulatory vector is obtained, the articulatory model determines the vocal tract cross-sectional area function, which in turn is used by the articulatory speech synthesizer. The excitation phase constructs the excitation waveform model from the pitch contour. Finally, the synthesis phase synthesizes speech using the vocal tract cross-sectional area and the excitation waveform as the input. The vocal tract cross-sectional areas or the articulatory parameters can be interpolated between two consecutive target frames using a linear or arctan function. An example is given in the last annual progress report.
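
The optimization loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: toy_model is a hypothetical stand-in for the articulatory-to-acoustic transformation (a uniform tube whose only parameter is its length), the cost is one plausible reading of "percentage of the weighted least-absolute-value error distance", and the cooling schedule, step size, and function names are arbitrary choices for the example. The search stops once the cost reaches the 1% criterion mentioned in the text.

```python
import math
import random


def formant_cost(model_f, target_f, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted mean absolute formant error, as a percentage of the targets."""
    err = sum(w * abs(m - t) / t for w, m, t in zip(weights, model_f, target_f))
    return 100.0 * err / sum(weights)


def toy_model(params):
    """Hypothetical stand-in model: first four resonances of a uniform tube
    of length params[0] (cm), closed at the glottis and open at the lips."""
    c = 35000.0  # assumed speed of sound, cm/s
    return [(2 * n - 1) * c / (4.0 * params[0]) for n in range(1, 5)]


def anneal(target_f, lower, upper, tol_pct=1.0, seed=0,
           temp0=10.0, cooling=0.95, max_steps=2000):
    """Simulated annealing over a bounded parameter vector."""
    rng = random.Random(seed)
    x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    cost = formant_cost(toy_model(x), target_f)
    best_x, best_cost = list(x), cost
    temp = temp0
    for _ in range(max_steps):
        if best_cost <= tol_pct:
            break  # error criterion met
        # Propose a Gaussian step, clipped to the boundary conditions.
        cand = [min(hi, max(lo, xi + rng.gauss(0.0, 0.05 * (hi - lo))))
                for xi, lo, hi in zip(x, lower, upper)]
        cand_cost = formant_cost(toy_model(cand), target_f)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        if cand_cost < cost or rng.random() < math.exp((cost - cand_cost) / temp):
            x, cost = cand, cand_cost
            if cost < best_cost:
                best_x, best_cost = list(x), cost
        temp *= cooling
    return best_x, best_cost
```

For instance, with target = toy_model([17.5]) and bounds of 10 to 25 cm, anneal(target, [10.0], [25.0]) searches for a tube length whose four resonances match the target within the error criterion. In the real system the parameter vector would hold the articulatory parameters and the model would be the full articulatory-to-acoustic transformation.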

Based on experiments to date we judge the quality of the synthetic speech produced by our articulatory synthesis tool as good. A study of the effects of spatial- and time-domain sampling on the synthesis quality showed that a 20-section vocal tract and a 20 kHz sampling frequency are the minimum required to synthesize speech with perceptually insignificant spectral distortion. The glottal impedance and the subglottal system affect the synthesis quality, as do the shape of the excitation waveform and the location of the turbulence noise source within the tract. In a study of nasalized vowels, we concluded that the maxillary sinus coupling is the major factor for nasality, while the velopharyngeal port opening only controls the onset of nasality. The reproduction of fricatives by the articulatory synthesizer is not yet satisfactory. We have also analyzed the effects of various characteristics of the vocal system on the acoustic transfer function. This analysis provides a basis for helping the user select appropriate parameters for the articulatory synthesizer.

PROJECT REFERENCES AND PUBLICATIONS

Childers, D. G. and Wong, C. F., Measuring and modeling vocal source-tract interaction, IEEE Trans. Biomed. Engr., vol. 41, June, 1994, pp. 663-671.

Childers, D. G. and Hu, H. T., Speech synthesis by glottal linear prediction, J. Acoust. Soc. Am., vol. 96, October, 1994, pp. 2026-2036.

Childers, D. G. and Ahn, C., Modeling the glottal volume-velocity waveform for three voice types, J. Acoust. Soc. Am., vol. 97, January, 1995, pp. 505-519.

Childers, D. G., Principe, J. C., and Ting, Y. T., Adaptive WRLS-VFF for speech analysis, IEEE Trans. Speech and Audio Processing, vol. 3, May, 1995, pp. 209-213.

Childers, D. G., Glottal source modeling for voice conversion, Speech Communication, vol. 16, 1995, pp. 127-138.

Hsiao, Y. S. and Childers, D. G., A new approach to formant estimation and modification based on pole interaction, 30th Asilomar Conference on Signals, Systems, and Computers, November, 1996, 4 pgs.

Lee, M. and Childers, D. G., Manual glottal inverse filtering algorithm, Proceedings of the IASTED International Conference on Signal and Image Processing, November, 1996, pp. 34-37.

Hsiao, Y. S. and Childers, D. G., A modified root-finding formant estimation algorithm based on LP analysis, Proceedings of the IASTED International Conference on Signal and Image Processing, November, 1996, pp. 30-33.

AREA BACKGROUND

Our research deals with the development of models of speech production. This work involves aspects of speech synthesis and speech analysis. We are addressing issues in voice conversion (to change one speaker's voice to sound like that of another) and voice creation (to create new voices as Mel Blanc did for cartoon characters), such as modeling vocal features that are related to a speaker's age, gender, emotional state, dialect, and health. The research results may be helpful for assessing vocal quality, establishing speaker normalization, and understanding aspects of speaker-dependent and speaker-independent speech recognition. We are also modeling vocal fold function to assess its role in vocal quality. We have developed three speech synthesizers that have interactive, graphical user interfaces. We also have an interactive speech analysis software system to measure various aspects of the speech signal.

AREA REFERENCES

P. B. Denes and E. N. Pinson, The Speech Chain, 2nd Edition, W. H. Freeman, 1993 (paperback).

L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Inc., 1978.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

None at this time.