From MRI and Acoustic Data to Articulatory Synthesis IRI-9503089 (Career Development Award)

Abeer Alwan

Department of Electrical Engineering
University of California, Los Angeles

CONTACT INFORMATION

66-147 EIV
405 Hilgard Avenue
Los Angeles, CA 90095
Phone: (310) 206-2231
Fax: (310) 206-4685
Email: alwan@icsl.ucla.edu

WWW PAGE

http://www.ee.ucla.edu/faculty/Alwan.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Speech production models, articulatory data, articulatory synthesis, vocal tract geometry, area functions, tongue shapes, MRI, EPG, fricatives, liquids.

PROJECT SUMMARY

Speech production research aims at an improved understanding and quantitative characterization of the articulatory dynamics, acoustics, and cognition of both normal and pathological human speech. Physiologically and physically motivated production models are also important for developing high-quality speech synthesizers and articulatory-based recognition systems. Such efforts are, however, frequently hampered by the lack of appropriate physical and physiological data. In this study, articulatory data are obtained from Magnetic Resonance Imaging (MRI) and Dynamic Electropalatography (EPG). MRI reveals the 3D geometry of the vocal tract, while EPG is important for studying articulatory dynamics. The study fosters cross-disciplinary activities in Electrical Engineering, Radiology, and Linguistics.

We have gained access to the Medical Imaging Facilities at Cedars Sinai Hospital in Los Angeles. MR images of four phonetically-trained, native American English speakers (2 males and 2 females) were collected and analyzed. Imaging was done in the sagittal, coronal, and axial planes using a GE 1.5 Tesla SIGNA machine (about 3.2 sec/image) with an image slice thickness of 3 mm and no interscan spacing. EPG data were collected at the UCLA phonetics laboratory.

In the last NSF workshop, we reported analysis and modeling results for the eight fricatives of American English [8, 11, 14]. Since then, we have analyzed the articulatory patterns of the liquids [2, 3, 12] and have succeeded in modeling the lateral /l/ [4, 6, 7]. We have also reported imaging results for Tamil liquids [5]. Knowledge of the 3D vocal-tract geometry is essential in characterizing the liquids (/l/ and /r/), since side branches are created during /l/ production and /r/ production is characterized by a large sublingual cavity.

Both dark and light allophones of /l/ and both retroflex and bunched allophones of /r/ were studied. Vocal tract dimensions (lengths, areas, and volumes) were measured from MR images, while EPG data were used for studying inter- and intra-speaker variabilities in linguopalatal contact patterns. Acoustic modeling was based on the MRI-derived vocal-tract area functions and the acoustic spectra of these sounds, and utilized an analog circuit simulator [13].
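As a rough illustration of how lengths and volumes follow from cross-sectional areas, the sketch below assumes contiguous 3 mm slices (as in the imaging protocol above) taken perpendicular to the vocal-tract midline. The function names and the area values are hypothetical and are not measurements from this study.

```python
SLICE_THICKNESS_CM = 0.3  # 3 mm slices with no interscan spacing


def tract_length_cm(slice_areas_cm2, thickness_cm=SLICE_THICKNESS_CM):
    """Length spanned by a stack of contiguous slices."""
    return len(slice_areas_cm2) * thickness_cm


def tract_volume_cm3(slice_areas_cm2, thickness_cm=SLICE_THICKNESS_CM):
    """Volume approximated as the sum of (area x thickness) over all slices."""
    return sum(slice_areas_cm2) * thickness_cm


# Hypothetical cross-sectional areas (cm^2) for four adjacent slices.
example_areas = [1.0, 2.0, 3.0, 2.0]
print(tract_length_cm(example_areas), tract_volume_cm3(example_areas))
```

The list of areas, ordered from glottis to lips, is the area function used as input to acoustic modeling.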

MR images for /l/ indicate that midsagittal tongue contours can differ across subjects. Common characteristics, however, were revealed in 3D tongue shapes, area functions, and linguopalatal contact profiles. This indicates that X-rays, which provide mainly midsagittal profiles, are inadequate for characterizing these sounds. We also observed tongue-shaping mechanisms for the laterals that were invariant across subjects: an alveolar contact, inward-lateral compression, and convex shaping of the dorsum and the posterior tongue body. Medial tongue grooving, which has been hypothesized as a feature for /l/, appears to be a secondary feature and is likely to be affected by anatomical differences. Analysis results of the EPG data were consistent with those of the MRI study.

Articulatory data for /r/ show that the vocal tract is characterized by three cavities due to the presence of two supraglottal constrictions. The primary constriction occurs in the oral cavity, and the secondary constriction in the pharyngeal cavity. The oral constriction may occur anywhere in the palatal region. The invariant feature for /r/ seems to be the existence of a large sublingual cavity anterior to the oral constriction. Inter-subject variabilities were observed in the location of the primary oral constriction and in the way it was formed.

None of our subjects showed a truly retroflexed /r/, suggesting that the extreme form of retroflex /r/ may not be prevalent in American English. Our data also indicate that /r/ tongue shapes belong to a continuum of possible shapes between the two `extreme' configurations, namely, the canonical retroflex and bunched varieties, with a greater tendency towards the bunched configuration in the present data. As a result, the rhotic approximant in American English can be specified by a three-cavity model wherein a more anterior primary oral constriction is associated with a more superior secondary pharyngeal constriction. Furthermore, the vocal tract of a canonical retroflex /r/, which may occur in other English dialects or in other languages, can be treated as a special case of the three-cavity model, wherein the secondary pharyngeal constriction, corresponding to the anteriorly-located tongue tip-up oral constriction, is absent. Evidence supporting these observations was found in the results of an imaging study of Tamil liquids [5].

Modeling: We utilized an analog circuit simulator to generate a flexible articulatory synthesizer using a transmission-line model of the vocal tract. Our synthesizer has several advantages over existing articulatory synthesizers: 1) side branches (needed for modeling nasals and /l/, for example) can be easily simulated by additional transmission lines in parallel, 2) drive-dependent sources can be added at any location, and 3) the number of sections can be varied without changing the sampling rate, which is not the case with most synthesizers. The lumped circuit approximation is valid as long as the cross dimensions are small compared to the wavelength of the sound. Small-signal analysis is used to determine the formant frequencies from the frequency response of the circuit. The transmission-line model, with circuit components calculated from MRI-derived area functions, resulted in excellent predictions of the formant frequencies for /l/.
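The idea of deriving formants from an area function can be sketched in a few lines of Python. This is not the circuit-simulator implementation described above: it is a minimal, lossless chain-matrix calculation in which each tube section is a lumped T-network (series acoustic mass, shunt compliance), the lips are treated as an ideal open end, and no side branches are included. The constants, function names, and the uniform-tube area function are assumptions for illustration only.

```python
import math

RHO = 1.14e-3      # assumed air density, g/cm^3
C_SOUND = 35000.0  # assumed speed of sound, cm/s


def chain_d(freq_hz, areas_cm2, section_len_cm):
    """D element of the cascaded ABCD matrix (glottis to lips).

    With zero pressure at the lips, U_lips / U_glottis = 1 / D,
    so the formants of this lossless model are the zeros of D.
    """
    w = 2.0 * math.pi * freq_hz
    a, b, c, d = 1.0 + 0j, 0j, 0j, 1.0 + 0j
    for area in areas_cm2:
        z = 1j * w * RHO * section_len_cm / area                  # series mass
        y = 1j * w * area * section_len_cm / (RHO * C_SOUND ** 2)  # shunt compliance
        ta = 1.0 + z * y / 2.0          # symmetric T-section: z/2, y, z/2
        tb = z * (1.0 + z * y / 4.0)
        tc, td = y, ta
        a, b, c, d = a * ta + b * tc, a * tb + b * td, c * ta + d * tc, c * tb + d * td
    return d.real  # D is purely real for a lossless ladder


def formants(areas_cm2, section_len_cm, f_max=5000.0, step=10.0):
    """Locate zeros of D by scanning for sign changes, then bisecting."""
    found = []
    f_lo = step
    d_lo = chain_d(f_lo, areas_cm2, section_len_cm)
    while f_lo < f_max:
        f_hi = f_lo + step
        d_hi = chain_d(f_hi, areas_cm2, section_len_cm)
        if d_lo * d_hi < 0:
            lo, hi, dl = f_lo, f_hi, d_lo
            for _ in range(40):
                mid = 0.5 * (lo + hi)
                dm = chain_d(mid, areas_cm2, section_len_cm)
                if dl * dm <= 0:
                    hi = mid
                else:
                    lo, dl = mid, dm
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Hypothetical uniform tube: 35 sections of 0.5 cm (17.5 cm total), 4 cm^2 each.
print([round(f) for f in formants([4.0] * 35, 0.5)])
```

For the uniform tube the lowest resonances fall close to the quarter-wavelength values c/4L, 3c/4L, and 5c/4L, which is a convenient sanity check before substituting a measured area function. A side branch, as needed for /l/, would enter as an additional shunt admittance at the branching section.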

In summary, our novel contributions are: 1) realistic estimates of the vocal-tract geometry during the sustained production of fricatives and liquids were obtained for four speakers (2 males and 2 females), revealing variant and invariant articulatory features for these sounds; 2) an analog circuit simulator was used to model speech production mathematically; and 3) based on the MRI and acoustic data, we were able to accurately model the production of the fricatives and the laterals, illustrating the acoustic contribution of side branches and sublingual spaces.

PROJECT REFERENCES

1. S. Narayanan, A. Alwan, and Y. Song, ``New Results in Vowel Production: MRI and EPG data,'' to appear in the Proceedings of Eurospeech, Patras, Greece, September 1997.

2. S. Narayanan, A. Alwan, and K. Haker, ``Towards articulatory-acoustic models of liquid consonants Part I: The Laterals,'' Journal of the Acoustical Society of America (JASA), Vol. 101, No. 2, pp. 1064-1077, February 1997.

3. A. Alwan, S. Narayanan, and K. Haker, ``Towards articulatory-acoustic models of liquid consonants Part II: The Rhotics,'' JASA, Vol. 101, No. 2, pp. 1078-1089, February 1997.

4. P. Bangayan, A. Alwan, and S. Narayanan, ``A transmission-line model of the lateral approximants,'' Proc. of the Acous. Societies of Amer. and Japan, Vol. 100, No. 4, December 1996.

5. S. Narayanan, A. Kaun, D. Byrd, P. Ladefoged, and A. Alwan, ``Liquids in Tamil,'' Proc. of the Int. Conf. Spoken Lang. Proc. (ICSLP), pp. 797-800, Philadelphia, October 1996.

6. P. Bangayan, A. Alwan, and S. Narayanan, ``From MRI and acoustic data to articulatory synthesis: a case study of the laterals,'' Proc. ICSLP, pp. 793-796, Philadelphia, October 1996 (Invited).

7. P. Bangayan, ``A Transmission-line model of /l/ based on MRI-derived data,'' M.S. thesis, Electrical Engineering Department, UCLA, September 1996.

8. S. Narayanan and A. Alwan, ``Parametric Hybrid Source Models for Voiced and Voiceless Fricative Consonants,'' Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP) 96, Vol. I, pp. 377-380, Atlanta, GA, May 1996.

9. S. Narayanan and A. Alwan, ``Imaging Applications in Speech Production Research,'' SPIE 96 Medical Imaging Conference, 2709, pp. 120-131, Newport Beach, February 96 (Invited).

10. A. Alwan, S. Narayanan, B. Strope, and A. Shen, ``Speech Production and Perception Models and their Applications to Synthesis, Recognition, and Coding,'' Proc. of the Int. Symp. Sig. Sys. and Elec. (ISSSE), pp. 367-372, October 1995 (Invited).

11. S. Narayanan, A. Alwan, and K. Haker, ``An Articulatory Study of Fricative Consonants using MRI,'' JASA, pp. 1325-1364, September 1995.

12. S. Narayanan, A. Alwan, and K. Haker, ``An Articulatory Study of Liquid Consonants in American English,'' Proc. of the Int. Con. of Phon. Sci. (ICPhS), Stockholm, Sweden, Vol. 3, pp. 576-579, August 1995.

13. J. Rael, J. Chang, and A. Alwan, ``A Computationally-Efficient Articulatory Synthesizer,'' JASA, Vol. 97, (5), 3245, May 1995.

14. S. Narayanan, ``Fricative consonants: an articulatory, acoustic, and systems study,'' Ph.D. dissertation, Electrical Engineering Department, UCLA, June 1995.

15. Y. Song, ``Finite time-difference simulations of speech production,'' M.S. thesis, Electrical Engineering Department, UCLA, July 1995.

16. S. Narayanan, A. Alwan, and K. Haker, ``An MRI Study of Fricative Consonants,'' Proc. of ICSLP, Vol. 2, pp. 627-630, Japan, September 1994.

AREA BACKGROUND

In previous studies, information regarding the vocal-tract geometry during speech production was mainly derived from lateral X-ray data [Perkell, 1969]. The main limitations of X-rays include radiation risks and difficulty in accurately deducing the cross-sectional morphology from midsagittal profiles. Current imaging techniques used in speech research include ultrasound imaging [e.g., Stone and Lundberg, 1996], structural and functional MRI, and video fibroscopy and imaging.

The choice of imaging technology is dictated by a number of considerations, such as: 1) the region of the articulatory system under investigation (for example, lip shapes can be easily imaged using video techniques, while tongue-shape analysis requires the use of ultrasound or MRI), and 2) whether qualitative or quantitative analysis is required (for example, ultrasound images are useful if only qualitative analysis of tongue shapes is needed, while MRI data can be used for accurate length, area, and volume measurements of the human vocal tract).

In addition to facilitating accurate measurements of vocal-tract dimensions, MRI does not pose any known radiation risks. The low image sampling rate, however, has restricted MRI use to the study of sustained speech sounds, corresponding to static vocal-tract shapes. It should be noted that fMRI techniques can facilitate imaging of tongue dynamics; these techniques, however, still suffer from low SNR, making them inappropriate for accurate vocal-tract measurements. In addition, the high expense associated with using MRI equipment has restricted its use in speech research. Previous MRI studies have been mostly limited to vowels [e.g., Baer et al., 1991] and nasal consonants [Dang and Honda, 1996], and to imaging only one subject. Our studies provide tongue shapes and area functions of 4 subjects (2 males and 2 females) for a wide range of sounds (vowels, fricatives, and liquids).

AREA REFERENCES

Baer, T., Gore, J.C., Gracco, L.C., and Nye, P.W., ``Analysis of vocal tract shape and dimensions using magnetic resonance imaging: Vowels,'' JASA, 90(2), 799-828, 1991.

Dang, J., and Honda, K., ``Acoustic characteristics of the human paranasal sinuses derived from transmission characteristic measurement and morphological observation,'' JASA, 100(5), 3374-83, 1996.

Perkell, J.S., Physiology of speech production: results and implications of a quantitative cineradiographic study, MIT Press, Cambridge, Mass., 1969.

Stone, M., and Lundberg, A., ``Three-dimensional tongue surface shapes of English consonants and vowels,'' JASA, 99(6), 3728-37, 1996.

Story, B.H., Titze, I.R., and Hoffman, E.A., ``Vocal tract area functions from magnetic resonance imaging,'' JASA, 100(1), 537-54, 1996.

RELATED PROGRAM AREAS

1. Virtual Environments.
3. Other Communication Modalities.
4. Adaptive Human Interfaces.
6. Intelligent Interactive Systems for Persons with Disabilities.

POTENTIAL RELATED PROJECTS

1. Virtual Environments: Uncovering tongue shapes that characterize different speech sounds and the underlying articulatory-to-acoustic relations can result in improvements to current artificial talking heads (computer facial animation/synthesizers).

3. Other Communication Modalities: Articulatory data obtained from imaging techniques facilitate a better understanding of the interplay between acoustic and visual (tongue and facial) cues in speech perception.

4. Adaptive Human Interfaces: Articulatory-based speech recognition systems can benefit from knowing the nature of inter- and intra-speaker variabilities in the articulatory domain.

6. Intelligent Interactive Systems for Persons with Disabilities: Understanding what aspects of the tongue shape are invariant across subjects (for example, grooving along the midsagittal line seems to be an invariant feature for /s/ while for /l/, it is not) can assist speech pathologists and language training specialists.