A STUDY OF VARIABILITY AND A FEATURE-BASED APPROACH
TO SPEECH RECOGNITION

Carol Y. Espy-Wilson

Electrical and Computer Engineering Department
Boston University

CONTACT INFORMATION

8 Saint Mary's Street
Third Floor
Boston, MA 02215
Phone: (617) 353-6521
Fax: (617) 353-6440
Email: espy@formant.bu.edu

WWW PAGE

http://formant.bu.edu/espy-wilson.html

PROGRAM AREA

Speech and Natural Language Understanding

KEYWORDS

Variability, phonetic features, front end, speech recognition, knowledge-based, signal representation, speaker-independence, gender-independence.

PROJECT SUMMARY

Knowledge-based Signal Representation for Speech Recognition

In this project, we made considerable progress in developing a signal representation that consists of acoustic parameters (APs). The APs are exact measures performed on the speech signal or its time-frequency representation to provide evidence for the acoustic correlates of phonetic features. The APs were chosen on the basis of articulatory correlates, past acoustic studies, and our own acoustic analysis. Our philosophy in defining the APs is that they must be relative in time and/or frequency to reduce the effects of interspeaker variability. Such relative measures take into account the relationship between different speech sounds occurring within the same utterance and spoken by the same speaker.

At the 1995 workshop, we reported on a broad-class HMM recognition task in which Mel-cepstral coefficients (MFCCs) were compared with the APs (designed for the manner-of-articulation phonetic features: sonorant, syllabic, noncontinuant and frication). The recognition results showed that the APs are better able to target the phonetic information in the speech signal and reduce speaker-dependent effects. This better performance was obtained with APs developed through histogram analysis, which relies on visual inspection of the data. Recently, we developed an automatic optimization procedure based on the Fisher Criterion and automatic classification trees to replace this time-consuming and subjective component of parameter design and to allow the exploration of many more APs to find the ones that work best. Repeating the recognition experiments mentioned above showed that the optimized APs give results comparable to those obtained with the hand-designed APs.

Presently, we have used the TIMIT database to develop APs that target phonetic features relevant for obstruent consonants: strident, palatal, labial, alveolar, velar. In general, there were between 3 and 6 APs per phonetic feature. However, in most cases, the best AP or the top two APs accounted for 90% or more of the total correct classification.

Classification experiments: Using classification trees obtained from the development stage, the APs were evaluated in classifying the phonetic features on both the development set and an independent test set. The results on the two data sets are comparable, indicating that the developed parameters do target the relevant phonetic information in the speech signal.

General recognition experiment: A comparison of the APs with the traditional cepstral parameters using an HMM system (1 and 8 mixtures) shows that the APs are better able to reduce speaker variability and target the linguistic information in the speech signal. This is deduced by comparing the improvement in going from 1 to 8 mixtures for the APs (20 parameters and their first derivatives) with the more substantial improvement for the cepstral-based parameters (13 parameters and their first and second derivatives). For the APs, the results for 1 and 8 mixtures are 63.7% and 69.4%, respectively, a gain of 5.7 percentage points. For the cepstral parameters, the corresponding results are 54.7% and 70.4%, a gain of 15.7 percentage points.

Gender-specific recognition experiment: In this experiment, the HMM recognizers were trained on speech produced by females and tested on speech produced by males, and vice versa. Compared to the results obtained with the cepstral parameters, the results obtained with the APs are closer to the results obtained when both males and females were used to train and test the system, indicating more robustness to gender variability. The differential in recognition results using the APs is 1.6% (training on male speech) and 4.2% (training on female speech), while it is 3.5% (training on male speech) and 8% (training on female speech) for the cepstral parameters. The larger differential in error when training on female speech may be due to less training data (296 sentences when training on female speech vs. 616 sentences when training on male speech).
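
The role of the Fisher Criterion in ranking candidate APs can be illustrated with a minimal sketch. The exact form of the criterion used in the project is not given here, so the code assumes the common two-class form (squared difference of class means divided by the sum of class variances). The parameter names and toy values are hypothetical and are not taken from the project data.

```python
from statistics import mean, pvariance


def fisher_score(pos, neg):
    """Two-class Fisher criterion (assumed form): squared mean
    separation divided by the sum of the class variances."""
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if mean(pos) != mean(neg) else 0.0
    return (mean(pos) - mean(neg)) ** 2 / spread


def rank_parameters(candidates):
    """Sort candidate parameter names from most to least discriminative.

    candidates maps a name to a pair of lists: values measured on tokens
    that carry the phonetic feature, and values on tokens that do not.
    """
    return sorted(
        candidates,
        key=lambda name: fisher_score(*candidates[name]),
        reverse=True,
    )


# Hypothetical toy measurements for two candidate APs.
candidates = {
    "ap_energy_ratio": ([0.9, 1.1, 1.0, 1.2], [0.1, 0.2, 0.0, 0.3]),
    "ap_noisy": ([0.5, 0.1, 0.9, 0.4], [0.4, 0.6, 0.2, 0.5]),
}
ranking = rank_parameters(candidates)
print(ranking)
```

A ranking of this kind only orders single parameters; how the project combined it with classification trees is not specified above and is not modeled in this sketch.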

An Event-Oriented Approach to Speech Recognition

An event-based broad-class recognition system was developed using the APs for the manner phonetic features. The philosophy behind an event-oriented approach to speech recognition is that the acoustic manifestation of a change in the value of a phonetic feature is marked by specific events in the appropriate APs. Based on our experience, these events either correspond to abrupt acoustic changes (naturally segmenting regions of the waveform) or mark extrema at particular instants in time. In such an approach, it is not necessary to recognize each frame of speech or to divide the speech signal into phone-like units and then recognize these chunks. As before, the task was to recognize speech as a sequence of the manner classes: syllabic, sonorant consonant, fricative, and noncontinuant. The 504 SI sentences from the TIMIT test set were used for testing. The same task was carried out by an HMM system using the APs as the front end (some had to be modified to provide information every frame). The results show that the event-based system performs comparably to the traditional HMM system with 8 mixtures.
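
The two kinds of events described above (abrupt changes and extrema in an AP track) can be located with a very simple sketch such as the following. This is an illustration under stated assumptions, not the detector used in the project: the function name, the frame-difference threshold, and the toy track are all hypothetical.

```python
def find_events(track, jump_threshold):
    """Return a sorted list of (frame_index, kind) events for one AP track.

    'abrupt' marks a frame whose value differs from the previous frame
    by at least jump_threshold; 'peak' and 'valley' mark strict local
    extrema. Both rules are assumptions made for illustration.
    """
    events = []
    # Abrupt changes: large frame-to-frame differences.
    for i in range(1, len(track)):
        if abs(track[i] - track[i - 1]) >= jump_threshold:
            events.append((i, "abrupt"))
    # Extrema: frames strictly above or below both neighbors.
    for i in range(1, len(track) - 1):
        if track[i] > track[i - 1] and track[i] > track[i + 1]:
            events.append((i, "peak"))
        elif track[i] < track[i - 1] and track[i] < track[i + 1]:
            events.append((i, "valley"))
    return sorted(events)


# Hypothetical AP track: low region, sharp onset, a peak, sharp offset.
track = [0.0, 0.1, 0.1, 2.0, 2.1, 2.4, 2.2, 2.0, 0.2, 0.1]
events = find_events(track, jump_threshold=1.0)
print(events)
```

Only the frames flagged as events would need to be examined further, which is the sense in which such an approach avoids classifying every frame.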

Study of Speech Variability

We took a two-pronged approach to study the extent to which phonological rules stated in the literature describe the types of variability occurring in the TIMIT database. First, we investigated phonetic variability by comparing the phonetic transcription of the TIMIT database to phonemic transcriptions from an on-line dictionary. Second, to capture forms of variability not represented in the TIMIT transcriptions, we performed a preliminary investigation of acoustic variability by analyzing errors that occurred in our broad-class recognition task.

Phonetic variability study: In this study, the TIMIT test set of 1680 sentences and an on-line dictionary of 150,000 words were used. For each word, the TIMIT and dictionary transcriptions were aligned and discrepancies were evaluated. Some of the phonological processes investigated were 1) flapping, 2) stop deletion, 3) homorganic stop insertion, 4) ruh reduction, 5) nasal deletion and geminate reduction and 6) palatalization. The frequency of occurrence was computed for each variability type and the results were grouped according to dialect region and gender. There were 8,406 instances of manner variability in the 14,553 words. Dialectal differences in consonant variability were small except for /r/ deletions and word-final stop deletions. New Englanders had a much higher rate of /r/ deletions (10.9% vs. 5.3% overall) and Northerners had a lower rate of stop deletions (47% vs. 52% overall). The utterances of the male speakers showed more variability than those of the female speakers. The most predictable and systematic forms of variability were stop deletion and flapping. The coronals /d/ and /t/ were deleted most, consistent with their unmarked status in phonological theory. Although 34% of all stops are word final, word-final stop deletions accounted for 74.2% of the total number of stops deleted. All instances of flapping occurred in an intersonorant context, and almost all were in a falling lexical stress environment. The sounds /t/ and /d/ were flapped much more often than /n/; /rt/ and /rd/ were flapped more often than /rn/; and /nt/ was flapped more often than /nd/.

Acoustic variability study: 70% of the errors from a broad-class recognition task were consistent with known forms of variability. For example, 4.5% of the 2,876 fricatives were recognized as sonorant consonants. Our analysis showed that of these 129 tokens, 82% were the voiced fricatives (/v/, /dh/ or /z/). This "misclassification" is reasonable since voiced fricatives can be produced with a weakened constriction so that no turbulent noise is generated and they are manifested as sonorant consonants. In fact, our data also showed that voiced stop consonants and voiced affricates can be produced with a weakened constriction so that they too are sometimes realized as sonorants.
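
The alignment step of the phonetic variability study (comparing a dictionary transcription with the TIMIT transcription of the same word and tallying discrepancies) can be sketched as follows. The alignment method actually used is not described above; this sketch assumes Python's standard `difflib.SequenceMatcher` as a stand-in, and the two word entries are hypothetical toy examples rather than items taken from the database.

```python
from collections import Counter
from difflib import SequenceMatcher


def discrepancies(dictionary_phones, surface_phones):
    """Align two phone sequences and return the non-matching spans as
    (operation, dictionary_span, surface_span) tuples."""
    matcher = SequenceMatcher(
        a=dictionary_phones, b=surface_phones, autojunk=False
    )
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            found.append(
                (tag, tuple(dictionary_phones[i1:i2]), tuple(surface_phones[j1:j2]))
            )
    return found


# Hypothetical entries: (dictionary transcription, transcribed realization).
words = {
    "butter": (["b", "ah", "t", "er"], ["b", "ah", "dx", "er"]),  # /t/ flapped
    "west": (["w", "eh", "s", "t"], ["w", "eh", "s"]),            # final stop deleted
}

counts = Counter()
for word, (canonical, surface) in words.items():
    for operation, _, _ in discrepancies(canonical, surface):
        counts[operation] += 1
print(dict(counts))
```

Grouping such counts by dialect region and gender would then only require keying the counter on speaker metadata as well as on the operation.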

Coarticulatory Stability in American English /r/s

Several researchers have reported a substantial degree of variability in how American English /r/ coarticulates with neighboring segments. In this project, acoustic and articulatory data were used to investigate this variability for seven speakers of "rhotic" American English dialects. Three issues were addressed: (1) the degree to which the major acoustic manifestation of American English /r/ (i.e., the time course of F3) reflects tongue movement for /r/, (2) the degree to which the /r/-related F3 trajectory is affected by segmental context and stress, and (3) the extent to which the data support a "coproduction" vs. a "spreading" model of coarticulation. The /r/-related F3 trajectory durations were measured by an automatic procedure and compared across stress conditions for nonsense words of the form /'waCrav/ and /waC'rav/, where C indicates a labial, alveolar or velar consonant. These durations were compared to F3 trajectory durations in control words /'warav/ and /wa'rav/ and to the /r/-related F3 trajectory durations in real words such as "Africa." Results indicated similar F3 trajectory durations in all words containing /r/, across stress and consonant contexts. This acoustic consistency supports the coproduction model, in which coarticulation (of /r/) is achieved by overlap of a stable articulatory movement trajectory with articulatory movement for neighboring sounds. This interpretation, and the concordance of the F3 time course with tongue movement for /r/, were supported by direct measures of tongue movement for one subject.

PROJECT REFERENCES

Bitar, N. & Espy-Wilson, C.Y. (1997) "The Design of Acoustic Parameters for Speaker-independent Speech Recognition", to appear Proc. of Eurospeech, Patras, Greece, September.

Boyce, S. and Espy-Wilson, C.Y. (1997). "Coarticulatory Stability in American English /r/s", Journal of the Acoustical Society of America, June, pp. 3741-3753.

Espy-Wilson, C.Y. and Boyce S., (1996). "Coarticulatory stability of American English /r/", International Conference on Spoken Language Processing, October 3-6, Philadelphia, PA.

Espy-Wilson, C.Y. & Bitar, N. (1996), "A Knowledge-Based Signal Representation for Speech Recognition," ICASSP '96, May 7-10, Atlanta, GA.

Bitar, N. & Espy-Wilson, C.Y. (1995), "Knowledge-Based vs. Cepstral-Based Parameters for Broad-Class HMM Speech Recognition," IEEE Workshop on Speech Recognition, December.

Paneras, D., Bitar, N., & Espy-Wilson, C.Y. (1995), "Speech Variability in the TIMIT Database", 130th Meeting of the Acoustical Society of America, November.

Bitar, N. & Espy-Wilson, C.Y. (1995), "Speech Parameterization Based on Phonetic Features: application to speech recognition," Eurospeech-95, Madrid, Spain, September.

Bitar, N. & Espy-Wilson, C.Y. (1995), "A Signal Representation of Speech Based on Phonetic Features," Proc. of the 1995 IEEE Dual-Use Technologies & Applications Conference, May.

AREA BACKGROUND

Linguists have proposed a set of 20 or so binary distinctive features which comprise a phonological description of the speech sounds in all languages. Use of phonetic features for recognition is motivated by (1) spectrogram reading experiments which show that phonetic information is represented in the speech signal, (2) cognition studies which assert that the human lexicon is organized in terms of phonetic features, (3) psychoacoustic studies which show that phonetic features play an important role in human perception and (4) variability studies which show that large acoustic changes are often the result of a change in only one or two phonetic features.

Research into recognition systems based on the explicit extraction of linguistic information has suffered in comparison to probabilistic approaches such as HMMs. However, a variety of efforts have been made over the past several years to combine speech knowledge and probabilistic frameworks. Furthermore, several advances have been made in recent years that suggest that another look into feature-based recognition is warranted. These advances include an improved understanding of the phonetic features and the relations between them, a better idea of the acoustic correlates of the features, and the development of theories of hierarchical structures for the representation of lexical items in terms of phonetic features.

In addition to these recent gains, the use of phonetic features for recognition is desirable because it provides a framework for handling and understanding variability, a major stumbling block in the development of recognition systems that achieve human performance.

AREA REFERENCES

Boyce, S. E., R. A. Krakow, F. Bell-Berti, and C. Gelfer. (1990). "Converging sources of evidence for dissecting articulation into core gestures," Journal of Phonetics, vol. 18, pp. 173-188.

Chomsky, N., and Halle, M. (1968). The Sound Pattern of English, New York: Harper and Row.

Clements, G.N. (1985). "The geometry of phonological features," Phonology Yearbook, vol. 2, pp. 225-252.

Fant, G. (1960). Acoustic Theory of Speech Production, The Hague: Mouton.

Jakobson, R., Fant, G., and Halle, M. (1952). "Preliminaries to speech analysis," MIT Acoustics Lab. Tech. Rep. No. 13.

Lahiri, A. and Marslen-Wilson, W. (1991), "The mental representation of lexical form: A phonological approach to the recognition lexicon," Cognition, 38(3):245-294.

Stevens, K. (1995). "Applying phonetic knowledge to lexical access," Proc. of Eurospeech'95, vol. 1, pp. 3-10.

Stevens, K.N. and Keyser, S.J. (1989). "Primary features and their enhancement in consonants," Language, vol. 65, pp. 81-106.

Zue, V. (1985). "The use of speech knowledge in automatic speech recognition," Proc. IEEE vol. 73, pp. 1602-1615.

RELATED PROGRAM AREAS

Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities

POTENTIAL RELATED PROJECTS

Given the mapping between phonetic features, acoustics and articulation, the knowledge-based speech signal representation could possibly be used to help identify articulatory problems in the speech of people having difficulty, whether the difficulty is due to a speech impairment, hearing loss or the learning of a second language.