A STUDY OF VARIABILITY AND A FEATURE-BASED APPROACH
TO SPEECH RECOGNITION
Carol Y. Espy-Wilson
Electrical and Computer Engineering Department
Boston University
CONTACT INFORMATION
8 Saint Mary's Street
Third Floor
Boston, MA 02215
Phone: (617) 353-6521
Fax: (617) 353-6440
Email: espy@formant.bu.edu
WWW PAGE
http://formant.bu.edu/espy-wilson.html
PROGRAM AREA
Speech and Natural Language Understanding
KEYWORDS
Variability, phonetic features, front end, speech recognition,
knowledge-based, signal representation, speaker-independence,
gender-independence.
PROJECT SUMMARY
Knowledge-based Signal Representation for Speech Recognition
In this project, we made considerable progress in the development of a
signal representation that consists of acoustic parameters (APs). The
APs are exact measures performed on the speech signal or its
time-frequency representation to provide evidence for the acoustic
correlates of phonetic features. The APs were decided upon on the
basis of articulatory correlates, past acoustic studies, and our
acoustic analysis. Our philosophy in defining the APs is that they
must be relative in time and/or frequency to reduce the effects of
interspeaker variability. Such relative measures take into account
the relationship between different speech sounds occurring within the
same utterance and spoken by the same speaker.
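To illustrate what a relative measure means here, the following minimal sketch expresses frame energy in dB relative to the maximum within the same utterance. It is an assumption-laden example, not one of the project's actual APs: the function name `relative_energy_db` and the toy energy values are hypothetical.

```python
import math

def relative_energy_db(frame_energies):
    """Express each frame energy in dB relative to the utterance maximum.

    Hypothetical illustration of a measure that is relative within an
    utterance; it is not one of the APs defined in the project.
    """
    peak = max(frame_energies)
    if peak <= 0:
        raise ValueError("utterance needs at least one frame with positive energy")
    floor = peak * 1e-10  # guard against log10(0) on silent frames
    return [10.0 * math.log10(max(e, floor) / peak) for e in frame_energies]

# The same utterance spoken quietly and 100 times louder (toy values).
quiet = [1.0, 4.0, 2.0]
loud = [100.0 * e for e in quiet]

print(relative_energy_db(quiet))
print(relative_energy_db(loud))
```

Because every frame is normalized by a reference taken from the same utterance, an overall change in level leaves the measure unchanged, which is the kind of speaker-dependent effect a relative measure is meant to remove.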
At the 1995 workshop, we reported on a broad-class HMM recognition
task where Mel-cepstral coefficients (MFCCs) and the APs (designed for
the manner-of-articulation phonetic features: sonorant, syllabic,
noncontinuant and frication) were compared. The recognition results
showed that the APs are better able to target the phonetic information
in the speech signal and reduce speaker-dependent effects. This
better performance was obtained with APs developed using histogram
analysis that relies on visual inspection of the data. Recently, we developed
an automatic optimization procedure based on the Fisher Criterion and
automatic classification trees to replace this time-consuming and
subjective component in parameter design and to allow the exploration
of many more APs to find the one that works best. Repeating the
recognition experiments mentioned above showed that the optimized APs give
results that are comparable to those obtained with the hand-designed
APs.
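As a rough sketch of how a Fisher-style criterion can rank candidate parameters, the example below scores a single parameter by the squared difference of the class means divided by the sum of the class variances. This is the common two-class, one-dimensional form and is an assumption; the summary does not state the exact form used in the project, and the classification-tree step is not shown. The names `fisher_score` and `rank_parameters` and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(class_a, class_b):
    """Two-class, one-dimensional Fisher ratio for a single parameter.

    Assumed form: (mean_a - mean_b)^2 / (var_a + var_b).
    """
    separation = (mean(class_a) - mean(class_b)) ** 2
    spread = pvariance(class_a) + pvariance(class_b)
    if spread == 0:
        return float("inf") if separation > 0 else 0.0
    return separation / spread

def rank_parameters(candidates):
    """Sort candidate parameter names from most to least discriminative.

    candidates maps a name to (values for class A, values for class B).
    """
    return sorted(candidates,
                  key=lambda name: fisher_score(*candidates[name]),
                  reverse=True)

# Toy measurements for two hypothetical candidate parameters.
candidates = {
    "ap_poor": ([1.0, 3.0, 2.0], [1.5, 2.5, 2.0]),
    "ap_good": ([1.0, 1.2, 0.8], [3.0, 3.1, 2.9]),
}
print(rank_parameters(candidates))
```

A score like this can be computed for many candidate parameters at once, which is what makes an automatic search over a large set of APs practical compared with inspecting histograms by hand.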
Presently, we have used the TIMIT database to develop APs that target
phonetic features relevant for obstruent consonants: strident,
palatal, labial, alveolar, velar. In general, there were between 3
and 6 APs per phonetic feature. However, in most cases, the best AP or
the top two APs resulted in 90% or more of the total correct
classification.
Classification experiments: Using classification trees obtained from
the development stage, the APs were evaluated in classifying the
phonetic features on both the development set and an independent test
set. The results on the two sets are comparable, indicating that
the developed parameters do target the relevant phonetic information
in the speech signal.
General recognition experiment: A comparison of the APs with the
traditional cepstral parameters, using an HMM system (1 and 8 mixtures),
shows that the APs are better able to reduce speaker variability and
target the linguistic information in the speech signal. This is deduced by
comparing the improvement in results going from 1 to 8 mixtures in the
case of the APs (20 parameters and their first derivative) to the more
substantial improvement in the case of the cepstral-based parameters
(13 parameters and their first and second derivatives). In the case
of the APs, the results for 1 and 8 mixtures are 63.7% and 69.4%,
respectively. In the case of the cepstral parameters, the
corresponding results are 54.7% and 70.4%.
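The size of the two improvements can be made explicit with a short calculation on the figures reported above. The variable and function names are illustrative only.

```python
# Recognition results reported above: (1 mixture, 8 mixtures), in percent.
results = {
    "APs": (63.7, 69.4),
    "cepstral": (54.7, 70.4),
}

def mixture_gain(one_mixture, eight_mixtures):
    """Improvement in percentage points when going from 1 to 8 mixtures."""
    return round(eight_mixtures - one_mixture, 1)

gains = {name: mixture_gain(*scores) for name, scores in results.items()}
print(gains)  # {'APs': 5.7, 'cepstral': 15.7}
```

The APs gain 5.7 points from the additional mixtures while the cepstral parameters gain 15.7 points, so the cepstral front end depends much more heavily on the extra mixture components to absorb variability.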
Gender-specific recognition experiment: In this experiment, the HMM
recognizers were trained on speech produced by females and tested on
speech produced by males, and vice versa. Compared to the results
obtained with the cepstral parameters, the results obtained with the
APs are closer to the results obtained when both males and females
were used to train and test the system, indicating more robustness to
gender variability. The differential in recognition results using the
APs is 1.6% (training on male speech) and 4.2% (training on female
speech) while it is 3.5% (training on male speech) and 8% (training
on female speech) for the cepstral parameters. The larger
differential in error when training on female speech may be due to
less training data (296 sentences when training on female speech vs.
616 sentences when training on male speech).
An Event-Oriented Approach to Speech Recognition
An event-based broad class recognition system was developed using the APs
for the manner phonetic features. The philosophy behind an
event-oriented approach to speech recognition is that the acoustic
manifestation of a change in the value of a phonetic feature is marked
by specific events in the appropriate APs. Based on our experience,
these events correspond to either abrupt acoustic changes (naturally
segmenting regions of the waveform), or they mark extrema at
particular instants in time. In such an approach, it is not necessary
to recognize each frame of speech or to divide the speech signal into
phone-like units and then recognize these chunks.
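The two kinds of events described above, abrupt changes and extrema, can be sketched on a single parameter contour as follows. This is a simplified illustration under stated assumptions, not the project's event detector: the function `find_events`, the fixed `jump_threshold`, and the toy contour are all hypothetical.

```python
def find_events(contour, jump_threshold):
    """Locate candidate events in a frame-by-frame parameter contour.

    Returns a sorted list of (frame_index, kind) where kind is
    "abrupt_change", "maximum", or "minimum". Hypothetical sketch only.
    """
    events = []
    # Abrupt changes: large jump between consecutive frames.
    for i in range(1, len(contour)):
        if abs(contour[i] - contour[i - 1]) >= jump_threshold:
            events.append((i, "abrupt_change"))
    # Extrema: strict local maxima and minima.
    for i in range(1, len(contour) - 1):
        prev, cur, nxt = contour[i - 1], contour[i], contour[i + 1]
        if cur > prev and cur > nxt:
            events.append((i, "maximum"))
        elif cur < prev and cur < nxt:
            events.append((i, "minimum"))
    return sorted(events)

# Toy contour: low region, sudden rise, then a gentle peak.
contour = [0.0, 0.1, 0.2, 5.0, 5.1, 5.3, 5.1, 5.0]
print(find_events(contour, jump_threshold=2.0))
# [(3, 'abrupt_change'), (5, 'maximum')]
```

Only the frames where something happens are passed on to later stages, which is why such a system does not need to label every frame or pre-segment the signal into phone-like units.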
As before, the task was to recognize speech as a sequence of the
manner classes: syllabic, sonorant consonant, fricative, and
noncontinuant. The 504 SI sentences from the TIMIT test set were used
for testing. The same task was carried out by an HMM system using the
APs as the front end (some had to be modified to provide information
every frame). The results show that the event-based system performs
comparably to the traditional HMM system when 8 mixtures were used.
Study of Speech Variability
We took a two-pronged approach to study the extent to which
phonological rules stated in the literature
describe the types of variability occurring in the TIMIT
database. First, we investigated phonetic variability by comparing the
phonetic transcription of the TIMIT database to phonemic
transcriptions from an online dictionary. Second, to capture forms of
variability not represented in the TIMIT transcriptions, we performed
a preliminary investigation of acoustic variability by analyzing
errors that occurred in our broad-class recognition task.
Phonetic variability study: In this study, the TIMIT test set of 1680
sentences and an on-line dictionary of 150,000 words were used. For
each word, the TIMIT and dictionary transcriptions were aligned and
discrepancies were evaluated. Some of the phonological processes
investigated were 1) flapping, 2) stop deletion, 3) homorganic stop
insertion, 4) ruh reduction, 5) nasal deletion and geminate reduction
and 6) palatalization. The frequency of occurrence was computed for
each variability type and the results were grouped according to
dialect region and gender.
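The alignment step can be illustrated with Python's standard `difflib.SequenceMatcher`, which is used here as a stand-in; the summary does not say which alignment method the study used. The function `discrepancies` and the example transcriptions are hypothetical, with phone labels written in a TIMIT-like style.

```python
from difflib import SequenceMatcher

def discrepancies(dictionary_phones, observed_phones):
    """Align two phone sequences and return the differing segments.

    Each item is (tag, dictionary_segment, observed_segment), where tag is
    "replace", "delete", or "insert" as reported by difflib.
    """
    matcher = SequenceMatcher(None, dictionary_phones, observed_phones,
                              autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag,
                          tuple(dictionary_phones[i1:i2]),
                          tuple(observed_phones[j1:j2])))
    return diffs

# Hypothetical example: /t/ realized as a flap (written "dx").
print(discrepancies(["b", "ah", "t", "er"], ["b", "ah", "dx", "er"]))
# [('replace', ('t',), ('dx',))]

# Hypothetical example: word-final stop deletion.
print(discrepancies(["w", "eh", "s", "t"], ["w", "eh", "s"]))
# [('delete', ('t',), ())]
```

Counting such tuples per word, and grouping the counts by speaker attributes, gives frequency tables of the kind described above.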
There were 8,406 instances of manner variability in the 14,553 words.
Dialectal differences in consonant variability were small except for
/r/ deletions and word-final stop deletions. New Englanders had a
much higher rate of /r/ deletions (10.9% vs. 5.3% overall) and
Northerners had a lower rate of stop deletions (47% vs. 52%
overall). The utterances of the male speakers showed more variability
than those of the female speakers. The most predictable and
systematic forms of variability were stop deletion and flapping. The
coronals /d/ and /t/ were deleted most, consistent with their unmarked
status in phonological theory. Although 34% of all stops are word
final, word-final stop deletions accounted for 74.2% of the total
number of stops deleted. Flapped consonants were always in an
intersonorant context and almost always in a falling lexical
stress environment. The sounds /t/ and
/d/ were flapped much more often than /n/; /rt/ and /rd/ were flapped
more often than /rn/; and /nt/ was flapped more often than /nd/.
Acoustic variability study: 70% of the errors from a broad-class
recognition task were consistent with known forms of variability. For
example, 4.5% of the 2,876 fricatives were recognized as sonorant
consonants. Our analysis showed that of these 129 tokens, 82% were
the voiced fricatives (/v/, /dh/ or /z/). This "misclassification"
is reasonable since voiced fricatives can be produced with a weakened
constriction so that no turbulent noise is generated and they are
manifest as sonorant consonants. In fact, our data also showed that
voiced stop consonants and voiced affricates can also be produced with
a weakened constriction so that they too are sometimes realized as
sonorants.
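The token counts behind these percentages can be checked with a short calculation. The count of voiced fricatives computed below is derived from the reported percentages and is approximate; it is not a figure reported in the study.

```python
fricatives = 2876        # total fricative tokens reported above
confused_share = 0.045   # recognized as sonorant consonants
voiced_share = 0.82      # share of those that were /v/, /dh/ or /z/

confused_tokens = round(fricatives * confused_share)
voiced_tokens = round(confused_tokens * voiced_share)

print(confused_tokens)  # 129, matching the reported token count
print(voiced_tokens)    # 106 (derived estimate)
```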
Coarticulatory Stability in American English /r/s
Several researchers have reported a substantial degree of variability in how
American English /r/ coarticulates with neighboring segments. In this
project, acoustic and articulatory data were used to investigate this
variability for seven speakers of "rhotic" American English dialects.
Three issues were addressed: (1) the degree to which the major
acoustic manifestation of American English /r/ (i.e., the time course
of F3) reflected tongue movement for /r/, (2) the degree to which the
/r/-related F3 trajectory is affected by segmental context and stress,
and (3) to what extent the data support a "coproduction" vs. a
"spreading" model of coarticulation. The /r/-related F3 trajectory
durations were measured by an automatic procedure and compared across
stress conditions for nonsense words of the form /'waCrav/ and
/waC'rav/, where C indicates a labial, alveolar or velar consonant.
These durations were compared to F3 trajectory durations in control
words /'warav/ and /wa'rav/ and to the /r/-related F3 trajectory
durations in real words such as "Africa." Results indicated similar
F3 trajectory durations in all words containing /r/, across stress and
consonant contexts. This acoustic consistency supports the
coproduction model in which coarticulation (of /r/) is achieved by
overlap of a stable articulatory movement trajectory with articulatory
movement for neighboring sounds. This interpretation, and the
concordance of F3 time course with tongue movement for /r/, was
supported by direct measures of tongue movement for one subject.
PROJECT REFERENCES
Bitar, N. and Espy-Wilson, C.Y. (1997). "The Design of
Acoustic Parameters for Speaker-independent Speech Recognition," to appear
in Proc. of Eurospeech, Patras, Greece, September.
Boyce, S. and Espy-Wilson, C.Y. (1997).
"Coarticulatory Stability in American English /r/s," Journal
of the Acoustical Society of America, June, pp. 3741-3753.
Espy-Wilson, C.Y. and Boyce, S. (1996).
"Coarticulatory stability of American English /r/," International
Conference on Spoken Language Processing, October 3-6, Philadelphia, PA.
Espy-Wilson, C.Y. and Bitar, N. (1996). "A Knowledge-Based
Signal Representation for Speech Recognition," ICASSP '96,
May 7-10, Atlanta, GA.
Bitar, N. and Espy-Wilson, C.Y. (1995). "Knowledge-Based
vs. Cepstral-Based Parameters for Broad-Class HMM Speech Recognition,"
IEEE Workshop on Speech Recognition, December.
Paneras, D., Bitar, N., and Espy-Wilson, C.Y. (1995).
"Speech Variability in the TIMIT Database," 130th Meeting of the
Acoustical Society of America, November.
Bitar, N. and Espy-Wilson, C.Y. (1995). "Speech
Parameterization Based on Phonetic Features: application to speech
recognition," Eurospeech-95, Madrid, Spain, September.
Bitar, N. and Espy-Wilson, C.Y. (1995). "A Signal
Representation of Speech Based on Phonetic Features," Proc. of
the 1995 IEEE Dual-Use Technologies & Applications Conference, May.
AREA BACKGROUND
Linguists have proposed a set of 20 or so binary distinctive features which
comprise a phonological description of the speech sounds in all
languages. Use of phonetic features for recognition is motivated by
(1) spectrogram reading experiments which show that phonetic
information is represented in the speech signal, (2) cognition studies
which assert that the human lexicon is organized in terms of phonetic
features, (3) psychoacoustic studies which show that phonetic
features play an important role in human perception and (4)
variability studies which show that large acoustic changes are often
the result of a change in only one or two phonetic features.
Research into recognition systems based on the explicit extraction of
linguistic information has suffered in comparison to
probabilistic approaches such as HMMs. However, a variety of efforts
have been made over the past several years to combine speech knowledge
and probabilistic frameworks. Furthermore, several advances have been
made in recent years that suggest that another look into feature-based
recognition is warranted. These advances include an improved
understanding of the phonetic features and the relations between them,
a better idea of the acoustic correlates of the features and the
development of theories of hierarchical structures for the
representation of lexical items in terms of phonetic features.
In addition to these recent gains, the use of phonetic features for
recognition is desirable because it provides a framework for handling
and understanding variability, a major stumbling block in the
development of recognition systems that achieve human performance.
AREA REFERENCES
Boyce, S.E., Krakow, R.A., Bell-Berti, F., and Gelfer, C. (1990).
"Converging sources of evidence for dissecting articulation
into core gestures," Journal of Phonetics, vol. 18, pp. 173-188.
Chomsky, N. and Halle, M. (1968). The Sound Pattern of
English, New York: Harper and Row.
Clements, G.N. (1985). "The geometry of phonological
features," Phonology Yearbook, vol. 2, pp. 225-252.
Fant, G. (1960). Acoustic Theory of Speech Production,
The Hague: Mouton.
Jakobson, R., Fant, G., and Halle, M. (1952). "Preliminaries
to speech analysis," MIT Acoustics Lab. Tech. Rep. No. 13.
Lahiri, A. and Marslen-Wilson, W. (1991). "The mental representation
of lexical form: A phonological approach to the recognition lexicon,"
Cognition, vol. 38, no. 3, pp. 245-294.
Stevens, K. (1995). "Applying phonetic knowledge to lexical
access," Proc. of Eurospeech'95, vol. 1, pp. 3-10.
Stevens, K.N. and Keyser, S.J. (1989). "Primary features and
their enhancement in consonants," Language, vol. 65, pp. 81-106.
Zue, V. (1985). "The use of speech knowledge in automatic
speech recognition," Proc. IEEE, vol. 73, pp. 1602-1615.
RELATED PROGRAM AREAS
Adaptive Human Interfaces, Intelligent Interactive
Systems for Persons with Disabilities
POTENTIAL RELATED PROJECTS
Given the mapping between phonetic features, acoustics and
articulation, the knowledge-based speech signal representation can
possibly be used to help identify articulatory problems in the
speech of people having difficulty, whether it is due to a speech
impairment, hearing loss, or the learning of a second language.