Oregon Graduate Institute of Science and Technology
The project aims to use information sub-streams derived from relatively long segments of speech (on the order of several hundred ms). Such sub-streams may consist, e.g., of components of the modulation spectrum of speech. Since the components of the modulation spectrum are not all equally important, and may be affected differently by non-linguistic components of speech [1], they may prove effective in multi-stream ASR.
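As an illustration of what a modulation-spectrum component is, the following minimal sketch computes the magnitude spectrum of the temporal trajectory of one feature over a 500 ms segment. The frame rate, the segment length, the synthetic trajectory and the function name modulation_spectrum are assumptions made for this example and are not taken from the project.

```python
import cmath
import math


def modulation_spectrum(trajectory, frame_rate_hz):
    """Return (frequencies_hz, magnitudes) of the DFT of a mean-removed trajectory."""
    n = len(trajectory)
    mean = sum(trajectory) / n
    centered = [x - mean for x in trajectory]
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        freqs.append(k * frame_rate_hz / n)
        mags.append(abs(coeff) / n)
    return freqs, mags


# Assumed analysis setup: one feature value every 10 ms, 500 ms segment.
FRAME_RATE_HZ = 100.0
N_FRAMES = 50

# Synthetic trajectory (hypothetical data): a strong 4 Hz modulation
# plus a weaker 24 Hz modulation.
trajectory = [
    math.sin(2 * math.pi * 4.0 * t / FRAME_RATE_HZ)
    + 0.3 * math.sin(2 * math.pi * 24.0 * t / FRAME_RATE_HZ)
    for t in range(N_FRAMES)
]

freqs, mags = modulation_spectrum(trajectory, FRAME_RATE_HZ)
peak_hz = freqs[max(range(len(mags)), key=mags.__getitem__)]
print("dominant modulation frequency (Hz):", peak_hz)
```

With a 500 ms segment the modulation-frequency resolution is 2 Hz, which is why segments of several hundred ms are needed before individual modulation components can be separated at all.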
[1] H. Hermansky: Should Recognizers Have Ears, Keynote Paper, Proc. ESCA/NATO Workshop on Speech Recognition for Unknown Communication Channels, Pont-à-Mousson, France, April 1997.
[2] H. Hermansky et al.: Towards ASR on partially corrupted speech, Proc. ICSLP-96, Philadelphia, October 1996.
[3] S. Tibrewala and H. Hermansky: Sub-band based recognition of noisy speech, Proc. ICASSP-97, Munich, April 1997.
[4] H. Hermansky et al.: Compensation for the effect of the communication channel in auditory-like analysis of speech, Proc. Eurospeech 91, Genova, Italy, 1991.
[5] H. Hermansky and N. Morgan: RASTA Processing of Speech, IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 4, pp. 578-589, 1994.
The multi-stream approach to automatic speech recognition (ASR) uses several, preferably independent, information sub-streams derived from a speech communication channel to extract the linguistic information from the channel. Non-linguistic components of the signal, such as noise, may affect the individual information sub-streams differently. Information sub-streams may consist, e.g., of speech filtered through various band-pass filters (the multi-band ASR paradigm [2]). The multi-band paradigm was demonstrated to be effective in ASR of speech affected by frequency-selective noise [3].
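To make concrete why separate sub-streams help when noise hits only some of them, the following minimal sketch merges per-stream class posteriors with reliability weights. The combination rule (a weighted geometric mean), the weights, the posterior values and the function name combine_streams are all assumptions chosen for illustration; the text above does not specify how the streams are merged.

```python
import math


def combine_streams(stream_posteriors, weights):
    """Weighted geometric mean of per-stream class posteriors, renormalized."""
    n_classes = len(stream_posteriors[0])
    log_scores = [0.0] * n_classes
    for posteriors, weight in zip(stream_posteriors, weights):
        for c, p in enumerate(posteriors):
            # Floor the probability to avoid log(0).
            log_scores[c] += weight * math.log(max(p, 1e-12))
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical posteriors over two classes from three frequency bands.
# Bands 0 and 1 are clean and favor class 0; band 2 is hit by
# frequency-selective noise and wrongly favors class 1.
streams = [
    [0.80, 0.20],
    [0.70, 0.30],
    [0.05, 0.95],
]

equal = combine_streams(streams, [1 / 3, 1 / 3, 1 / 3])
noisy_band_ignored = combine_streams(streams, [0.5, 0.5, 0.0])

print("equal weights:     ", equal)
print("noisy band ignored:", noisy_band_ignored)
```

In this toy case the single corrupted band is enough to flip the decision when all bands are weighted equally, while removing its weight restores the decision of the clean bands. A full-band recognizer has no such option, because the noise contaminates its single feature stream.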
Information from syllable-length segments of speech makes it possible to explore medium-length correlations between short-term features of speech. These correlations help to distinguish components of the signal attributed to speech from components attributed to some important types of non-linguistic information in the signal. This principle has been used successfully, e.g., in RASTA processing of speech [4, 5].
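The idea can be sketched with a RASTA-style band-pass filter applied along time to the trajectory of one log-spectral feature. The sketch below uses the commonly cited RASTA transfer function in causal form, y[n] = 0.98 y[n-1] + 0.1 (2 x[n] + x[n-1] - x[n-3] - 2 x[n-4]). The pole value (published variants differ), the 100 Hz frame rate, the synthetic trajectory and the function name rasta_filter are assumptions made for this example.

```python
import math


def rasta_filter(trajectory, pole=0.98):
    """Causal RASTA-style band-pass filter along time for one trajectory.

    y[n] = pole * y[n-1] + 0.1 * (2x[n] + x[n-1] - x[n-3] - 2x[n-4])
    Samples before the start of the trajectory are treated as zero.
    """
    numerator = (0.2, 0.1, 0.0, -0.1, -0.2)  # coefficients sum to zero: no DC gain
    output = []
    previous = 0.0
    for n in range(len(trajectory)):
        fir = sum(
            b * trajectory[n - k]
            for k, b in enumerate(numerator)
            if n - k >= 0
        )
        previous = pole * previous + fir
        output.append(previous)
    return output


FRAME_RATE_HZ = 100.0  # assumed: one frame every 10 ms
N_FRAMES = 600

# Hypothetical log-energy trajectory with a speech-like 4 Hz modulation.
speech = [
    math.sin(2 * math.pi * 4.0 * t / FRAME_RATE_HZ) for t in range(N_FRAMES)
]
# A fixed channel appears as a constant offset in the log-spectral domain.
CHANNEL_OFFSET = 5.0
shifted = [x + CHANNEL_OFFSET for x in speech]

y_clean = rasta_filter(speech)
y_shifted = rasta_filter(shifted)

residual = abs(y_shifted[-1] - y_clean[-1])
late_peak = max(abs(v) for v in y_shifted[-100:])
print("remaining effect of the channel offset:", residual)
print("late amplitude of the 4 Hz modulation: ", late_peak)
```

After the start-up transient has decayed, the constant offset contributed by the channel is removed while the 4 Hz modulation passes almost unchanged. The long time constant of the filter is what ties this kind of processing to segments of several hundred ms.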