
Chapter 10: Transmission and Storage

10.1 Overview

Isabel Trancoso
Instituto de Engenharia de Sistemas e Computadores, Lisbon, Portugal
and Instituto Superior Tecnico, Lisbon, Portugal

This chapter is devoted to two closely linked areas of speech processing: coding and enhancement. Both have been active areas of research for many years, motivated on the one hand by the increasing need for speech compression for band-limited transmission and storage, and on the other by the need to improve the intelligibility of speech contaminated by noise.

In an age when the word gigabit has become common in discussions of channel or disk capacity, the aim of compression is not clear to everyone, and one needs to justify it by describing the myriad of new applications demanding fewer and fewer bits per second, as well as the rapidly expanding corpora.

Until the late seventies, research in speech compression followed two different directions: vocoders (an abbreviation of voice coders) and waveform coders. The two approaches differ substantially in their underlying principles and performance. Whereas the former exploit our knowledge of speech production, attempting to represent the spectral envelope of the signal in terms of a small number of slowly varying parameters, the latter aim at a faithful reproduction of the signal in either the time or the frequency domain. They also represent two opposite choices in the trade-off among the four main dimensions of speech coding performance: bit rate, speech quality, algorithm complexity and communication delay. Vocoders achieve considerable bit rate savings at the cost of quality degradation, being aimed at bit rates below 2 to 4 kbps [Tre82]. For waveform coders, on the other hand, preserving the quality of the synthesized speech is the prime goal, which demands bit rates well above 16 kbps [JN84]. For an excellent overview of the main speech coding activities at the end of that decade, see [F79].

The next decade saw an explosion of work on speech coding, although most of the new coders could hardly be classified according to the waveform-coder/vocoder distinction. This new generation of coders overcame the limitations of the dual-source excitation model typically adopted by vocoders. Complex prediction techniques were adopted, the masking properties of the human ear were exploited, and it became technologically feasible to quantize parameters in blocks (vector quantization, VQ) instead of individually, and to use computationally complex analysis-by-synthesis procedures. CELP [SA85], multi-pulse [AR82] and regular-pulse [KDS86] excitation methods are some of the best-known new-generation coders in the time domain, whereas in the frequency domain one should mention sinusoidal/harmonic [AS84,MQ86] and multi-band excited coders [GL88]. Variants of these coders have been standardized for transmission at bit rates ranging from 13 down to 4.8 kbps, and special standards have also been derived for low-delay applications (LD-CELP) [Che91]. (See also [ACG91] and [FS91] for collections of extended papers on some of the most prominent coding methods of this decade.)
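To make the idea of quantizing parameters in blocks concrete, the following is a minimal sketch of a vector quantizer, not the scheme of any particular coder cited above. The codebook values and the function names (nearest_codeword, vq_encode, vq_decode) are hypothetical and chosen only for illustration; a real codebook would be trained on speech parameters.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector`
    under squared Euclidean distance."""
    def distance(codeword):
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def vq_encode(vectors, codebook):
    """Replace each parameter vector by the index of its nearest codeword."""
    return [nearest_codeword(v, codebook) for v in vectors]


def vq_decode(indices, codebook):
    """Look up the codeword for each transmitted index."""
    return [codebook[i] for i in indices]


# Hypothetical 4-entry codebook for 2-dimensional parameter vectors:
# each pair of parameters is sent as one 2-bit index.
codebook = [(0.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 0.5)]
indices = vq_encode([(0.9, 1.2), (0.1, -0.1)], codebook)
reconstructed = vq_decode(indices, codebook)
```

With a codebook of 4 entries, each block of two parameters costs 2 bits in total, rather than a separate allocation per parameter; the price is the exhaustive search over the codebook at the encoder.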

Nowadays, the standardization effort in the cellular radio domain that motivated this peak of coding activity is not so visible, and the research community is seeking new avenues. The quality that can be achieved with the so-called telephone bandwidth (3.2 kHz) is no longer sufficient for a wide range of new applications demanding wide-band speech or audio coding. At these bandwidths (5 to 20 kHz), waveform coding techniques of the sub-band and transform coding type have traditionally been adopted for high bit rate transmission. The need for 8-to-64 kbps coding is pushing the use of techniques such as linear prediction at these higher bandwidths, even though such techniques are typically associated with telephone speech. The demand for lower bit rates at telephone bandwidth is, however, far from exhausted. New directions are being pursued to cope with the needs of the rapidly evolving digital telecommunication networks. Promising results have been obtained with approaches based, for instance, on articulatory representations, segmental time-frequency models, sophisticated auditory processing, and models of the uncertainty in the estimation of speech parameters. The current efforts to integrate source and channel coding are also worthy of mention.

Although the main use of speech coding so far has been transmission, speech encoding procedures based on Huffman coding of prediction residuals have lately become quite popular for the storage of large speech corpora.
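The principle behind such storage-oriented procedures can be sketched as follows. This is an illustrative toy, assuming a fixed first-order predictor (each sample is predicted by the previous one) and a Huffman code built from the residuals of the same signal; the function names are hypothetical, and practical corpus compressors use more elaborate predictors and code tables.

```python
import heapq
from collections import Counter


def prediction_residuals(samples):
    """First-order fixed predictor: predict each sample as the previous one
    (the first sample is predicted as 0) and keep the prediction error."""
    residuals, previous = [], 0
    for s in samples:
        residuals.append(s - previous)
        previous = s
    return residuals


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(symbols)
    if len(counts) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(samples):
    """Return the bit string and the code table needed to decode it."""
    residuals = prediction_residuals(samples)
    code = huffman_code(residuals)
    return "".join(code[r] for r in residuals), code


def decode(bits, code):
    """Invert the prefix code, then undo the prediction by accumulation."""
    inverse = {v: k for k, v in code.items()}
    samples, previous, current = [], 0, ""
    for b in bits:
        current += b
        if current in inverse:
            previous += inverse[current]
            samples.append(previous)
            current = ""
    return samples


samples = [0, 1, 2, 3, 4, 5, 5, 5, 4, 3]
bits, code = encode(samples)
restored = decode(bits, code)
```

Because the residuals of a slowly varying signal cluster around a few small values, the Huffman code assigns them short codewords, and the decoder recovers the samples exactly: the scheme is lossless, which is what makes it attractive for archiving corpora.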

The last part of this chapter covers speech enhancement, an area closely related to both coding and recognition. The goal of speech enhancement is to increase quality and/or intelligibility for a broad spectrum of applications by (partly) removing the noise that overlaps with the speech signal in both time and frequency. The first noise-suppression techniques using only one microphone adopted single-filter approaches, either of the spectral-subtraction type or based on MAP or MMSE estimators. In the last few years, several pattern matching techniques have been proposed, neural networks have become quite popular as well, and a number of robust parameterization methods and better metrics have emerged to improve the recognition of noisy speech. Multiple-microphone approaches can also be adopted in several applications. For an extended overview of enhancement methods, see [LO79] and [Bol91].
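As an illustration of the spectral-subtraction idea, the sketch below processes a single frame: an estimate of the noise magnitude spectrum is subtracted from the magnitude spectrum of the noisy frame, negative results are clipped to a floor, and the noisy phase is reused for resynthesis. It is a minimal sketch under simplifying assumptions (a naive DFT, no windowing or overlap-add, and a noise magnitude estimate supplied by the caller); the function names and the floor parameter are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(noisy_frame, noise_magnitude, floor=0.0):
    """Subtract a noise magnitude estimate bin by bin, keep the noisy phase."""
    cleaned = []
    for value, noise in zip(dft(noisy_frame), noise_magnitude):
        magnitude = max(abs(value) - noise, floor)  # clip negative magnitudes
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)


# Toy example: a tone in bin 2 corrupted by a weaker tone in bin 5.
n = 16
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noise = [0.3 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
noisy = [c + v for c, v in zip(clean, noise)]

# Assumption: the noise magnitude spectrum is known exactly here; in practice
# it would be estimated, for example from frames that contain no speech.
noise_magnitude = [abs(x) for x in dft(noise)]
enhanced = spectral_subtraction(noisy, noise_magnitude)
```

In this toy case the speech and noise occupy different frequency bins, so the subtraction removes the noise almost exactly. When they overlap in frequency, as the text notes they generally do, only the magnitude is corrected and the noisy phase remains, which is one source of the residual distortion associated with this family of methods.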


