Dirk Van Compernolle
K.U. Leuven, ESAT, Heverlee, Belgium
Speech enhancement in the past decades has focused on the suppression of additive background noise. From a signal processing point of view additive noise is easier to deal with than convolutive noise or nonlinear disturbances. Moreover, due to the bursty nature of speech, it is possible to observe the noise by itself during speech pauses, which can be of great value.
Speech enhancement is a very special case of signal estimation: speech is nonstationary, and the human ear, the final judge, does not believe in a simple mathematical error criterion. Therefore subjective measurements of intelligibility and quality are required.
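Thus the goal of speech enhancement is to find an optimal estimate $\hat{s}(t)$ of the clean speech $s(t)$ (i.e., the estimate preferred by a human listener), given a noisy measurement $y(t) = s(t) + n(t)$, where $n(t)$ is the additive noise. Overviews of the field can be found in [Eph92] and [VC92].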
The relative unimportance of phase for speech quality has given rise to a family of speech enhancement algorithms based on spectral magnitude estimation. These are frequency-domain estimators in which an estimate of the clean-speech spectral magnitude is recombined with the noisy phase before resynthesis with a standard overlap-add procedure (see the figure below).
Figure: Speech Enhancement by Spectral Magnitude Estimation.
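The processing chain in the figure can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited work: the frame length, the 50% overlap, the periodic Hann window, and the callback name `estimate_magnitude` are all choices made here for the example.

```python
import numpy as np

def enhance_by_magnitude(noisy, estimate_magnitude, frame_len=256, hop=128):
    """Analysis, magnitude modification, and overlap-add resynthesis.

    noisy: 1-D NumPy array holding the noisy signal.
    estimate_magnitude: hypothetical callback that maps a noisy magnitude
        spectrum to an estimate of the clean magnitude spectrum.
    """
    # Periodic Hann window: at 50% overlap its shifted copies sum to one,
    # so plain overlap-add reconstructs an unmodified signal.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spectrum = np.fft.rfft(frame)
        magnitude = estimate_magnitude(np.abs(spectrum))
        # Recombine the estimated magnitude with the unmodified noisy phase.
        modified = magnitude * np.exp(1j * np.angle(spectrum))
        out[start:start + frame_len] += np.fft.irfft(modified, n=frame_len)
    return out
```

Passing the identity function as `estimate_magnitude` returns the input unchanged, except for the first and last half frame, which are covered by only one window. Any of the estimators discussed in this section could be plugged in as the callback.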
The name spectral subtraction is loosely used for many of the algorithms falling in this class [Bol79,BSM79].
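Power spectral subtraction is the simplest of all variants. It makes use of the fact that the power spectra of additive independent signals are also additive and that this property holds approximately for short-time estimates as well. Hence, in the case of stationary noise, it suffices to subtract the mean noise power to obtain a least squares estimate of the power spectrum:
$$\hat{P}_s(f) = P_y(f) - \bar{P}_n(f)$$
where $P_y(f)$ is the short-time power spectrum of the noisy signal, $\bar{P}_n(f)$ the mean noise power, and $\hat{P}_s(f)$ the estimated power spectrum of the clean speech.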
The greatest asset of spectral subtraction is its simplicity: all that is required is an estimate of the mean noise power, and the algorithm needs no assumptions about the signal. At the same time, the latter is its great weakness. Within this framework, occasional negative estimates of the power spectrum can occur. To make the estimates consistent, some artificial flooring is required, which yields a very characteristic musical noise caused by the isolated patches of energy that remain in the time-frequency representation.
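Subtraction and flooring can be illustrated as follows. This is a minimal sketch: the function names, the boolean pause mask (which would come from some speech/pause detector not shown here), and the floor expressed as a fraction of the noise power are assumptions made for the example.

```python
import numpy as np

def mean_noise_power(noisy_power, is_pause):
    """Average the power spectra of the frames marked as speech pauses.

    noisy_power: array (frames, bins) of short-time power spectra.
    is_pause: boolean array (frames,), assumed to come from a pause detector.
    """
    return noisy_power[is_pause].mean(axis=0)

def power_spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract the mean noise power and clip the result to a spectral floor.

    floor: assumed fraction of the noise power used as the lower bound.
    """
    estimate = noisy_power - noise_power
    # Negative (or very small) estimates are replaced by the floor value.
    return np.maximum(estimate, floor * noise_power)
```

Bins where the noisy power happens to fall below the mean noise power are clipped to the floor, while neighbouring bins that exceed it survive. These isolated survivors are the patches of energy heard as musical noise.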
Much effort has been put into reducing this annoying musical noise. One effective way is to smooth the short-time spectra over time. This has the adverse side effect, however, of introducing echoes. While it reduces the average level of the background noise substantially, plain spectral subtraction has been rather ineffective in improving intelligibility and quality for broadband background noise.
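One simple form of smoothing over time is a first-order recursive average across frames. The sketch below is only an illustration of the idea; the recursion and the smoothing constant `alpha` are assumptions made here, not a scheme prescribed by the cited work.

```python
import numpy as np

def smooth_over_time(power_frames, alpha=0.7):
    """First-order recursive smoothing of short-time spectra across frames.

    power_frames: array (frames, bins) of short-time power spectra.
    alpha: assumed smoothing constant in [0, 1); larger values smooth more.
    """
    smoothed = np.empty_like(power_frames, dtype=float)
    smoothed[0] = power_frames[0]
    for t in range(1, len(power_frames)):
        # Each output frame mixes the previous output with the current input.
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_frames[t]
    return smoothed
```

A burst of energy confined to one frame is spread over the frames that follow it, with geometrically decaying weight. This suppresses isolated peaks, but the same spreading applied to speech is what is perceived as an echo.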
Power spectral subtraction is a minimum mean square estimator with few or no assumptions about the prior distributions of the power spectral values of speech and noise. This is the underlying reason why ad hoc operations like clipping are necessary. Within the framework of spectral magnitude estimation, two major improvements are: (i) modeling realistic a priori statistical distributions of the speech and noise spectral magnitude coefficients [EM84], and (ii) minimizing the estimation error in a domain that is perceptually more relevant than the power spectral domain (e.g., the log magnitude domain) [PB84,EM85,VC89].
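To make the role of the error domain concrete, write $A$ for the clean spectral magnitude in one frequency bin and $Y$ for the noisy observation (this notation is assumed here for illustration and is not taken from the cited papers). Minimizing the mean square error in a given domain yields the conditional mean in that domain:

```latex
\begin{align*}
\hat{A}_{\mathrm{mag}} &= \arg\min_{\hat{A}} E\left[(A - \hat{A})^2 \mid Y\right]
  = E[A \mid Y], \\
\hat{A}_{\mathrm{log}} &= \arg\min_{\hat{A}} E\left[(\log A - \log \hat{A})^2 \mid Y\right]
  = \exp\left(E[\log A \mid Y]\right).
\end{align*}
```

Evaluating these conditional means is where the a priori distributions of speech and noise enter. By Jensen's inequality, $\exp(E[\log A \mid Y]) \le E[A \mid Y]$, so for the same statistical model the log-domain estimate is never larger than the magnitude-domain one.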
Minimum mean square error estimators (MMSEEs) have been developed under various assumptions such as Gaussian sample distributions, lognormal distribution of spectral magnitudes, etc. While improving on quality, these estimators tend to be complex and computationally demanding.
The first generation of MMSEEs used a single distribution to model all speech and another to model all noise. Significant improvement is still possible if one takes into account the nonstationarity of the speech signal (and of the noise). The use of local speech models implies much smaller variances in the models and tighter estimates. There are two possible approaches: (i) the incoming speech is aligned with an ergodic (fully connected) HMM in which a separate MMSEE is associated with each state [EMJ89], or (ii) the parameters of a simple parametric speech model are continuously adapted on the basis of the observations [XVC93]. In the first approach, a set of possible states has to be created during a training phase, and this set should be complete. In the second approach, no explicit training is required, but a simpler model may be needed to make the continuous parameter updates feasible.
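The idea behind the first approach can be pictured as follows: if each state carries its own speech power spectrum, a gain can be computed per state and the gains combined according to how likely each state is in each frame. The sketch below is a deliberate simplification and not the method of [EMJ89]. It assumes the state probabilities are already available (the input `state_posteriors` is hypothetical, e.g. the output of an HMM decoding pass not shown here), and it uses a simple gain of the form speech power over speech-plus-noise power as a stand-in for the per-state MMSEE.

```python
import numpy as np

def state_weighted_gain(state_speech_power, noise_power, state_posteriors):
    """Combine per-state spectral gains using per-frame state probabilities.

    state_speech_power: array (states, bins), one speech model per state.
    noise_power: array (bins,), mean noise power.
    state_posteriors: array (frames, states), rows sum to one (assumed given).
    Returns an array (frames, bins) of gains to apply to the noisy spectra.
    """
    # One gain curve per state; a stand-in for a state-specific estimator.
    gains = state_speech_power / (state_speech_power + noise_power)
    # Weighted sum over states for every frame.
    return state_posteriors @ gains
```

With one-hot rows in `state_posteriors` this reduces to a hard alignment, where each frame uses the estimator of exactly one state.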
Clearly, neither the state association nor the parameter updates are trivial operations, and they add another level of complexity to the spectral estimation problem. A side effect of these methods is that they require dynamic time alignment, which is inherently noncausal. While at most a few frames of extra delay are inserted, this may be a concern in some applications.
The Wiener filter obtains a least squares estimate of the clean speech signal $s(t)$ under stationarity assumptions of speech and noise. The construction of the Wiener filter requires an estimate of the power spectrum of the clean speech, $P_s(f)$, and of the noise, $P_n(f)$:
$$W(f) = \frac{P_s(f)}{P_s(f) + P_n(f)}$$
The previous discussion on global and local speech and noise models equally applies to Wiener filtering. Wiener filtering has the disadvantage, however, that the estimation criterion is fixed.
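In the frequency domain the Wiener filter is a real gain between zero and one per frequency bin, equal to the speech power divided by the sum of speech and noise power. A minimal sketch, assuming per-bin power spectrum estimates are already available; the function names and the small constant `eps` that guards against division by zero are illustrative choices.

```python
import numpy as np

def wiener_gain(speech_power, noise_power, eps=1e-12):
    """Per-bin Wiener gain P_s / (P_s + P_n); eps avoids division by zero."""
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return speech_power / (speech_power + noise_power + eps)

def apply_wiener(noisy_spectrum, speech_power, noise_power):
    """Scale each bin of a noisy complex spectrum by the Wiener gain."""
    return wiener_gain(speech_power, noise_power) * noisy_spectrum
```

Bins dominated by speech pass almost unchanged, while bins dominated by noise are attenuated towards zero. Because the gain is real, the noisy phase is left untouched, as in the spectral magnitude estimators above.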
Microphone arrays exploit the fact that a speech source is quite stationary and can therefore, by using beamforming techniques, suppress nonstationary interferences more effectively than any single-sensor system. The simplest of all approaches is the delay-and-sum beamformer, which phase-aligns the incoming wavefronts of the desired source before adding them together [Fla85]. This type of processing is robust and needs only limited computational hardware, but requires a large number of microphones to be effective. An easy way to achieve uniform improvement over the wide speech bandwidth is to use a subband approach together with a logarithmically spaced array, in which different sets of microphones are selected to cover the different frequency ranges [Sil87].
A much more complex alternative is the adaptive beamformer, in which each incoming signal is adaptively filtered before the signals are added together. These arrays are most powerful if the noise source itself is directional. While intrinsically much more powerful than delay-and-sum, the adaptive beamformer is prone to signal distortion in strong reverberation.
A third class of beamformers is a mix of the previous schemes. A number of digital filters are predesigned for optimal wideband performance for a set of look directions. The adaptation then consists of selecting the optimal filter at any given moment using a proper tracking mechanism. Under typical reverberant conditions, this last approach may prove the best overall solution, since it combines the robustness of a simple method with the power of digital filtering.
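The delay-and-sum beamformer mentioned above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: the arrival delay of the desired source at each microphone is taken as known and as a non-negative whole number of samples, whereas a practical system would have to estimate the delays and handle fractional values.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each microphone signal on the desired source, then average.

    signals: array (mics, samples), one row per microphone.
    delays: arrival delay of the desired source at each microphone,
        in samples (assumed known, integer, and non-negative).
    """
    mics, n = signals.shape
    out = np.zeros(n)
    for channel, delay in zip(signals, delays):
        # Advance the channel by its delay so the wavefronts line up.
        out[:n - delay] += channel[delay:]
    return out / mics
```

The desired source adds coherently across channels, while uncorrelated noise adds incoherently, so the signal-to-noise ratio grows only slowly with the number of microphones. This is consistent with the need for a large array noted above.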
While potentially very powerful, microphone arrays bring about a significant hardware cost due to the number of microphones and/or adaptive filters required. Finally, in addition to suppressing noise, microphone arrays help to dereverberate the signals.
The most substantial progress in the past decade has come from the incorporation of a model of the nonstationary speech signal into the spectral subtraction and Wiener filtering frameworks. The models under consideration have mostly been quite simple. It may be expected that the use of more complex models, borrowed from speech recognition work, will take us even further (cf. section 1.4). This line of work is promising from a quality point of view but implies much greater computational complexity as well. At the same time, these models may have problems in dealing with events that did not occur during the training phase. Therefore the truly successful approaches will be those that strike the optimal balance between a model of the speech signal detailed enough to give a high-quality estimator and one weak enough to allow for plenty of uncertainty.
Microphone arrays are promising but expensive, and they need further development. The combination of single-sensor and multiple-sensor noise suppression techniques remains a virtually unexplored field.