A better-motivated approach is to consider the problem of finding the maximum-likelihood estimates of the speech and the model parameters given the noisy data. However, even for linear models, this represents a difficult nonlinear optimization problem. Lim and Oppenheim [16] proposed finding an approximate maximum a posteriori solution by iterating between Wiener filtering the noisy speech and fitting an LPC model by least squares. Since then, a number of researchers have proposed variations on this method, which include using the Expectation-Maximization (EM) algorithm [44,45,46,47], accommodating colored noise [48], and placing perceptual constraints on the iterative search [49].
Wan and Nelson [17] have proposed a related approach, where
neural autoregressive models are used. The speech model is the same
nonlinear autoregression as in Equations 4 and 5
of Section 14.2.2. However, to avoid using a single model to
describe the entire nonstationary speech signal (or requiring the
complexity of model-switching methods), the speech is windowed into
approximately stationary segments, with a different model used for
each segment. With a state-space representation of the speech model,
the EKF method discussed in Section 14.4.2 gives the
maximum-likelihood estimate of the speech assuming the model is known.
However, as no clean data set is used, the model parameters themselves
must now be learned on-line from the noisy data for each window of
speech. To allow the simultaneous estimation of the speech model and
speech signal, a separate set of state-equations for the parameters of
the neural network (weight vector $\mathbf{w}$) is formulated:
$$\mathbf{w}(k+1) = \mathbf{w}(k) \tag{17}$$
$$y(k) = f(\mathbf{x}(k-1), \mathbf{w}(k)) + v(k) + n(k) \tag{18}$$
This weight EKF can be run in parallel with the EKF for state
estimation, resulting in the Dual Extended Kalman Filter (Dual
EKF) [51], shown in Figure 14.9. At each time step, the
current state estimate $\hat{\mathbf{x}}$ is used by the weight filter, and the
current weight estimate $\hat{\mathbf{w}}$ is used by the state filter.
This provides a very effective method for computing
maximum-likelihood estimates of the speech signal given only the
noisy data. Additional issues related to recurrent training, error
coupling, the relationship of the algorithm to EM, and a
two-observation form of the weight EKF are discussed in
[51,52].
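As a rough illustration of how the two filters exchange estimates, the following Python sketch runs a scalar Dual EKF on synthetic data. It is a toy under stated assumptions, not the implementation of [51]: the "network" is a hypothetical one-weight model f(x, w) = tanh(w x) with a single lag, both noise variances are treated as known, a small artificial weight process noise `q_w` is added to keep the weight filter adaptive, and the recurrent-derivative and error-coupling issues mentioned above are ignored. All function and variable names are illustrative.

```python
import math
import random


def f(x_prev, w):
    """Toy one-weight 'network' (an assumption): predicts x(k) from x(k-1)."""
    return math.tanh(w * x_prev)


def dual_ekf(y, sigma_v2, sigma_n2, w0=0.5, q_w=1e-4):
    """Run a scalar state EKF and weight EKF side by side on noisy samples y."""
    x_hat, p_x = y[0], sigma_n2      # state estimate and its error variance
    w_hat, p_w = w0, 1.0             # weight estimate and its error variance
    x_out, w_out = [x_hat], [w_hat]
    for yk in y[1:]:
        pred = f(x_hat, w_hat)       # one-step prediction shared by both filters
        slope = 1.0 - pred * pred    # derivative of tanh at w_hat * x_hat

        # Weight filter: w(k+1) = w(k), y(k) = f(x(k-1), w(k)) + v(k) + n(k).
        # It uses the current state estimate x_hat in place of x(k-1).
        p_w_pred = p_w + q_w
        h = x_hat * slope            # df/dw at the current estimates
        k_w = p_w_pred * h / (h * h * p_w_pred + sigma_v2 + sigma_n2)
        w_new = w_hat + k_w * (yk - pred)
        p_w = (1.0 - k_w * h) * p_w_pred

        # State filter: x(k) = f(x(k-1), w) + v(k), y(k) = x(k) + n(k).
        # It uses the current weight estimate w_hat in place of w.
        a = w_hat * slope            # df/dx at the current estimates
        p_pred = a * a * p_x + sigma_v2
        k_x = p_pred / (p_pred + sigma_n2)
        x_hat = pred + k_x * (yk - pred)
        p_x = (1.0 - k_x) * p_pred

        w_hat = w_new
        x_out.append(x_hat)
        w_out.append(w_hat)
    return x_out, w_out


# Synthetic data from the same toy model (all values are illustrative).
random.seed(0)
w_true, sigma_v2, sigma_n2 = 1.5, 0.05, 0.2
x, clean, noisy = 0.0, [], []
for _ in range(500):
    x = f(x, w_true) + random.gauss(0.0, math.sqrt(sigma_v2))
    clean.append(x)
    noisy.append(x + random.gauss(0.0, math.sqrt(sigma_n2)))

x_hat, w_hat = dual_ekf(noisy, sigma_v2, sigma_n2)
```

In the weight filter the observation noise variance is the sum of the process and measurement noise variances, because both v(k) and n(k) appear in the weight observation equation. A multi-lag network would replace the scalars with vectors, Jacobians, and covariance matrices.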
The result of applying the Dual EKF to a speech signal corrupted with
simulated nonstationary bursting noise is shown in
Figure 14.10. The method was applied to successive 64 ms
(512 point) windows of the signal, with a new window starting every
8 ms (64 points). A normalized Hamming window was used to
emphasize data in the center of the window, and deemphasize data in
the periphery. Feedforward
networks with 10 inputs, 4 hidden units, and 1 output were used. Weights typically
converged in fewer than 20 epochs. The results in the figure were
computed assuming both the process-noise variance $\sigma_v^2$ and the
measurement-noise variance $\sigma_n^2$ were known. The
average SNR is improved by 9.94 dB, with little resultant
distortion. When $\sigma_v^2$ and $\sigma_n^2$
are estimated using only the noisy signal,
similar results are achieved with an SNR improvement of 8.50 dB. In
comparison, classical techniques of spectral
subtraction [19] and adaptive RASTA processing
[53] achieve SNR improvements of only 0.65 and
1.26 dB, respectively. Experiments where real-world colored noise is
added to the signal have also been performed. An advantage of the
Kalman framework is that colored noise can be elegantly addressed by
incorporating an additional state-space representation of the noise
process. This modification affects both the state-estimation and the
weight estimation equations.
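The windowing scheme described above (512-point windows advanced by 64 points, which corresponds to an 8 kHz sampling rate) can be sketched as a weighted overlap-add loop. This is a minimal sketch under an assumption: "normalized" is taken here to mean that the Hamming weights contributed by all windows covering a sample are divided out so that they sum to one, which may differ from the normalization actually used. The function names and the `process` callback are hypothetical.

```python
import math


def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def overlap_add(signal, process, win_len=512, hop=64):
    """Enhance each window with `process`, then blend with Hamming weights.

    `process` maps a list of win_len noisy samples to win_len estimates.
    Samples near the window center receive more weight than the periphery.
    """
    win = hamming(win_len)
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for start in range(0, len(signal) - win_len + 1, hop):
        estimate = process(signal[start:start + win_len])
        for i in range(win_len):
            out[start + i] += win[i] * estimate[i]
            weight[start + i] += win[i]
    # Normalize by the accumulated weights; samples that no window covers
    # (a possible short tail) fall back to the unprocessed input.
    return [o / w if w > 0.0 else s for o, w, s in zip(out, weight, signal)]
```

Here `process` would be a per-window enhancer, for example a Dual EKF that fits a fresh model to each segment. With an identity `process`, the loop returns the input signal, which is a quick check that the normalization is consistent.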
In principle, this method can accommodate any speaker, noise, or noise level encountered. In this sense, it is more in the spirit of spectral subtraction, which works independently of the type of signal it is estimating. However, like spectral subtraction, the Dual EKF algorithm requires estimation of noise statistics.
While the approach does away with the need for a training set, there is considerable computational cost in training the neural networks on-line. Furthermore, the windowing of the data, which addresses the nonstationarity issue, also limits the size of the network structures that can be used. While compact models are sufficient for small windows of speech (e.g., vocoder technology), this also raises the question of whether the approach fully utilizes the flexibility of neural modeling.
A possible direction of research that addresses some of these issues is an intermediate approach that makes some use of pre-trained models. This would be a state-dependent approach that selects among pre-trained class-based models using an HMM (see Section 14.4.2), and then adapts the selected model on-line to the noisy data. This could produce faster convergence, avoid the need to explicitly window the data, and allow larger networks to be used.