An alternative approach to time-domain filtering combines predictive
modeling with a state-space estimation procedure. The underlying
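assumption is that the noisy speech $y(k)$ can be accurately modeled
as a nonlinear autoregression (AR) with both process and additive
observation noise:

$$x(k) = f\big(x(k-1), \ldots, x(k-M), \mathbf{w}\big) + v(k) \qquad (4)$$

$$y(k) = x(k) + n(k) \qquad (5)$$

Here $x(k)$ is the clean speech, $v(k)$ the process noise, $n(k)$ the
observation noise, $M$ the model order, and $\mathbf{w}$ the model
parameters. Note that if $f$ is linear, the AR formulation corresponds to the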
classic Linear Predictive Coding (LPC) model of speech. It has been
demonstrated (see Tishby [13]) that neural models are
better at capturing the dynamics of speech than simple linear models.
The model parameters $\mathbf{w}$ can be found by training the neural network
in a predictive mode on clean speech. As in the previous section, we
consider using only a single network for the entire speech training
set. This is primarily for the sake of explaining the extended Kalman
filtering (EKF) method. The use of different predictive models on
short-term windows to account for the nonstationarity of speech will
be discussed later, in Section 14.4.
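The signal model described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `simulate_noisy_ar`, the coefficients in `a`, and the use of white Gaussian noise for both the process and observation terms are hypothetical choices, and a `tanh` of a linear combination stands in for a trained neural predictor.

```python
import numpy as np

def simulate_noisy_ar(f, M, n_samples, sigma_v, sigma_n, seed=0):
    """Generate clean samples x(k) from an order-M autoregression driven by
    process noise, plus noisy observations y(k) = x(k) + n(k).
    Both noises are assumed white and Gaussian (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_samples + M)          # first M entries: zero initial conditions
    for k in range(M, n_samples + M):
        past = x[k - M:k][::-1]          # [x(k-1), ..., x(k-M)]
        x[k] = f(past) + sigma_v * rng.standard_normal()
    x = x[M:]                            # drop the initial conditions
    y = x + sigma_n * rng.standard_normal(n_samples)
    return x, y

# Hypothetical predictor coefficients, chosen only to give a stable model.
a = np.array([0.8, -0.2])

def f_linear(past):
    """Linear predictor: the LPC-style special case."""
    return float(a @ past)

def f_nonlinear(past):
    """Nonlinear predictor: a stand-in for a trained neural network."""
    return float(np.tanh(a @ past))

clean, noisy = simulate_noisy_ar(f_nonlinear, M=2, n_samples=200,
                                 sigma_v=0.1, sigma_n=0.3)
```

Swapping `f_nonlinear` for `f_linear` gives the linear AR (LPC) case mentioned above; nothing else in the generator changes.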
Given a linear model $f$, the well-known Kalman filter algorithm [14]
optimally combines noisy observations $y(k)$ at each time step with
predictions $\hat{x}^-(k)$ (based on previous observations) to produce the
linear least squares estimate of the speech $\hat{x}(k)$. In the
linear case with Gaussian statistics, the estimates are the minimum
mean-square error estimates. With no prior information on $x$, they reduce
to the maximum-likelihood estimates. To apply the Kalman filter, we
must first put Equations 4 and 5 in state-space
form:
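
$$\mathbf{x}(k) = F[\mathbf{x}(k-1)] + \mathbf{B}\,v(k) \qquad (6)$$

$$y(k) = \mathbf{C}\,\mathbf{x}(k) + n(k) \qquad (7)$$

where

$$\mathbf{x}(k) = \begin{bmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(k-M+1) \end{bmatrix}, \quad
F[\mathbf{x}(k-1)] = \begin{bmatrix} f\big(x(k-1), \ldots, x(k-M), \mathbf{w}\big) \\ x(k-1) \\ \vdots \\ x(k-M+1) \end{bmatrix}, \quad
\mathbf{C} = [\,1 \;\; 0 \;\cdots\; 0\,], \quad \mathbf{B} = \mathbf{C}^T \qquad (8)$$
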
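As a rough illustration of the state-space form, the sketch below builds a state-transition function and the two selection vectors from a scalar predictor. It assumes the usual delay-line arrangement (newest sample first in the state vector); the names `make_state_space`, `F`, `B`, and `C` are hypothetical and chosen for this sketch.

```python
import numpy as np

def make_state_space(f, M):
    """Wrap a scalar predictor f([x(k-1), ..., x(k-M)]) into state-space form.
    The state is assumed to be the delay line [x(k), ..., x(k-M+1)]."""
    def F(state):
        # New first entry is the prediction; the rest is the old state shifted down.
        return np.concatenate(([f(state)], state[:-1]))

    C = np.zeros((1, M))
    C[0, 0] = 1.0          # the observation picks out the newest sample
    B = C.T.copy()         # process noise drives only the newest sample
    return F, B, C

# Toy predictor (hypothetical): sum of the past samples.
F, B, C = make_state_space(lambda s: float(np.sum(s)), M=3)
next_state = F(np.array([1.0, 2.0, 3.0]))   # first entry is the prediction 6.0
```

Only the first row of the transition is nonlinear; the remaining rows are a pure shift, which is what keeps the later linearization cheap.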
Because the neural network model is nonlinear, the Kalman filter
cannot be applied directly, but requires a linearization of the
nonlinear model at each time step. The resulting algorithm is
known as the extended Kalman filter (EKF) [14], and
effectively approximates the nonlinear function with a time-varying
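linear one. Letting $\sigma_v^2$ and $\sigma_n^2$ represent
the variances of the process noise $v(k)$ and observation noise $n(k)$,
respectively, the EKF algorithm is as follows:

$$\hat{\mathbf{x}}^-(k) = F[\hat{\mathbf{x}}(k-1)] \qquad (9)$$

$$\mathbf{P}^-(k) = \mathbf{A}\,\mathbf{P}(k-1)\,\mathbf{A}^T + \mathbf{B}\,\sigma_v^2\,\mathbf{B}^T, \quad \text{where } \mathbf{A} = \frac{\partial F[\hat{\mathbf{x}}(k-1)]}{\partial \hat{\mathbf{x}}(k-1)} \qquad (10)$$

$$\mathbf{G}(k) = \mathbf{P}^-(k)\,\mathbf{C}^T \big(\mathbf{C}\,\mathbf{P}^-(k)\,\mathbf{C}^T + \sigma_n^2\big)^{-1} \qquad (11)$$

$$\mathbf{P}(k) = \mathbf{P}^-(k) - \mathbf{G}(k)\,\mathbf{C}\,\mathbf{P}^-(k) \qquad (12)$$

$$\hat{\mathbf{x}}(k) = \hat{\mathbf{x}}^-(k) + \mathbf{G}(k)\big(y(k) - \mathbf{C}\,\hat{\mathbf{x}}^-(k)\big) \qquad (13)$$
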
Equations 9 and 10, often referred to as
the "Time-Update," produce the a priori prediction of the next
value of the state $\hat{\mathbf{x}}^-(k)$, along with the error covariance of this
prediction, $\mathbf{P}^-(k)$. Equation 13 combines the prediction
with the noisy observation $y(k)$, using the Kalman gain
term $\mathbf{G}(k)$ to provide the optimal tradeoff. The Kalman gain makes use
of the error covariance of the prediction ($\mathbf{P}^-(k)$) and the
noise variance of the observation ($\sigma_n^2$) to weight the
influences of the prediction and the observation. The error covariance
of the new a posteriori estimate $\mathbf{P}(k)$ is computed in
Equation 12, and is necessary to continue the recursion.
These last three equations
(11, 12, 13) are collectively
called the "Measurement-Update."
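The recursion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard EKF time-update and measurement-update for a scalar observation of the newest state element, uses a finite-difference Jacobian in place of an analytic network derivative, and starts from a zero state with identity covariance. The names `ekf_denoise`, `numerical_jacobian`, `sigma_v2`, and `sigma_n2` are hypothetical.

```python
import numpy as np

def numerical_jacobian(F, x, eps=1e-6):
    """Forward-difference Jacobian of F at x (stand-in for an analytic derivative)."""
    Fx = F(x)
    A = np.zeros((x.size, x.size))
    for j in range(x.size):
        dx = np.zeros(x.size)
        dx[j] = eps
        A[:, j] = (F(x + dx) - Fx) / eps
    return A

def ekf_denoise(y, f, M, sigma_v2, sigma_n2):
    """Estimate the clean signal from noisy samples y using predictor f.
    sigma_n2 may be a scalar or one value per sample (time-varying noise)."""
    y = np.asarray(y, dtype=float)
    sn2 = np.broadcast_to(np.asarray(sigma_n2, dtype=float), (len(y),))
    C = np.zeros((1, M))
    C[0, 0] = 1.0
    B = C.T
    F = lambda s: np.concatenate(([f(s)], s[:-1]))   # delay-line state transition

    x_hat = np.zeros(M)      # assumed initial state
    P = np.eye(M)            # assumed initial error covariance
    out = np.empty(len(y))
    for k in range(len(y)):
        # Time update: a priori state and error covariance.
        A = numerical_jacobian(F, x_hat)
        x_pred = F(x_hat)
        P_pred = A @ P @ A.T + sigma_v2 * (B @ B.T)
        # Measurement update: gain, a posteriori covariance and state.
        S = (C @ P_pred @ C.T).item() + sn2[k]
        G = (P_pred @ C.T) / S                       # shape (M, 1)
        P = P_pred - G @ C @ P_pred
        x_hat = x_pred + G[:, 0] * (y[k] - x_pred[0])
        out[k] = x_hat[0]
    return out

# Usage with a hypothetical nonlinear predictor and synthetic observations.
f = lambda s: float(np.tanh(0.8 * s[0] - 0.2 * s[1]))
y_demo = np.random.default_rng(0).standard_normal(40)
x_est = ekf_denoise(y_demo, f, M=2, sigma_v2=0.1, sigma_n2=0.5)
```

Passing an array for `sigma_n2` lets the observation-noise variance change from sample to sample without retraining the predictor.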
Note that the state-space and Kalman equations can be modified to
accommodate colored noise [14], or to have a fixed
lag to produce a noncausal estimate [15].
The linearization in Equation 10 is required for
propagation of the error covariances $\mathbf{P}^-(k)$ and $\mathbf{P}(k)$,
but results in a suboptimal filter. In fact, for a stationary signal
with fixed SNR, the direct nonlinear time-domain filtering of the
previous subsection may provide a better estimate.
However, with the EKF, the time-specific SNR
enters through the parameter $\sigma_n^2$.
Thus the EKF avoids the need to train over a range of possible noise
levels (only clean speech is needed for training). This enables the
EKF to easily handle cases where the noise is nonstationary, or where
a broad range of noise levels might be encountered.
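To see how the observation-noise variance steers the estimate, consider a first-order (scalar) state as a simplified illustration. The symbols follow standard Kalman notation and are assumptions of this sketch: $P^-(k)$ is the prediction error variance, $G(k)$ the Kalman gain, $\hat{x}^-(k)$ the prediction, and $\sigma_n^2(k)$ the observation-noise variance at time $k$.

$$G(k) = \frac{P^-(k)}{P^-(k) + \sigma_n^2(k)}, \qquad \hat{x}(k) = \hat{x}^-(k) + G(k)\,\big[y(k) - \hat{x}^-(k)\big]$$

At high SNR ($\sigma_n^2(k) \to 0$) the gain approaches 1 and the estimate follows the observation $y(k)$; at low SNR ($\sigma_n^2(k) \to \infty$) the gain approaches 0 and the estimate falls back on the model prediction. The predictive model itself is unchanged in both cases.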
The downside is that both the variance of the speech
innovations $\sigma_v^2$ and the noise variance $\sigma_n^2$
(or equivalently, the SNR) must be estimated on-line from the noisy
data. The speech innovations variance $\sigma_v^2$ may be
estimated from the expression for the inverse Fourier transform of the
signal power spectrum given an LPC model [16]. Alternatively, an expression may be
derived by noting the relationship between the minimum mean-squared
prediction error for clean speech versus speech with additive noise
[17]. Estimating the innovations variance also requires knowledge of
the SNR. Maximum-likelihood based methods for estimating these
quantities can be derived from the Expectation Maximization (EM)
approach [18]. Methods for estimating the SNR (or the full
spectrum of the noise), which are motivated by speech
perception, are discussed in the next section on spectral methods.
An additional advantage of the predictive approach is that the state-space (or AR) model provides a more compact representation. The EKF forms a recursive structure, and thus requires fewer inputs than the direct filtering approach. This tradeoff is analogous to that seen for linear FIR (finite-impulse-response) filters versus IIR (infinite-impulse-response) filters.
While this EKF method provides a reasonable alternative to the direct filtering method, it should be noted that the approach as it appears in this section has not been reported on in the literature (perhaps because of its assumption of stationarity). Consequently, we do not include experimental results here. The basic technique, however, will form the basis for two important variations that also account for the nonstationarity of speech, to be discussed in Sections 14.4 and 14.5.