

Extended Kalman Filtering with Predictive Models


An alternative approach to time-domain filtering combines predictive modeling with a state-space estimation procedure. The underlying assumption is that the noisy speech $ y_k$ can be accurately modeled as a nonlinear autoregression (AR) with both process and additive observation noise:

$\displaystyle x_k$ $\displaystyle =$ $\displaystyle f(x_{k-1},\ldots,x_{k-M},{\bf w}) + v_k$ (4) 
$\displaystyle y_k$ $\displaystyle =$ $\displaystyle x_k + n_k,$ (5) 

where $ x_k$ corresponds to the true underlying speech signal driven by process noise $ v_k$, and $ f(\cdot)$ is a nonlinear (neural network) function of past values of $ x_k$ and parameters $ {\bf w}$. As before, $ y_k$ is the corrupted speech signal, which contains additive noise $ n_k$. Both $ v_k$ and $ n_k$ are initially assumed to be white (though not necessarily Gaussian). In principle, a channel could be incorporated by replacing Equation 5 with Equation 1; this would require an additional function approximator to represent $ h(\cdot)$. For simplicity, we will ignore channel effects in this section.
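As an illustration of Equations 4 and 5, the following minimal sketch generates a clean sequence and its noisy observation. It is not taken from the text: the tiny tanh network `f`, its weights, the zero initial condition, and the choice of Gaussian noise (the text only requires white noise) are all assumptions made for the example.

```python
import math
import random

def f(past, w):
    """Stand-in predictor (an assumption): a tiny one-hidden-layer tanh network.

    past = [x_{k-1}, ..., x_{k-M}]; w = (hidden_weights, output_weights).
    """
    hidden_w, out_w = w
    hidden = [math.tanh(sum(a * x for a, x in zip(row, past))) for row in hidden_w]
    return sum(c * h for c, h in zip(out_w, hidden))

def simulate(w, M, n_samples, sigma_v, sigma_n, seed=0):
    """Generate clean x_k (Eq. 4) and noisy y_k (Eq. 5) with white Gaussian noise."""
    rng = random.Random(seed)
    past = [0.0] * M                                # x_{k-1}, ..., x_{k-M}
    xs, ys = [], []
    for _ in range(n_samples):
        x = f(past, w) + rng.gauss(0.0, sigma_v)    # Eq. 4: prediction + process noise
        y = x + rng.gauss(0.0, sigma_n)             # Eq. 5: additive observation noise
        xs.append(x)
        ys.append(y)
        past = [x] + past[:-1]                      # shift the delay line by one sample
    return xs, ys

# Arbitrary illustrative weights for M = 2 inputs and two hidden units.
w = ([[1.2, -0.5], [0.3, 0.8]], [0.9, -0.4])
xs, ys = simulate(w, M=2, n_samples=200, sigma_v=0.1, sigma_n=0.3)
```

Here `sigma_v` and `sigma_n` are standard deviations, so the variances used later in the filter would be their squares.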

Note that if $ f(\cdot)$ is linear, the AR formulation corresponds to the classic Linear Predictive Coding (LPC) model of speech. It has been demonstrated (see Tishby [13]) that neural models are better at capturing the dynamics of speech than simple linear models.

The model parameters $ {\bf w}$ can be found by training the neural network in a predictive mode on clean speech. As in the previous section, we consider using only a single network for the entire speech training set. This is primarily for the sake of explaining the extended Kalman filtering (EKF) method. The use of different predictive models on short-term windows to account for the nonstationarity of speech will be discussed later, in Section 14.4.

Given a linear model $ f(\cdot)$, the well-known Kalman filter algorithm [14] optimally combines noisy observations $ y_k$ at each time step with predictions $ \hat{x}^-_k$ (based on previous observations) to produce the linear least squares estimate of the speech $ \hat{x}_k$. In the linear case with Gaussian statistics, the estimates are the minimum mean square estimates. With no prior information on $ x$, they reduce to the maximum-likelihood estimates. To apply the Kalman filter, we must first put Equations 4 and 5 in state-space form:

$\displaystyle {\bf x}_k$ $\displaystyle =$ $\displaystyle F[{\bf x}_{k-1}] + Bv_k,$ (6) 
$\displaystyle y_k$ $\displaystyle =$ $\displaystyle C{\bf x}_k + n_k,$ (7) 

where

$\displaystyle {\bf x}_k$ $\displaystyle = \left[ \begin{array}{l} x_k \\ x_{k-1} \\ \vdots \\ x_{k-M+1} \end{array} \right], \qquad F[{\bf x}_k] = \left[ \begin{array}{l} f(x_k,\ldots,x_{k-M+1},{\bf w}) \\ x_k \\ \vdots \\ x_{k-M+2} \end{array} \right],$
$\displaystyle C$ $\displaystyle = \left[ \begin{array}{cccc} 1 & 0 & \cdots & 0 \end{array} \right], \qquad B = C^T.$ (8) 
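The state-space form can be sketched in a few lines. This is only an illustration: the predictor `f` is a hypothetical stand-in for the trained network, and the weights and state values are arbitrary. The point is that $ F$ places the new prediction in the first entry and shifts the older samples down by one, while $ C$ reads out the first entry.

```python
import math

def f(past, w):
    """Stand-in predictor (an assumption): a tiny one-hidden-layer tanh network."""
    hidden_w, out_w = w
    return sum(c * math.tanh(sum(a * x for a, x in zip(row, past)))
               for row, c in zip(hidden_w, out_w))

def F(state, w):
    """Eq. 6 transition without noise: [f(state, w), state[0], ..., state[M-2]]."""
    return [f(state, w)] + state[:-1]

M = 3
C = [1.0] + [0.0] * (M - 1)   # Eq. 8: selects the first state entry
B = C[:]                      # B = C^T: process noise enters the first entry only

# Arbitrary illustrative weights (two hidden units, M = 3 inputs).
w = ([[1.2, -0.5, 0.1], [0.3, 0.8, -0.2]], [0.9, -0.4])

state = [0.5, -0.2, 0.1]      # [x_{k-1}, x_{k-2}, x_{k-3}]
nxt = F(state, w)             # [prediction of x_k, x_{k-1}, x_{k-2}]
observed = sum(c * s for c, s in zip(C, nxt))   # C x_k, the noise-free part of Eq. 7
```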

Because the neural network model is nonlinear, the Kalman filter cannot be applied directly; the nonlinear model must first be linearized at each time step. The resulting algorithm is known as the extended Kalman filter (EKF) [14], and effectively approximates the nonlinear function with a time-varying linear one. Letting $ \sigma^2_{v,k}$ and $ \sigma^2_{n,k}$ represent the variances of the process noise $ v_k$ and observation noise $ n_k$, respectively, the EKF algorithm is as follows:

$\displaystyle \hat{\bf x}^-_k$ $\displaystyle =$ $\displaystyle F[\hat{\bf x}_{k-1} , \hat{\bf w}_{k-1}]$ (9) 
$\displaystyle P^-_{\hat{\bf x},k}$ $\displaystyle =$ $\displaystyle AP_{\hat{\bf x},k-1}A^T + B\sigma^2_{v,k}B^T,$    where $\displaystyle \qquad A = \frac{\partial F[\hat{\bf x}_{k-1},\hat{\bf w}]}{\partial \hat{\bf x}_{k-1}}$ (10) 
$\displaystyle K_k$ $\displaystyle =$ $\displaystyle P^-_{\hat{\bf x},k}C^T(CP^-_{\hat{\bf x},k}C^T + \sigma^2_{n,k})^{-1}$ (11) 
$\displaystyle P_{\hat{\bf x},k}$ $\displaystyle =$ $\displaystyle (I - K_kC)P^-_{\hat{\bf x},k}$ (12) 
$\displaystyle \hat{\bf x}_k$ $\displaystyle =$ $\displaystyle \hat{\bf x}^-_k + K_k(y_k - C\hat{\bf x}^-_k).$ (13) 

Equations 9 and 10, often referred to as the ``Time-Update,'' produce the a priori prediction of the next value of the state $ {\bf x}_k$, along with the error covariance of this prediction, $ P^-_{\hat{\bf x},k}= \mathrm{Cov}({\bf x}_k -\hat{\bf x}^-_k)$. Equation 13 combines the prediction $ \hat{\bf x}^-_k$ with the noisy observation $ y_k$, using the Kalman gain term $ K_k$ to provide the optimal tradeoff. The Kalman gain makes use of the error covariance of the prediction ($ P^-_{\hat{\bf x},k}$) and the noise variance of the observation ($ \sigma^2_{n,k}$) to weight the influences of the prediction and the observation. The error covariance of the new a posteriori estimate $ \hat{\bf x}_k$ is computed in Equation 12, and is necessary to continue the recursion. These last three equations (11, 12, and 13) are collectively called the ``Measurement-Update.'' Note that the state-space and Kalman equations can be modified to accommodate colored noise [14], or to have a fixed lag to produce a noncausal estimate [15].
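One pass through the recursion of Equations 9 through 13 might look like the sketch below. It is an illustration under stated assumptions rather than a reference implementation: the predictor `f` is a hypothetical tanh network with arbitrary fixed weights, its gradient is computed analytically, and the noise variances `var_v` and `var_n` are simply supplied by the caller. The code exploits the structure of $ C$ and $ B$ in Equation 8, so that $ CP^-C^T$ is the top-left entry of the predicted covariance and $ P^-C^T$ is its first column.

```python
import math

def f(past, w):
    """Stand-in predictor (an assumption): a tiny one-hidden-layer tanh network."""
    hidden_w, out_w = w
    return sum(c * math.tanh(sum(a * x for a, x in zip(row, past)))
               for row, c in zip(hidden_w, out_w))

def grad_f(past, w):
    """Gradient of f with respect to each entry of `past` (tanh' = 1 - tanh^2)."""
    hidden_w, out_w = w
    g = [0.0] * len(past)
    for row, c in zip(hidden_w, out_w):
        h = math.tanh(sum(a * x for a, x in zip(row, past)))
        for i, a in enumerate(row):
            g[i] += c * (1.0 - h * h) * a
    return g

def ekf_step(x_hat, P, y, w, var_v, var_n):
    """One EKF iteration: returns the a posteriori state estimate and covariance."""
    M = len(x_hat)
    # Time update, Eq. 9: predict the new sample and shift the older ones down.
    x_pred = [f(x_hat, w)] + x_hat[:-1]
    # Jacobian A of Eq. 10: first row is the gradient of f, the rest shift the state.
    A = [grad_f(x_hat, w)] + [[1.0 if j == i else 0.0 for j in range(M)]
                              for i in range(M - 1)]
    AP = [[sum(A[i][k] * P[k][j] for k in range(M)) for j in range(M)]
          for i in range(M)]
    P_pred = [[sum(AP[i][k] * A[j][k] for k in range(M)) for j in range(M)]
              for i in range(M)]
    P_pred[0][0] += var_v                      # B var_v B^T affects only entry (0, 0)
    # Measurement update, Eqs. 11-13, using C = [1 0 ... 0].
    s = P_pred[0][0] + var_n                   # C P^- C^T + var_n (a scalar)
    K = [P_pred[i][0] / s for i in range(M)]   # Eq. 11
    P_new = [[P_pred[i][j] - K[i] * P_pred[0][j] for j in range(M)]
             for i in range(M)]                # Eq. 12
    innovation = y - x_pred[0]
    x_new = [x_pred[i] + K[i] * innovation for i in range(M)]   # Eq. 13
    return x_new, P_new

# Arbitrary illustrative weights (M = 2) and a short run over made-up observations.
w = ([[1.2, -0.5], [0.3, 0.8]], [0.9, -0.4])
x_hat, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for y in [0.3, 0.1, -0.2, 0.4]:
    x_hat, P = ekf_step(x_hat, P, y, w, var_v=0.01, var_n=0.09)
```

The first entry of `x_hat` after each step is the filtered estimate of the current sample; the initial covariance of identity is an arbitrary starting point.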

The linearization in Equation 10 is required for propagation of the error covariances $ P_{\hat{\bf x},k}$ and $ P^-_{\hat{\bf x},k}$, but results in a suboptimal filter. In fact, for a stationary signal with fixed SNR, the direct nonlinear time-domain filtering of the previous subsection may provide a better estimate. However, with the EKF, the time-specific SNR enters through the parameter $ \sigma^2_{n,k}$. Thus the EKF avoids the need to train over a range of possible noise levels (only clean speech is needed for training). This enables the EKF to easily handle cases where the noise is nonstationary, or where a broad range of noise levels might be encountered.
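
To make the role of $ \sigma^2_{n,k}$ concrete, the first component of the measurement update can be written out. This is a short derivation from Equations 8, 11, and 13 only; the symbols $ p^-_k$ (the top-left entry of the prediction error covariance) and $ \kappa_k$ (the first entry of the Kalman gain $ K_k$) are shorthand introduced here for illustration. Because $ C$ selects the first state entry,

$\displaystyle \kappa_k$ $\displaystyle =$ $\displaystyle \frac{p^-_k}{p^-_k + \sigma^2_{n,k}},$ 
$\displaystyle \hat{x}_k$ $\displaystyle =$ $\displaystyle (1-\kappa_k)\,\hat{x}^-_k + \kappa_k\,y_k.$ 

When $ \sigma^2_{n,k}$ is small relative to $ p^-_k$ (high SNR), $ \kappa_k$ approaches 1 and the estimate follows the observation $ y_k$. When $ \sigma^2_{n,k}$ is large (low SNR), $ \kappa_k$ approaches 0 and the estimate follows the model prediction $ \hat{x}^-_k$.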

The downside of this is that both the variance of the speech innovations $ \sigma^2_{v,k}$ and the noise variance $ \sigma^2_{n,k}$ (or equivalently, the SNR) must be estimated on-line from the noisy data. The speech innovations variance $ \sigma_{v,k}^2$ may be estimated from the expression for the inverse Fourier transform of the signal power spectrum given an LPC model [16]. Alternatively, an expression may be derived by noting the relationship between the minimum mean-squared prediction error for clean speech versus speech with additive noise [17]. Estimating the innovations variance also requires knowledge of the SNR. Maximum-likelihood based methods for estimating these quantities can be derived from the Expectation Maximization (EM) approach [18]. Methods for estimating the SNR (or full spectrum of the noise $ n_k$), which are motivated from speech perception, are discussed in the next section on spectral methods.

An additional advantage of the predictive approach is that the state-space (or AR) model provides a more compact representation. The EKF forms a recursive structure, and thus requires fewer inputs than the direct filtering approach. This tradeoff is analogous to that seen for linear FIR (finite-impulse-response) filters versus IIR (infinite-impulse-response) filters.

While this EKF method provides a reasonable alternative to the direct filtering method, it should be noted that the approach as it appears in this section has not been reported on in the literature (perhaps because of its assumption of stationarity). Consequently, we do not include experimental results here. The basic technique, however, will form the basis for two important variations that also account for the nonstationarity of speech, to be discussed in Sections 14.4 and 14.5.



