

Noise-Regularized Adaptive Filtering

This approach involves a window-based, iterative process that is similar to the dual EKF method, but does not use an AR model for the speech. Rather, the approach uses the same architecture as the direct time-domain mapping filters of Section 14.2.1 and Figure 14.1, while still avoiding the need for a clean dataset to train the network.

Recall that the direct filtering approaches attempt to map a noisy vector of speech $ {\bf y}_k=\left[ y_{k-M/2} \ldots y_k \ldots
y_{k+M/2} \right]^T$ directly to an estimate of the speech signal $ \hat{x}_k =
f({\bf y}_k)$. The neural network, $ f(\cdot)$, is trained by minimizing the mean-square error (MSE) cost,

$\displaystyle \min_f E[(x_k-f({\bf y}_k))^2] \ ,$ (19) 

where the corresponding optimal solution is given by the conditional mean, $ \hat{x}_k = E[x_k \vert {\bf y}_k]$. (For illustrative purposes we consider single-output filters, though the approach can also be extended to the MIMO case.) In Section 14.2.1, training was performed by assuming the clean signal $ x_k$ is known. However, writing the noisy observation as $ y_k = x_k + n_k$, where $ n_k$ is the additive noise, consider the expansion:

$\displaystyle E[(x_k-f(\mathbf{y}_k))^2] = E[(y_k-f({\bf y}_k))^2] + 2E[n_k \cdot f({\bf y}_k)] - 2E[y_k\cdot n_k]+E[n^2_k]$ (20) 

The last two terms are independent of $ f(\cdot)$. Thus the optimal $ f(\cdot)$ can be found by minimizing the alternative cost,

$\displaystyle \min_f \left\{ E[(y_k-f({\bf y}_k))^2] +2E[n_k\cdot f({\bf y}_k)] \right\} \ .$ (21) 
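For reference, the expansion in (20) can be checked in a few lines. The steps below assume only the additive observation model $ y_k = x_k + n_k$ that (20) implies, so that $ x_k = y_k - n_k$:

```latex
\begin{align*}
E[(x_k - f(\mathbf{y}_k))^2]
  &= E[((y_k - f(\mathbf{y}_k)) - n_k)^2] \\
  &= E[(y_k - f(\mathbf{y}_k))^2] - 2E[n_k \cdot (y_k - f(\mathbf{y}_k))] + E[n_k^2] \\
  &= E[(y_k - f(\mathbf{y}_k))^2] + 2E[n_k \cdot f(\mathbf{y}_k)] - 2E[y_k \cdot n_k] + E[n_k^2]
\end{align*}
```

Only the first two terms of the last line involve $ f(\cdot)$, which is why they alone appear in (21).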

The advantage of this formulation is that the first term only depends on the observed noisy speech, whereas the expectation in the second term can be evaluated using only knowledge of the noise statistics. The clean speech is not needed. The first term can also be viewed as the cost associated with filtering the noisy signal to itself (training the network with noisy target data). The second term corresponds to the expected product between the noise and the neural network output, and acts to regularize the weights of the network to prevent the filter $ f(\cdot)$ from simply becoming the identity map. The resulting estimator is referred to as a Noise-Regularized Adaptive Filter (NRAF) [54].
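The role of the regularization term is easiest to see in the linear case, where cost (21) is quadratic in the filter weights and can be minimized in closed form. The following is a minimal sketch under stated assumptions, not the implementation of [54]: the noise is taken to be white, uncorrelated with the speech, and of known variance, so that $ E[n_k \cdot {\bf y}_k]$ is the noise variance at the centre tap and zero elsewhere. The function names, the toy sinusoidal "speech" signal, and all parameter values are hypothetical.

```python
import numpy as np

def sliding_windows(y, M):
    """Rows are [y[k-M/2], ..., y[k], ..., y[k+M/2]] for every valid k (M even)."""
    h = M // 2
    return np.stack([y[k - h:k + h + 1] for k in range(h, len(y) - h)])

def linear_nraf(y, noise_var, M=8):
    """Minimize cost (21) for a linear filter f(y_k) = w @ y_k.

    Assumption: white noise of known variance, uncorrelated with the
    clean signal, so E[n_k * y_k] = noise_var * (unit vector at the centre tap).
    """
    h = M // 2
    Y = sliding_windows(y, M)            # (N, M+1) noisy input windows
    target = y[h:len(y) - h]             # noisy centre samples y_k used as targets
    R = Y.T @ Y / len(Y)                 # sample estimate of E[y_k y_k^T]
    p = Y.T @ target / len(Y)            # sample estimate of E[y_k * y_k-window]
    r_n = np.zeros(M + 1)
    r_n[h] = noise_var                   # E[n_k * y_k-window] under the assumption above
    w = np.linalg.solve(R, p - r_n)      # zero-gradient point of the quadratic cost
    return w, Y @ w

def demo(seed=0, n=4000, noise_var=0.25, M=8):
    """Toy experiment: sinusoid in white noise (hypothetical stand-in for speech)."""
    rng = np.random.default_rng(seed)
    x = np.sin(0.3 * np.arange(n))                     # toy clean signal
    y = x + rng.normal(0.0, np.sqrt(noise_var), n)     # additive white noise
    w, x_hat = linear_nraf(y, noise_var, M)
    h = M // 2
    x_c, y_c = x[h:n - h], y[h:n - h]
    mse_in = np.mean((y_c - x_c) ** 2)                 # error of the raw noisy signal
    mse_out = np.mean((x_hat - x_c) ** 2)              # error after filtering
    return mse_in, mse_out, w
```

Calling `linear_nraf(y, 0.0)` drops the regularizer, and the solution collapses to a unit impulse at the centre tap, that is, the identity map described above. With the regularizer included, the weights are computed from the noisy data and the noise variance alone; the clean signal `x` is used in `demo` only to score the result.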

The evaluation of the regularization term cannot be performed analytically. Instead, an approximate solution is found using the Unscented Transformation (UT), a method for calculating the statistics of a random variable which undergoes a nonlinear transformation [55]. This method involves propagating a set of $ M$ vectors (based on the first- and second-order statistics of the signals) through the network at each time step, and then forming a weighted sample mean. The cost function (including the regularization term) can still be minimized with respect to the weights of the network by gradient-based descent using standard backpropagation. The only assumption made by this approach is that the second-order approximation to the regularization term using the UT is sufficiently accurate to allow convergence of the network to the true minimum MSE (this issue requires further investigation).
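The weighted sample mean used by the UT can be sketched as follows. This is an illustrative sketch only: it uses the common symmetric set of $ 2L+1$ sigma points for an $ L$-dimensional variable, which may differ from the point set used in [54], and it assumes, purely for illustration, zero-mean noise with known covariance added to a fixed window `x_bar` (a hypothetical stand-in for the clean speech window). The names `unscented_mean`, `regularizer_term`, and `kappa` are not from the source.

```python
import numpy as np

def unscented_mean(g, mean, cov, kappa=1.0):
    """Approximate E[g(z)] for a random vector z with the given mean and covariance.

    Uses 2L+1 symmetric sigma points (L = len(mean)); kappa is the usual
    scaling parameter and its default here is an arbitrary choice.
    """
    L = len(mean)
    S = np.linalg.cholesky((L + kappa) * cov)      # columns are sigma-point offsets
    points = [mean] + [mean + s for s in S.T] + [mean - s for s in S.T]
    weights = np.full(2 * L + 1, 1.0 / (2.0 * (L + kappa)))
    weights[0] = kappa / (L + kappa)               # weight of the central point
    return sum(w * g(p) for w, p in zip(weights, points))

def regularizer_term(f, x_bar, noise_cov):
    """UT estimate of E[n_c * f(x_bar + n)], with n_c the centre noise sample.

    Assumption: n is zero-mean with covariance noise_cov, and x_bar is held fixed.
    """
    c = len(x_bar) // 2
    g = lambda n: n[c] * f(x_bar + n)              # product of noise and filter output
    return unscented_mean(g, np.zeros(len(x_bar)), noise_cov)

# Example with a small nonlinear "network" (hypothetical weights).
rng = np.random.default_rng(0)
v = rng.normal(size=5)
estimate = regularizer_term(lambda y: np.tanh(v @ y), rng.normal(size=5), 0.1 * np.eye(5))
```

For a linear filter the product is quadratic in the noise, and this sigma-point mean reproduces the exact expectation; for a nonlinear network it is only an approximation, which is the accuracy question raised above.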

As in the Dual EKF approach, to address the nonstationarity of speech, the noisy data is windowed into short overlapping frames and a new filter is designed for each frame. The method also requires estimation of the noise statistics.
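The frame-by-frame processing can be sketched as below. The text does not say how the outputs of overlapping frames are recombined, so averaging them is an assumption of this sketch; `enhance_by_frames` and the callable `design_and_apply` (which would fit a filter to one noisy frame and return the filtered frame) are hypothetical names.

```python
import numpy as np

def enhance_by_frames(y, frame_len, hop, design_and_apply):
    """Process overlapping frames independently and average the overlapping outputs.

    design_and_apply: callable taking one noisy frame and returning an
    estimate of the same length (for example, a filter trained on that frame).
    """
    out = np.zeros(len(y))
    count = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len]
        out[start:start + frame_len] += design_and_apply(frame)
        count[start:start + frame_len] += 1
    # Samples not covered by any complete frame are left at zero.
    return out / np.maximum(count, 1)
```

A call such as `enhance_by_frames(y, 200, 100, my_filter)` uses 50% overlap; the frame length and hop are free parameters here and are not taken from the experiments reported in this section.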

We report results using samples from the OGI Speech Enhancement Assessment Resource (SpEAR) [56]. The experimental setup consisted of 8 kHz speech, 600-point frames (overlap of 4), a filter window length of 25, and two-layer feedforward networks with $ 11$ hidden nodes. In addition, the raw noisy speech window was first embedded using a fixed Karhunen-Loève transform (KLT) [50] with an embedding dimension of 19. These values were found empirically by cross-validation. Figure 14.11 summarizes the performance on a sample speech sentence for a number of different noise sources. The nonlinear NRAF algorithm is compared to linear NRAF (neural network replaced by a linear FIR filter) as well as standard spectral subtraction. The nonlinear NRAF filter shows a clear improvement.

Figure 14.11: Comparison of segmental SNR performance for different noise sources: 1) white (input SNR = 6.08, segmental SNR = 1.55), 2) pink (input SNR = 4.34, segmental SNR = 0.3), 3) factory (input SNR = 5.16, segmental SNR = 1.07), 4) F16 (input SNR = 4.61, segmental SNR = 0.46). Algorithms: a) standard implementation of spectral subtraction with 256-point frames (Duke Speech Processing Toolkit), b) linear NRAF with KLT, c) nonlinear NRAF. Note that for these experiments a 3 dB improvement in segmental SNR corresponds to approximately a 5 dB improvement in SNR. (Segmental SNR has been shown to correlate more closely with subjective quality evaluations [4].)
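The two measures quoted in the caption can be computed as in the sketch below. Segmental SNR is the average of per-frame SNR values in dB; the frame length and the clamping range used here are common choices but are assumptions of this sketch, since the settings used for the figure are not given.

```python
import numpy as np

def snr_db(clean, estimate):
    """Global SNR in dB of an estimate relative to the clean reference."""
    err = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))

def segmental_snr_db(clean, estimate, frame_len=200, clip=(-10.0, 35.0)):
    """Mean of per-frame SNRs in dB over non-overlapping frames.

    frame_len and clip are assumed values; pass clip=None to disable clamping.
    """
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = c - estimate[start:start + frame_len]
        ratio = np.sum(c ** 2) / (np.sum(e ** 2) + 1e-12)   # guard against zero error
        values.append(10.0 * np.log10(ratio + 1e-12))       # guard against silent frames
    values = np.array(values)
    if clip is not None:
        values = np.clip(values, clip[0], clip[1])          # limit the effect of outlier frames
    return float(np.mean(values))
```

Because each frame contributes equally regardless of its energy, quiet stretches of speech weigh as much as loud ones, which is one reason the two measures can differ on the same output.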

One appealing aspect of the method is that it unifies a number of traditional approaches while extending them to the nonlinear domain. In the case where the neural network is replaced by a linear filter, the resulting estimator reduces to a simple time-domain implementation of spectral subtraction. For four-layer networks with multiple outputs (block estimation), the number of hidden neurons can be made less than the number of outputs, forcing an embedding of the input space. In this case, the method has the potential to provide nonlinear component analysis, in contrast to the linear embedding used in traditional signal subspace approaches. An additional research direction would be to use larger networks that have been pre-trained using the clean training set scenario, and then use the NRAF approach to tune these networks on-line. In summary, the NRAF approach has only recently been proposed and, while promising, requires further investigation to fully characterize its potential.



