This approach is a window-based, iterative process similar to the Dual EKF method, but it does not use an AR model for the speech. Instead, it uses the same architecture as the direct time-domain mapping filters of Section 14.2.1 and Figure 14.1, while still avoiding the need for a clean dataset to train the network.
Recall that the direct filtering approaches attempt to map a noisy vector of speech
directly to an estimate of the speech signal. The neural network is trained
by minimizing the mean-square error (MSE) cost,
[Equation (19)]
[Equation (20)]
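To see how an MSE cost of this kind can be evaluated without clean speech, consider the following sketch. It assumes an additive model $y_k = x_k + n_k$ with zero-mean noise $n_k$ of variance $\sigma_n^2$ that is independent of the speech $x_k$, and it writes the network as $f(\mathbf{y}_k)$ acting on the noisy window $\mathbf{y}_k$. The notation and assumptions are illustrative and are not taken from Eqs. (19) and (20).

```latex
\begin{aligned}
E\big[(x_k - f(\mathbf{y}_k))^2\big]
  &= E\big[(y_k - n_k - f(\mathbf{y}_k))^2\big] \\
  &= E\big[(y_k - f(\mathbf{y}_k))^2\big] - 2E[n_k y_k]
     + 2E\big[n_k f(\mathbf{y}_k)\big] + \sigma_n^2 \\
  &= E\big[(y_k - f(\mathbf{y}_k))^2\big]
     + 2E\big[n_k f(\mathbf{y}_k)\big] - \sigma_n^2 ,
\end{aligned}
```

The last step uses $E[n_k y_k] = E[n_k x_k] + E[n_k^2] = \sigma_n^2$. Under these assumptions the first term involves only noisy data, the last term is a constant, and the cross term $E[n_k f(\mathbf{y}_k)]$ acts as a regularizer that has no closed form when $f$ is nonlinear.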
The evaluation of the regularization term cannot be performed
analytically. Instead, an approximate solution is found using the
Unscented Transformation (UT), a method for calculating the
statistics of a random variable which undergoes a nonlinear
transformation [55]. This method involves propagating a set of
vectors (based on the first- and second-order statistics of the signals)
through the network at each time step, and then forming a weighted
sample mean. The cost function (including the regularization term) can
still be minimized with respect to the weights of the network by
gradient-based descent using standard backpropagation. The
only assumption made by this approach is that the second-order
approximation to the regularization term using the UT is sufficiently
accurate to allow convergence of the network to the true minimum MSE
(this issue requires further investigation).
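The sigma-point propagation described above can be sketched as follows. This is a minimal illustration, not the implementation behind the reported results. It assumes the common symmetric set of 2L+1 sigma points with a scaling parameter `kappa`, and the function name `unscented_mean` and its arguments are hypothetical. In NRAF the nonlinearity `f` would be the network, and the mean and covariance would come from the first- and second-order statistics of the signals.

```python
import numpy as np

def unscented_mean(f, mean, cov, kappa=1.0):
    """Approximate E[f(x)] for x with the given mean and covariance.

    Assumes the symmetric 2L+1 sigma-point set; `cov` must be
    positive definite so that the Cholesky factor exists.
    """
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    L = mean.size

    # Columns of S satisfy S @ S.T == (L + kappa) * cov.
    S = np.linalg.cholesky((L + kappa) * cov)

    # Rows are sigma points: the mean, then mean +/- each column of S.
    sigma = np.vstack([mean, mean + S.T, mean - S.T])

    # Weights sum to one; the centre point gets kappa / (L + kappa).
    w = np.full(2 * L + 1, 1.0 / (2.0 * (L + kappa)))
    w[0] = kappa / (L + kappa)

    # Propagate every sigma point through f and form the weighted mean.
    vals = np.array([f(p) for p in sigma])
    return np.tensordot(w, vals, axes=1)
```

For a quadratic function the weighted sample mean is exact, for example $E[\mathbf{x}^T\mathbf{x}] = \mathbf{m}^T\mathbf{m} + \mathrm{tr}(\mathbf{P})$, which makes a convenient sanity check before substituting a network for `f`.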
As in the Dual EKF approach, to address the nonstationarity of speech,
the noisy data is windowed into short overlapping frames and a new
filter is designed for each frame.
The method also requires estimation of the noise statistics.
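The windowing step can be sketched as below. The helper name `frame_signal` and the parameters `frame_len` and `hop` are hypothetical, and the sketch covers only how the noisy signal is cut into overlapping frames, not how a filter is trained on each frame.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames.

    Consecutive frames start `hop` samples apart, so they overlap by
    frame_len - hop samples. Trailing samples that do not fill a
    complete frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    if x.size < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (x.size - frame_len) // hop
    # Each row holds the sample indices of one frame.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

Each row of the result is one frame, so a separate filter could be fitted per row, for example by looping over `frame_signal(y, frame_len, hop)`.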
We report results using samples from the OGI Speech Enhancement
Assessment Resource (SpEAR) [56]. The experimental setup
consisted of 8 kHz speech, 600-point frames (overlap of 4), filter
window length of 25, and two-layer feedforward networks with
hidden nodes. In addition, the raw noisy speech window was first
embedded using a fixed Karhunen-Loève transform (KLT) [50] with
an embedding dimension of 19. These values were found empirically by
cross-validation. Figure 14.11 summarizes the performance on
a sample speech sentence for a number of different noise sources. The
nonlinear NRAF algorithm is compared to linear NRAF (neural network
replaced by linear FIR filter) as well as standard spectral
subtraction. The nonlinear NRAF filter shows a clear improvement.
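The KLT embedding of the input window mentioned above can be sketched as a projection onto the leading eigenvectors of a sample correlation matrix. How the fixed basis was obtained for these experiments is not stated, so estimating it from a set of windows is an assumption here, and the names `klt_basis` and `klt_embed` are hypothetical.

```python
import numpy as np

def klt_basis(windows, dim):
    """Return the `dim` leading eigenvectors of the sample correlation
    matrix of `windows` (one window per row, no mean removal)."""
    X = np.asarray(windows, dtype=float)
    C = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:dim]     # indices of the largest ones
    return evecs[:, order]

def klt_embed(windows, basis):
    """Project each window onto the fixed KLT basis."""
    return np.asarray(windows, dtype=float) @ basis
```

With a window length of 25 and an embedding dimension of 19, `klt_basis(windows, 19)` would return a 25-by-19 matrix with orthonormal columns, and `klt_embed` would map each 25-sample window to a 19-dimensional network input.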
One appealing aspect of the method is that it unifies a number of traditional approaches while extending them to the nonlinear domain. When the neural network is replaced by a linear filter, the resulting estimator reduces to a simple time-domain implementation of spectral subtraction. For four-layer networks with multiple outputs (block estimation), the number of hidden neurons can be made smaller than the number of outputs, forcing an embedding of the input space. In this case, the method has the potential to provide nonlinear component analysis, in contrast to the linear embedding used in traditional signal subspace approaches. An additional research direction would be to use larger networks that have been pre-trained using the clean training set scenario, and then to use the NRAF approach to tune these networks on-line. In summary, the NRAF approach has only recently been proposed and, while promising, requires further investigation to fully characterize its potential.
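The linear case can be made concrete with a short derivation. As a sketch, assume additive zero-mean noise independent of the speech, $y_k = x_k + n_k$, a linear filter $f(\mathbf{y}_k) = \mathbf{w}^T\mathbf{y}_k$, and a cost consisting of the error measured on noisy data plus a noise cross term, which under these assumptions equals the clean-speech MSE up to an additive constant. The symbols $\mathbf{R}_y$, $\mathbf{r}_y$, and $\mathbf{r}_n$ are introduced here for illustration only.

```latex
\begin{aligned}
J(\mathbf{w}) &= E\big[(y_k - \mathbf{w}^T\mathbf{y}_k)^2\big] + 2\,\mathbf{w}^T\mathbf{r}_n,
  \qquad \mathbf{r}_n = E[n_k \mathbf{y}_k] = E[n_k \mathbf{n}_k], \\
\nabla_{\mathbf{w}} J &= -2\,\mathbf{r}_y + 2\,\mathbf{R}_y\mathbf{w} + 2\,\mathbf{r}_n = 0,
  \qquad \mathbf{R}_y = E[\mathbf{y}_k\mathbf{y}_k^T],\quad \mathbf{r}_y = E[y_k \mathbf{y}_k], \\
\mathbf{w} &= \mathbf{R}_y^{-1}\,(\mathbf{r}_y - \mathbf{r}_n).
\end{aligned}
```

Here $\mathbf{r}_y - \mathbf{r}_n$ is an estimate of the clean-speech correlation obtained by subtracting noise correlations from noisy ones. This is the time-domain counterpart of subtracting a noise power spectrum from the noisy power spectrum.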