Hervé Bourlard & Nelson Morgan
Faculté Polytechnique de Mons, Mons, Belgium
International Computer Science Institute, Berkeley, California, USA
There are several motivations for the use of connectionist systems in human language technology. Some of these are:
In the following, we briefly review some of the typical functional
building blocks for HLT and show how connectionist techniques
could be used to improve them. In a later subsection, we discuss a
particular instance with which we have direct experience. Finally, in
the last subsection, we discuss some key research problems.
Feature extraction consists of transforming the raw input data into a concise representation that contains the relevant information and is robust to variations. For speech, for instance, the waveform is typically translated into some function of a short-term spectrum. For handwriting recognition, pixels are sometimes complemented by dynamic information before they are translated into task-relevant features.
It would be desirable to automatically determine the parameters or
features for a particular HLT task.
In some limited cases, it appears to be possible to automatically
derive features from raw data, given significant application-specific
constraints. This is the case for the AT&T handwritten zip code recognizer [lCBD+90], in which a simple
convolutional method was used to extract important features such as
lines and edges that are used for classification by an ANN.
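
As a minimal sketch of convolutional feature extraction (not the architecture of the cited recognizer), the following slides a small fixed kernel over a binary image and produces a feature map that responds to vertical edges. The image, the kernel values, and the function name `convolve2d` are illustrative assumptions; in a trained network such kernels would be learned from data rather than set by hand.

```python
def convolve2d(image, kernel):
    """Valid-mode 2-D cross-correlation of a list-of-lists image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x5 binary image: dark on the left, bright on the right,
# so there is one vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-set kernel (an assumption): responds to a left-to-right intensity increase.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is nonzero only at the position of the edge, which is the kind of local feature a subsequent classification layer could use.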
Connectionist networks have also been used to investigate a number of other approaches to unsupervised data analysis, including linear dimension reduction (e.g., [BK88], in which it was shown that feedforward networks used in auto-associative mode actually perform principal component analysis (PCA)), non-linear dimension reduction [Oja91,KL94], and reference-point-based classifiers, such as vector quantization and topological maps [Koh88].
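
The link between auto-association and PCA can be illustrated with a small experiment. The sketch below is not the analysis of [BK88]; it simply trains a linear 2-1-2 network by gradient descent to reproduce its input and compares the learned decoder direction with the principal axis of the data. The data set, initial weights, learning rate, and all function names are assumptions chosen for illustration.

```python
import math

# Hypothetical zero-mean 2-D data. Its covariance is [[5, 4], [4, 5]], whose
# principal axis is (1, 1)/sqrt(2) with eigenvalue 9; the other eigenvalue is 1.
data = [(3.0, 3.0), (-3.0, -3.0), (1.0, -1.0), (-1.0, 1.0)]


def train_linear_autoencoder(data, steps=3000, lr=0.01):
    """Gradient descent on the mean squared reconstruction error of a
    linear 2 -> 1 -> 2 network: h = w . x, x_hat = v * h."""
    w = [0.1, -0.05]  # encoder weights (arbitrary small start)
    v = [0.05, 0.1]   # decoder weights (arbitrary small start)
    n = len(data)
    for _ in range(steps):
        gw = [0.0, 0.0]
        gv = [0.0, 0.0]
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]                # hidden activation
            e = (x[0] - v[0] * h, x[1] - v[1] * h)       # reconstruction error
            ve = v[0] * e[0] + v[1] * e[1]
            for k in range(2):
                gv[k] += -2.0 * e[k] * h / n             # d(loss)/d(v_k)
                gw[k] += -2.0 * ve * x[k] / n            # d(loss)/d(w_k)
        for k in range(2):
            w[k] -= lr * gw[k]
            v[k] -= lr * gv[k]
    return w, v


def reconstruction_error(data, w, v):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        h = w[0] * x[0] + w[1] * x[1]
        total += (x[0] - v[0] * h) ** 2 + (x[1] - v[1] * h) ** 2
    return total / len(data)


w, v = train_linear_autoencoder(data)

# Cosine between the decoder weights and the known principal axis (1, 1)/sqrt(2).
cosine = (v[0] + v[1]) / (math.sqrt(2.0) * math.hypot(v[0], v[1]))
error = reconstruction_error(data, w, v)
print("alignment with principal axis:", abs(cosine))
print("reconstruction error:", error)
```

For this data the alignment should approach 1 and the reconstruction error should approach the variance left outside the principal axis, which is what projecting onto the first principal component would give.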
It is, however, often better to use any available task-specific knowledge to reduce the amount of information the network must process and to make its task easier. As a consequence, while automatic feature extraction is a desirable goal, most ASR systems use neural networks to classify speech sounds from the output of standard signal processing tools (such as the Fourier transform) or from features selected by the experimenter (e.g., [CFGJ91]).
Although ANNs have been shown to be quite powerful in static pattern classification, their formalism is not very well suited to address most issues in HLT. Indeed, in most of these cases, patterns are primarily sequential and dynamical. For example, in both ASR and handwriting recognition, there is a time dimension or a sequential dimension which is highly variable and difficult to handle directly in ANNs. We note however that ANNs have been successfully applied to time series prediction in several task domains [WG94]. HLT presents several challenges. In fact, many HLT problems can be formulated as follows: how can an input sequence (e.g., a sequence of spectra in the case of speech and a sequence of pixel vectors in the case of handwriting recognition) be properly explained in terms of an output sequence (e.g., sequence of phonemes, words or sentences in the case of ASR or a sequence of written letters, words or phrases in the case of handwriting recognition) when the two sequences are not synchronous (since there usually are multiple inputs associated with each pronounced or written word)?
Several neural network architectures have been developed for (time)
sequence classification, including:
In the case of ASR, all of these models have been shown to yield good performance (sometimes better than HMMs) on short isolated speech units. By their recurrent aspect and their implicit or explicit temporal memory they can perform some kind of integration over time. This conclusion remains valid for related HLT problems. However, neural networks by themselves have not been shown to be effective for large scale recognition of continuous speech or cursive handwriting. The next section describes a new approach that combines ANNs and HMMs for large vocabulary continuous speech recognition.
Most commonly, the basic technological approach for automatic speech
recognition (ASR) is statistical pattern recognition using
hidden Markov models (HMMs), as presented in section 1.5 and elsewhere.
The HMM formalism has also been applied to other HLT
problems such as handwriting recognition [CKZ94].
Recently, a new formalism for classifiers that is particularly well suited to sequential patterns (like speech and handwritten text), and that combines the respective properties of ANNs and HMMs, was proposed and successfully used for difficult continuous speech recognition tasks [BM93]. This system, usually referred to as the hybrid HMM/ANN, combines HMM sequential modeling structure with ANN pattern classification. Although this approach is quite general and was recently also used for handwriting recognition [SGH94,SGH95] and speaker verification [NL94], the following description mainly applies to ASR problems.
As in standard HMMs, hybrid HMM/ANN systems applied to ASR use a Markov process to temporally model the speech signal. The connectionist structure is used to model the local feature vector conditioned on the Markov process. For the case of speech this feature vector is local in time, while in the case of handwritten text it is local in space. This hybrid is based on the theoretical result that ANNs satisfying certain regularity conditions can estimate class (posterior) probabilities for input patterns [BM93]; i.e., if each output unit of an ANN is associated with one possible HMM state, it is possible to train the ANN to generate posterior probabilities of the states conditioned on the input. These probabilities can then be used, after some modifications [BM93], as local probabilities in HMMs.
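
The following sketch shows how state posteriors might be plugged into HMM decoding. The text does not spell out the modifications applied to the posteriors; the sketch assumes one commonly described choice, dividing each posterior by the prior probability of its state, which by Bayes' rule gives a quantity proportional to the emission likelihood. The posterior values stand in for ANN outputs and, like the priors, transition probabilities, and function names, are made up for illustration.

```python
import math


def scaled_likelihoods(posteriors, priors):
    """Divide each state posterior P(q|x) by the state prior P(q).
    By Bayes' rule the result is proportional to the likelihood p(x|q)."""
    return [[p / priors[q] for q, p in enumerate(frame)] for frame in posteriors]


def viterbi(emissions, transitions, initial):
    """Most likely state sequence, computed in the log domain."""
    n_states = len(initial)
    score = [math.log(initial[q]) + math.log(emissions[0][q])
             for q in range(n_states)]
    backptr = []
    for frame in emissions[1:]:
        new_score, ptr = [], []
        for q in range(n_states):
            # Best predecessor of state q at this frame.
            best = max(range(n_states),
                       key=lambda r: score[r] + math.log(transitions[r][q]))
            new_score.append(score[best] + math.log(transitions[best][q])
                             + math.log(frame[q]))
            ptr.append(best)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda q: score[q])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical ANN outputs: one row per frame, one column per HMM state.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
priors = [0.6, 0.4]                       # assumed relative state frequencies
transitions = [[0.7, 0.3], [0.1, 0.9]]    # assumed HMM transition probabilities
initial = [0.9, 0.1]                      # assumed initial state probabilities

emissions = scaled_likelihoods(posteriors, priors)
path = viterbi(emissions, transitions, initial)
print(path)
```

For these made-up numbers the decoded path starts in state 0 and switches to state 1 partway through the sequence. The network supplies only the local, per-frame scores, while the Markov structure handles the sequential constraints.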
Advantages of the HMM/ANN hybrid for speech recognition include:
In recent years these hybrid approaches have been compared with the best classical HMM approaches on a number of HLT tasks. In cases where the comparison was controlled (e.g., where the same system was used in both cases except for the means of estimating emission probabilities), the hybrid approach performed better when the number of parameters was comparable, and about the same in some cases in which the classical system used many more parameters. The hybrid system was also quite efficient in terms of run-time CPU and memory requirements. Evidence for this can be found in a number of sources, including:
94] in which results on Resource Management (a standard reference database for testing ASR
systems) are presented, and
More generally, though, complete systems achieve their performance through detailed design, and comparisons are not predictable on the basis of the choice of the emission probability estimation algorithm alone.
ANNs can also be incorporated in a hybrid HMM system by training the former to do nonlinear prediction [Lev93], leading to a nonlinear version of what is usually referred to as autoregressive HMMs [JR85].
Connectionist approaches have also been applied to natural language processing (NLP). As in the acoustic case, NLP requires the sequential processing of symbol sequences (word sequences). For example, HMMs are a particular case of a finite state machine (FSM), and the techniques used to simulate or improve acoustic HMMs are also valid for language models. As a consequence, much of the work on connectionist NLP has used ANNs to simulate standard language models like FSMs.
In 1969, [MP69] showed that ANNs can be used to simulate an FSM.
More recently, several works showed that recurrent networks can
simulate or validate regular and context-free grammars (CFGs). For
instance, in [LSC+90], a recurrent network feeding back output
activations to the previous (hidden) layer was used to validate a
string of symbols generated by a regular grammar. In [SCG+90], this
was extended to CFGs. Structured connectionist parsers were developed
by a number of researchers, including [Fan85] and [Jai92]. The latter
parser was incorporated in a speech-to-speech translation system (for
a highly constrained conference-registration task) that was described
in [WJM+92].
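
To make the idea of a network simulating a finite state machine concrete, here is a small sketch. It is not the construction given in [MP69]: the weights are set by hand, and the two-state parity machine is an example chosen for illustration. A single state unit is fed back at every input symbol, and two hidden threshold units compute the next state.

```python
def step(z):
    """Hard-threshold unit: outputs 1 when its net input is positive."""
    return 1 if z > 0 else 0


def parity_network(bits):
    """Recurrent network of threshold units that tracks the parity of the
    1s seen so far. The state unit s is fed back at every symbol."""
    s = 0  # state unit: 0 means an even number of 1s so far
    for x in bits:
        h_or = step(s + x - 0.5)       # fires if s or x is active
        h_and = step(s + x - 1.5)      # fires only if both are active
        s = step(h_or - h_and - 0.5)   # exclusive-or of s and x: the next state
    return s


def parity_fsm(bits):
    """Reference two-state finite state machine for the same task."""
    state = 0
    for x in bits:
        state = state ^ x
    return state


example = [1, 0, 1, 1]
print(parity_network(example), parity_fsm(example))
```

The network and the reference machine agree on every input string, because the three threshold units together implement the machine's transition function exactly.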
Neural networks have also been used to model semantic relations. There have been many experiments of this kind over the years. For example, [Elm88] showed that neural networks can be trained to learn pronoun reference. He used a partially recurrent network for this purpose, consisting of a feedforward MLP with feedback from the hidden layer back into the input.
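
The architecture described above, a feedforward network whose hidden activations are fed back as additional input, can be sketched as follows. This shows only the forward pass with arbitrary, untrained weights; it is not the network or the task of [Elm88], and the weight values and function name are assumptions.

```python
import math


def elman_forward(inputs, w_in, w_ctx, w_out):
    """Forward pass of a partially recurrent network. At each step the
    hidden layer sees the current input plus a copy of its own previous
    activations (the context units)."""
    n_hidden = len(w_in)
    context = [0.0] * n_hidden
    outputs = []
    for x in inputs:
        hidden = []
        for j in range(n_hidden):
            net = sum(w_in[j][i] * x[i] for i in range(len(x)))
            net += sum(w_ctx[j][k] * context[k] for k in range(n_hidden))
            hidden.append(math.tanh(net))
        outputs.append(sum(w_out[j] * hidden[j] for j in range(n_hidden)))
        context = hidden  # feedback: hidden activations become the next context
    return outputs


# Arbitrary illustrative weights (assumptions, not trained values).
w_in = [[0.5, -0.3], [0.2, 0.8]]    # input -> hidden
w_ctx = [[0.1, 0.4], [-0.6, 0.3]]   # context -> hidden
w_out = [1.0, -1.0]                 # hidden -> single output

# Two sequences that end with the same symbol but differ in their history.
out_a = elman_forward([(1, 0), (0, 1)], w_in, w_ctx, w_out)
out_b = elman_forward([(0, 1), (0, 1)], w_in, w_ctx, w_out)
print(out_a[-1], out_b[-1])
```

The final outputs differ even though the last input symbol is identical, because the context units carry information about earlier symbols. This dependence on history is what a task such as resolving a pronoun to an earlier word requires.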
The work reported so far has focused on simulating standard approaches with neural networks, and it is not yet known whether this can be helpful in the integration of different knowledge sources into a complete HLT system. Generally speaking, connectionist language modeling and NLP have thus far played a relatively small role in large or difficult HLT tasks.
There are many open problems in applying connectionist approaches to HLT, and in particular for ASR, including:
These are all long-term research issues. Many intermediate problems will have to be solved before anything like an optimal solution can be found.