The Free Speech Journal, Issue #5 (1997)


Published (10/22/97).
©1997 All rights reserved.

Phoneme Probability Estimation with Dynamic Sparsely Connected Artificial Neural Networks

Nikko Ström, nikko@speech.kth.se
Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
Centre for Speech Technology, KTH, Stockholm, Sweden

Abstract

This paper presents new methods for training large neural networks for phoneme probability estimation. An architecture combining time-delay windows and recurrent connections is used to capture the important dynamic information of the speech signal. Because the number of connections in a fully connected recurrent network grows super-linearly with the number of hidden units, schemes for sparse connection and connection pruning are explored. It is found that sparsely connected networks outperform their fully connected counterparts with an equal number of connections. The implementation of the combined architecture and training scheme is described in detail. The networks are evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT database, and for word recognition on the WAXHOLM database. The achieved phone error rate, 27.8%, for the standard 39-phoneme set on the core test set of the TIMIT database is in the range of the lowest reported. All training and simulation software used is made freely available by the author, and detailed information about the software and the training process is given in an Appendix.
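
As a rough illustration of the super-linear growth mentioned above, the following sketch counts the weights in a network with one recurrent hidden layer. The function name count_connections, the connectivity parameter, and the layer sizes (13 inputs, 39 outputs) are assumptions chosen for illustration and are not taken from the paper. Biases and time-delay windows are ignored.

```python
def count_connections(n_in, n_hidden, n_out, connectivity=1.0):
    """Count weights in a network with one recurrent hidden layer.

    Illustrative assumption: input->hidden, hidden->hidden (recurrent) and
    hidden->output are each fully connected, then thinned uniformly to the
    given connectivity fraction. Biases and delay windows are not counted.
    """
    full = n_in * n_hidden + n_hidden * n_hidden + n_hidden * n_out
    return round(full * connectivity)


if __name__ == "__main__":
    # Hypothetical layer sizes; the hidden-to-hidden term grows quadratically.
    for n_hidden in (100, 200, 400):
        dense = count_connections(13, n_hidden, 39)
        sparse = count_connections(13, n_hidden, 39, connectivity=0.25)
        print(f"hidden={n_hidden:4d}  full={dense:7d}  25% connected={sparse:7d}")
```

Under these assumptions, doubling the hidden layer roughly quadruples the recurrent term. A sparsely connected network can therefore use a larger hidden layer within the same connection budget, which is the trade-off the paper investigates.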

Table of Contents

Abstract
Table of Contents
1. Introduction
2. Basic Theory
2.1. Feed-forward networks
2.2. Recurrent connections and time-delay
2.3. Back-propagation through time
2.4. Weight updating scheme
2.5. Weight initialization
2.6. Interpretation of the output activation values
2.7. The "softmax" output activation function
3. Phoneme probability estimation
3.1. Input feature representation
3.2. Delta coefficients
3.3. Input normalization
3.4. Network topology
3.5. Dynamic decoding
4. Pruning and sparse connection
4.1. Connection pruning
4.2. Sparse connection
5. Recognition results
5.1. Phoneme recognition results on the TIMIT database
5.1.1. Experiments with varying connectivity
5.1.2. Decoupling the output units
5.1.3. Varying the hidden layer size
5.1.4. Connection pruning in trained networks
5.1.5. Computational considerations
5.1.6. Test of the interpretation of activities as a posteriori probabilities
5.2. Word recognition results on the WAXHOLM human/machine dialogue task
6. Conclusions
7. Acknowledgements
8. References
Appendix: The NICO neural network toolkit
A.1. Structuring the database
A.2. Feature extraction
A.3. Generation of phoneme targets
A.4. Specifying the network structure
A.5. Training
A.6. Frame-level evaluation
A.7. Connection pruning