This paper presents new methods for training large neural networks for
phoneme probability estimation. An architecture combining time-delay windows
and recurrent connections is used to capture the important dynamic information
of the speech signal. Because the number of connections in a fully connected
recurrent network grows super-linearly with the number of hidden units, schemes
for sparse connection and connection pruning are explored. It is found
that sparsely connected networks outperform their fully connected counterparts
with an equal number of connections. The implementation of the combined
architecture and training scheme is described in detail. The networks are
evaluated in a hybrid HMM/ANN system for phoneme recognition on the TIMIT
database, and for word recognition on the WAXHOLM database. The achieved
phone error rate, 27.8%, for the standard 39-phoneme set on the core test set
of the TIMIT database is in the range of the lowest reported. All training
and simulation software used is made freely available by the author, and
detailed information about the software and the training process is given
in an Appendix.
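
The super-linear growth in connections can be illustrated with a small sketch. It counts only hidden-to-hidden connections, where a fully connected recurrent layer with H units has H*H of them, and it builds a sparse layer with more units under the same connection budget. The function names and the uniform random choice of connections are assumptions made for illustration; the abstract does not specify the sparse connection or pruning scheme actually used.

```python
import random


def full_recurrent_connections(n_hidden: int) -> int:
    """Hidden-to-hidden connections when every unit feeds every unit."""
    return n_hidden * n_hidden


def sparse_recurrent_connections(n_hidden: int, n_connections: int, seed: int = 0):
    """Pick n_connections distinct (source, target) pairs uniformly at random.

    Hypothetical scheme for illustration only; not the paper's method.
    """
    rng = random.Random(seed)
    all_pairs = [(src, dst) for src in range(n_hidden) for dst in range(n_hidden)]
    return set(rng.sample(all_pairs, n_connections))


# Doubling the hidden units quadruples the fully connected budget.
budget_100 = full_recurrent_connections(100)
budget_200 = full_recurrent_connections(200)

# A sparse layer with 200 units that uses only the budget of a full 100-unit layer.
sparse = sparse_recurrent_connections(200, budget_100)
density = len(sparse) / budget_200
```

Here the 200-unit sparse layer has the same number of recurrent connections as the fully connected 100-unit layer, which is the kind of equal-budget comparison described above.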