CSLU has recently increased its activities in the area of speech generation and is now focusing on two key areas of research and development: small-footprint speech synthesis and very high quality, application-specific synthesis. By way of introduction, speech generation is generally accomplished by one of three methods: general-purpose concatenative synthesis, corpus-based synthesis, or phrase splicing.
The strengths and weaknesses of these methods are complementary. In terms of speech quality and scope, general-purpose concatenative synthesis can handle any input sentence but generally produces mediocre quality. Corpus-based synthesis can produce very high quality, but only if its speech corpus contains the right phoneme sequences with the right prosody for a given input sentence. If the corpus contains the right phonemes but with the wrong prosody, the result may sound quite good locally (i.e., within the span of a phoneme sequence that was available in the corpus), but the utterance as a whole may have a bizarre sing-song quality with confusing accelerations and decelerations. Phrase splicing methods produce completely natural speech, but can only say the pre-stored phrases or combinations of sentence frames and slot items; even then, naturalness can suffer if the slot items are not carefully matched to the sentence frames in terms of prosody.
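The frame-and-slot idea behind phrase splicing can be sketched in a few lines. This is a minimal illustration only, not CSLU's implementation: the frame name, slot names, and recording file names are all hypothetical, and a real system would also have to match the prosody of each slot item to its frame.

```python
# Hypothetical inventory of pre-recorded material. All names and file
# paths below are illustrative assumptions, not taken from any real system.
FRAMES = {
    # A sentence frame is a sequence of fixed recordings and {slot} markers.
    "flight_status": ["your_flight_to.wav", "{city}", "departs_at.wav", "{time}"],
}
SLOTS = {
    "city": {"Boston": "city_boston.wav", "Portland": "city_portland.wav"},
    "time": {"9:00": "time_0900.wav", "14:30": "time_1430.wav"},
}


def splice(frame_name, **slot_values):
    """Return the ordered list of recordings to play for one sentence."""
    playlist = []
    for piece in FRAMES[frame_name]:
        if piece.startswith("{") and piece.endswith("}"):
            slot = piece[1:-1]
            value = slot_values[slot]
            recordings = SLOTS[slot]
            if value not in recordings:
                # The scope limit of phrase splicing: nothing outside the
                # pre-stored slot items can be spoken.
                raise KeyError(f"no recording for {slot}={value!r}")
            playlist.append(recordings[value])
        else:
            playlist.append(piece)
    return playlist
```

Calling `splice("flight_status", city="Boston", time="9:00")` yields a playlist of four recordings, while a city that was never recorded raises an error. That failure case is why open-ended vocabularies such as names are hard for this method.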
An additional issue to consider is the amount of work required to build a system. The cost of generating a corpus or an acoustic unit inventory is significant because, besides making the speech recordings, each recording has to be analyzed in fine detail by hand to determine phoneme boundaries, phoneme labels, and other tags. Such time-consuming analysis is not necessary for phrase splicing methods. On the other hand, phrase splicing may be prohibitively expensive for applications involving names, since every name must be recorded (in the US, there are 1.5 million distinct last names!).
A final consideration is size. Although the prices of memory and disk space are continually dropping, being able to have more channels on a given hardware platform translates directly into increased profits, and there is also an increasing interest in using speech synthesis on handheld devices. Thus, size still matters. Concatenative synthesis has the edge on size. Moreover, its quality limitations are less of a problem because the acoustic capabilities of handheld devices are themselves limited.
In other words, each of these methods has problems with quality, scope, the amount of resources required, or size. CSLU's speech generation research and development work aims to cut through the barriers created by these tradeoffs by focusing on the following projects:
CSLU software is intended to be compatible with any speech synthesis engine
that supports a prosody markup language (e.g., as documented in the most recent
W3C speech synthesis markup specifications, or in similar specifications such
as Sable, with which CSLU is closely involved).
Besides communicating with TTS engines through a markup language,
our software can also be integrated into an existing TTS engine that has
a sufficiently rich internal data structure, such as
Festival.
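To give a sense of what communicating through a prosody markup language looks like, here is a minimal sketch that wraps text in the `speak` and `prosody` elements defined by the W3C Speech Synthesis Markup Language. The helper name `to_ssml` and its parameters are assumptions made for illustration and are not part of any CSLU software. The sketch also omits the namespace declaration that a conforming SSML document would carry.

```python
import xml.etree.ElementTree as ET


def to_ssml(text, rate="medium", pitch="medium"):
    """Wrap text in a prosody element (hypothetical helper, for illustration).

    rate and pitch take SSML label values such as "slow"/"fast" and
    "low"/"high". The xmlns declaration required by the full specification
    is left out to keep the sketch short.
    """
    speak = ET.Element("speak", {"version": "1.0", "xml:lang": "en-US"})
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(to_ssml("Your flight departs at nine.", rate="slow", pitch="high"))
```

An engine that accepts this kind of markup could be driven without touching its internals, in contrast to the tighter integration described for engines such as Festival.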
As a preview of things to come, here are some sentences produced using new acoustic
inventories and signal processing components developed at CSLU, coupled with prosodic
models from a commercial TTS engine (these are TTS, not copy synthesis):