Chapter 5: Spoken Output Technologies
Yoshinori Sagisaka
ATR Interpreting Telecommunications Research Laboratories, Tokyo, Japan
Speech synthesis research predates other forms of speech technology by many years. In the early days of synthesis, research efforts were devoted mainly to simulating human speech production mechanisms, using basic articulatory models based on electro-acoustic theories. Though this modeling is still one of the ultimate goals of synthesis research, advances in computer science have widened the research field to include Text-to-Speech (TtS) processing in which not only human speech generation but also text processing is modeled [AHK87]. As this modeling is generally done by a set of rules derived, e.g., from phonetic theories and acoustic analyses, the technology is typically referred to as speech synthesis by rule.
Figure 5.1 shows the configuration of a standard TtS system. In such systems, as represented by MITalk [AHK87], rule-based synthesis has attained highly intelligible speech quality and can already serve in many practical uses. Sustained efforts have improved the quality of rule-based synthetic speech, step by step, by alternating the analysis of speech characteristics with the development of control rules. However, most of this progress has been system dependent, and remains deeply embedded within system architectures in impenetrable meshes of detailed rules and finely tuned control parameters. As a consequence, the expert knowledge that has been incorporated cannot readily be shared and can be very hard for other researchers to replicate in equivalent systems.
Figure 5.1: The configuration of a standard TtS system.
In contrast to this traditional rule-based approach, a corpus-based approach has also been pursued. In the corpus-based work, well-defined speech data sets have been annotated at various levels with information, such as acoustic-phonetic labels and syntactic bracketing, to serve as the foundation for statistical modeling. Spectral and prosodic feature parameters of the speech data are analyzed in relation to the labeled information, and their control characteristics are quantitatively described. Based on the results of these analyses, a computational model is created and trained using the corpus. By subsequently applying the resulting model to unseen test data, its validity and any defects can be quantitatively shown. By feeding back results from such tests into the original model with extended training, further improvements can be attained in a cyclical process.
These formalised procedures, characteristic of the corpus-based approach, provide a clear empirical formulation of the controls underlying speech and, with their specific training procedures and objective evaluation results, can be easily replicated by other researchers using other databases of equivalently annotated speech. In the last decade, the corpus-based approach has been applied to both spectral and prosodic control for speech synthesis. In the following paragraphs, these speech synthesis research activities will be reviewed, with particular emphasis on the types of synthesis unit, on prosody control and on speaker characteristics. Other important topics, such as text processing for synthesis, and spectral parameters and synthesizers, will be detailed in later sections. Through this introduction to the research activities, it will become clear that the corpus-based approach is the key to understanding current research directions in speech synthesis and to predicting the future outcome of synthesis technology.
In TtS systems, speech units that are typically smaller than words are used to synthesize speech from arbitrary input text. Since there are over 10,000 different possible syllables in English, much smaller units such as phonemes and dyads (phoneme pairs) have typically been modelled. A speech segment's spectral characteristics vary with its phonetic context, as defined by neighboring phonemes, stress and positional differences, and recent studies have shown that speech quality can be greatly affected by these contextual differences (see, for example, [OGC93]). However, in traditional rule-based synthesis, though these units have been carefully designed to take into account phonetic variations, no systematic studies have been carried out to determine how and where best to extract the acoustic parameters of units, or what kind of speech corpus can be considered optimal.
To bring objective techniques into the generation of appropriate speech units, unit-selection synthesis has been proposed [NH88,TAS92,SKIM92]. These speech units can be automatically determined through the analysis of a speech corpus using a measure of entropy on substrings of phone labels [SKIM92]. In unit-selection synthesis, speech units are algorithmically extracted from a phonetically transcribed speech data set using objective measures based on acoustic and phonetic criteria. These measures indicate the contextual adequacy of units and the smoothness of the spectral transitions within and between units. Unlike traditional rule-based concatenation synthesis, speech segments are not limited to one token per type, and various types and sizes of units with different contextual variations are used. The phonetic environments of these units and their precise locations are automatically determined through the selection process. Optimal units to match an input phonetic string are then selected from the speech database to generate the target speech output.
The unit selection process involves a combinatorial search over the entire speech corpus, and consequently, fast search algorithms have been developed for this purpose as an integral part of current synthesis. This approach is in contrast to traditional rule-based synthesis where the design of the deterministic units required insights from the researcher's own knowledge and expertise. The incorporation of sophisticated but usually undescribed knowledge was the real bottleneck that prevented the automatic construction of synthesis systems.
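The search described above can be sketched as a dynamic-programming pass over candidate units. The following is a minimal illustration only, not the algorithm of any of the cited systems: the `Unit` record, the 0/1 context penalty in `target_cost`, and the scalar boundary values used by `join_cost` are all hypothetical simplifications chosen to keep the example short.

```python
from typing import List, Tuple

# Hypothetical unit record: (phone, left-context phone, spectral value at the
# unit's left edge, spectral value at its right edge). Real systems would use
# spectral vectors rather than single numbers.
Unit = Tuple[str, str, float, float]


def target_cost(unit: Unit, prev_phone: str) -> float:
    """Contextual mismatch: 0 if the unit was recorded after prev_phone, else 1."""
    return 0.0 if unit[1] == prev_phone else 1.0


def join_cost(left: Unit, right: Unit) -> float:
    """Spectral discontinuity at the boundary between two adjacent units."""
    return abs(left[3] - right[2])


def select_units(phones: List[str], inventory: List[Unit]) -> Tuple[float, List[Unit]]:
    """Return (total cost, unit sequence) minimising target plus join costs."""
    if not phones:
        return 0.0, []
    layers: List[List[Unit]] = []                # candidate units per position
    scores: List[List[Tuple[float, int]]] = []   # (best cost so far, backpointer)
    prev_phone = "#"                             # "#" marks an utterance boundary
    for i, phone in enumerate(phones):
        cands = [u for u in inventory if u[0] == phone]
        if not cands:
            raise ValueError(f"no unit in the inventory for phone {phone!r}")
        layer: List[Tuple[float, int]] = []
        for u in cands:
            tc = target_cost(u, prev_phone)
            if i == 0:
                layer.append((tc, -1))
            else:
                # Cheapest way to reach this unit from any previous candidate.
                cost, back = min(
                    (scores[i - 1][k][0] + join_cost(p, u), k)
                    for k, p in enumerate(layers[i - 1])
                )
                layer.append((cost + tc, back))
        layers.append(cands)
        scores.append(layer)
        prev_phone = phone
    # Trace back from the cheapest final candidate.
    k = min(range(len(scores[-1])), key=lambda j: scores[-1][j][0])
    total = scores[-1][k][0]
    path: List[Unit] = []
    for i in range(len(phones) - 1, -1, -1):
        path.append(layers[i][k])
        k = scores[i][k][1]
    path.reverse()
    return total, path
```

Because each position only needs the best scores of the previous position, the cost of this search grows linearly with the length of the input string instead of with the number of possible unit combinations.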
Corpus-based methods provide for a specification of the speech segments required for concatenative synthesis in three factors:
For synthesis of natural-sounding speech, it is essential to control prosody, to ensure appropriate rhythm, tempo, accent, intonation and stress. Segmental duration control is needed to model temporal characteristics just as fundamental frequency control is needed for tonal characteristics. In contrast to the relative sparsity of work on speech unit generation, many quantitative analyses have been carried out for prosody control. Specifically, quantitative analyses and modeling of segmental duration control have been carried out for many languages using massive annotated speech corpora [CG86,BS87,Kla87,Ume75].
Segmental duration is controlled by many language specific and universal factors. In early models, because these control factors were computed independently, through the quantification of control rules, unexpected and serious errors were sometimes seen. These errors were often caused simply by the application of independently derived rules at the same time. To prevent this type of error and to assign more accurate durations, statistical optimization techniques that model the often complex interactions between all the contributing factors have more recently been used.
Traditional statistical techniques such as linear regression analysis and tree regression analysis have been used for Japanese [KTS92a] and American English [Ril92] respectively. To predict the interactions between syllable and segment level durations for British English, a feed-forward neural network has been employed [Cam92]. In this modeling, instead of attempting to predict the absolute duration of segments directly, their deviation from the average duration is employed to quantify the lengthening and shortening characteristics statistically. Moreover, hierarchical control has been included by splitting the calculation into the current syllable level and its constituent component levels.
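The idea of modeling deviation from the average rather than absolute duration can be illustrated with a per-phone normalization. The sketch below assumes a z-score (deviation divided by the standard deviation for that phone) as the deviation measure; the text above does not specify the exact normalization used in the cited work, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Stats = Dict[str, Tuple[float, float]]


def phone_stats(corpus: List[Tuple[str, float]]) -> Stats:
    """Mean and standard deviation of duration (ms) for each phone label."""
    by_phone: Dict[str, List[float]] = {}
    for phone, dur_ms in corpus:
        by_phone.setdefault(phone, []).append(dur_ms)
    return {p: (mean(d), pstdev(d)) for p, d in by_phone.items()}


def to_deviation(phone: str, dur_ms: float, stats: Stats) -> float:
    """Express a measured duration as a deviation from the phone's average."""
    mu, sigma = stats[phone]
    return (dur_ms - mu) / sigma if sigma > 0 else 0.0


def from_deviation(phone: str, z: float, stats: Stats) -> float:
    """Convert a predicted deviation back into an absolute duration in ms."""
    mu, sigma = stats[phone]
    return mu + z * sigma
```

With this representation, a model can predict one lengthening value for a whole syllable, and each constituent segment then stretches in proportion to its own variability: the same deviation of 1.0 adds more milliseconds to a phone with a large standard deviation than to one with a small standard deviation.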
While hierarchical control is desired to simulate human temporal organization mechanisms, it can be difficult to optimize such structural controls globally. Multiple split regression (MSR) uses error minimization at arbitrary hierarchical levels by defining a hierarchical error function [IS93]. MSR incorporates both linear and tree regressions as special cases and interpolates between them by controlling the tiedness of the control parameters. Additive-multiplicative modeling is also an extension of traditional linear analysis techniques, using bilinear expressions and statistical correlation analyses [VS92]. These statistical models can optimize duration control without losing the freedom to control conditioned exceptions.
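To make the distinction between additive, multiplicative and mixed duration models concrete, the forms can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the cited papers: $D$ is the predicted duration, $f_1,\dots,f_n$ are the values of the control factors (for example phone identity, stress, position), $a_i$, $m_i$ and $s_{k,i}$ are per-factor parameter functions estimated from the corpus, and each $I_k$ is a subset of the factors.

```latex
\begin{align*}
\text{additive:} \quad
  & D(f_1,\dots,f_n) = \sum_{i=1}^{n} a_i(f_i) \\
\text{multiplicative:} \quad
  & D(f_1,\dots,f_n) = \prod_{i=1}^{n} m_i(f_i) \\
\text{additive-multiplicative:} \quad
  & D(f_1,\dots,f_n) = \sum_{k=1}^{K} \prod_{i \in I_k} s_{k,i}(f_i)
\end{align*}
```

Under these assumptions, the mixed form reduces to the additive model when every $I_k$ contains a single factor, and to the multiplicative model when $K = 1$ and $I_1$ contains all factors. A product term over two factors is bilinear in their parameters, which is how an interaction between those factors can be represented.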
To generate an appropriate fundamental frequency (F0) contour when given only text as input, an intermediate prosodic structure needs to be specified. Text processing, as described in a later section, is needed to produce this intermediate prosodic structure. F0 characteristics have been analyzed in relation to prosodic structure by many researchers [Mae76,HS80,Pie81,LP84,Fuj92]. As with duration control, in early models, F0 control rules were made only by assembling independently analyzed F0 characteristics. More recently, however, statistical models have been employed to associate F0 patterns with input linguistic information directly, without requiring estimates of the intermediate prosodic structure [Tra92,SKIM92,YTA+93]. In these models, the same mathematical frameworks as used in duration control, i.e., feed-forward neural networks and linear and tree regression models, have been used.
These computational models can be evaluated by comparing duration or F0 values derived from the predictions of the models with actual values measured in the speech corpus for the same test input sentences. Perceptual studies have also been carried out to measure the effect of these acoustical differences on subjective evaluation scores by systematically manipulating the durations [KTS92b]. It is hoped that a systematic series of perceptual studies will reveal more about human sensitivities to the naturalness and intelligibility of synthesized speech scientifically, and that time-consuming subjective evaluation will no longer be needed.
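The objective comparison of predicted and measured values described above is often summarized with a single error figure. The sketch below uses root-mean-square error and mean signed error as two plausible choices; the text does not prescribe a particular measure, and the function names are hypothetical.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Root-mean-square error between model predictions and corpus values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def mean_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Mean signed error; a non-zero value indicates a systematic bias."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p - m for p, m in zip(predicted, measured)) / len(predicted)
```

The same functions apply to segmental durations in milliseconds or to F0 values in hertz, provided the predicted and measured sequences are aligned segment by segment (or frame by frame) on held-out test sentences.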
Speech waveforms contain not only linguistic information but also speaker voice characteristics, as manifested in the glottal waveform of voice excitation and in the global spectral features representing vocal tract characteristics. The glottal waveform has been manipulated using a glottal source model [FLL85], and female voices, which are more difficult to model, have been successfully synthesized. However, it is very difficult to fully automate such parameter extraction procedures, and an automatic analysis-synthesis scheme has yet to be established.
As for vocal tract characteristics, spectral conversion methods have been proposed that employ the speaker adaptation technology studied in speech recognition [ANSK90,MMI94,MS95]. This technology is also a good example of the corpus-based approach. By deciding on a spectral mapping algorithm, a measure for spectral distance and a speech corpus for training of the mapping, non-parametric voice conversion is defined. The mapping accuracy can be measured using the spectral distortion measures commonly used in speech coding and recognition.
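The three ingredients named above (a mapping algorithm, a spectral distance, and a training corpus) can be shown in a toy form. This sketch is an assumption-laden illustration, not the method of the cited papers: it uses Euclidean distance between short feature vectors, a nearest-neighbour lookup as the mapping, and a hypothetical list of source/target frame pairs that are assumed to be already time-aligned.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical training corpus: time-aligned (source frame, target frame) pairs.
Pairs = List[Tuple[Vector, Vector]]


def spectral_distance(x: Vector, y: Vector) -> float:
    """Euclidean distance between two spectral feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def convert(frame: Vector, pairs: Pairs) -> Vector:
    """Map a source-speaker frame to the target frame paired with the
    training source frame that lies closest to it."""
    _, target = min(pairs, key=lambda pair: spectral_distance(frame, pair[0]))
    return target


def mean_distortion(frames_a: Sequence[Vector], frames_b: Sequence[Vector]) -> float:
    """Average spectral distance between two aligned frame sequences."""
    if len(frames_a) != len(frames_b) or not frames_a:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(spectral_distance(a, b) for a, b in zip(frames_a, frames_b))
    return total / len(frames_a)
```

Mapping accuracy can then be reported by comparing `mean_distortion` between the target speaker's frames and the source frames before conversion with the same figure after conversion, measured on sentences that were not used to build the pairs.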
As indicated in the above paragraphs, research on speech synthesis will continue, aiming all the while at more natural and intelligible speech. It is quite certain that TtS technology will create new speech output applications as speech quality improves. To accelerate this improvement, it is necessary to pursue research on speech synthesis in such a way that each step forward can be evaluated objectively and can be shared among researchers. To this end, a large amount of commonly available data is indispensable, and objective evaluation methods should be pursued in relation to perceptual studies. An important issue of concern to speech synthesis technology is the variability of output speech. As illustrated by recent advances in speaker characteristics control, the adaptation of vocal characteristics is one dimension of such variability. We also have to consider variabilities resulting from human factors, such as speaking purpose, utterance situation and the speaker's mental states. These paralinguistic factors cause changes in speaking style that are reflected in both voice quality and prosody. The investigation of these variations will help to refine synthetic speech quality and widen its fields of application.
Such progress is not restricted to TtS technology; future technologies related to the furtherance of human capabilities are also being developed. Human capabilities such as the acquisition of spoken language bear strong relations to the knowledge acquisition used in developing speech synthesis systems. Useful language training tools and educational devices can therefore be expected to come out of the pursuit and modeling of such knowledge acquisition processes. The corpus-based approach is well suited to this purpose, and inductive learning from speech corpora will give us hints on the directions this research must take. To pursue these new possibilities, it is essential for speech synthesis researchers to collaborate with researchers in other fields related to spoken language, and to bring the methodologies and knowledge acquired in those encounters into their own work.