Next: 5.3 Text Interpretation for TtS Synthesis Up: 5 Spoken Output Technologies Previous: 5.1 Overview

5.2 Synthetic Speech Generation

Christophe d'Alessandro & Jean-Sylvain Liénard
LIMSI-CNRS, Orsay, France

Speech generation is the process that transforms a string of phonetic and prosodic symbols into a synthetic speech signal. The quality of the result depends on the quality of the string as well as on the quality of the generation process itself. For a review of speech generation in English the reader is referred to [FR73] and [Kla87]. Recent developments can be found in [BB92] and [VSSOH95].

Let us first examine what is expected today of a text-to-speech (TtS) system. Two quality criteria are usually proposed. The first is intelligibility, which can be measured on several kinds of units (phonemes, syllables, words, phrases). The second, more difficult to define, is often labeled pleasantness or naturalness. The concept of naturalness may be related to the concept of realism in image synthesis: the goal is not to reproduce reality but to suggest it. Thus, listening to a synthetic voice must allow the listener to attribute this voice to some pseudo-speaker and to perceive some kind of expressivity, as well as cues characterizing the speaking style and the particular speaking situation. For this purpose the corresponding extra-linguistic information must be supplied to the system [GN92].

Most present TtS systems produce an acceptable level of intelligibility, but naturalness, that is, the ability to control expressivity, speech style and pseudo-speaker identity, is still poorly mastered. Note however that users' demands vary to a large extent with the field of application: general public applications such as telephone information retrieval need maximal realism and naturalness, whereas applications involving professionals (process or vehicle control) or highly motivated persons (visually impaired users, applications in hostile environments) give intelligibility the highest priority.

5.2.1 Input to the Speech Generation Component

The input string to the speech generation component is basically a phonemic string produced by the grapheme-to-phoneme converter. It is usually enriched with a series of prosodic marks denoting the accents and pauses. With few exceptions the phoneme set of a given language is well defined; thus the symbols are not ambiguous. However, the transcript may represent either a sequence of abstract linguistic units (phonemes) or a sequence of acoustic-phonetic units (phones or transitional segments). In the former case (phonological or normative transcript) it may be necessary to apply some transformations to obtain the acoustical transcript. To make this distinction clearer, let us take a simple example in French. The word "médecin" (medical doctor) may appear in a pronunciation dictionary as "mé-de-cin" /me-də-sɛ̃/, which is perfectly correct. But when embedded in a sentence it is usually pronounced differently, as "mèt-cin" /mɛt-sɛ̃/. The tense vowel "é" /e/ is realized as its lax counterpart "è" /ɛ/, the "e" /ə/ disappears, the three syllables are replaced by only two, and the voicing of the plosive /d/ is neutralized by the presence of the unvoiced /s/ which follows. Without such rules the output of the synthesizer may be intelligible, but its naturalness may suffer. Such transformations are not simple; they involve not only a set of phonological rules, but also considerations of speech style, of the supposed socio-geographical origin of the pseudo-speaker, and of the speech rate.
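The three effects described for "médecin" can be illustrated with a toy rule cascade. The sketch below is only an illustration under strong assumptions: the symbols are SAMPA-like ASCII stand-ins ("@" for schwa, "E" for the lax vowel, "e~" for the nasal vowel), the rule contexts are deliberately simplistic, and the function names are hypothetical. A real system would also condition these rules on speech style, speaker origin and rate, as noted above.

```python
# Toy phonological-to-phonetic rule cascade (illustrative only).
# Symbols are SAMPA-like ASCII stand-ins chosen for this sketch.
VOWELS = {"a", "e", "E", "i", "o", "u", "y", "@", "e~", "a~", "o~"}
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}
UNVOICED = {"p", "t", "k", "f", "s"}


def is_vowel(phone):
    return phone in VOWELS


def delete_schwa(phones):
    """Drop a schwa in a vowel-consonant-schwa-consonant context (simplified)."""
    out = []
    for i, p in enumerate(phones):
        prev_is_cons = i >= 1 and not is_vowel(phones[i - 1])
        prev2_is_vowel = i >= 2 and is_vowel(phones[i - 2])
        next_is_cons = i + 1 < len(phones) and not is_vowel(phones[i + 1])
        if p == "@" and prev_is_cons and prev2_is_vowel and next_is_cons:
            continue  # the schwa is not realized
        out.append(p)
    return out


def assimilate_voicing(phones):
    """Devoice a voiced obstruent directly followed by an unvoiced one."""
    out = list(phones)
    for i in range(len(out) - 1):
        if out[i] in DEVOICE and out[i + 1] in UNVOICED:
            out[i] = DEVOICE[out[i]]
    return out


def lax_closed_e(phones):
    """Realize /e/ as lax [E] when two consonants follow (toy closed-syllable test)."""
    out = list(phones)
    for i in range(len(out) - 2):
        if out[i] == "e" and not is_vowel(out[i + 1]) and not is_vowel(out[i + 2]):
            out[i] = "E"
    return out


def to_phonetic(phones):
    """Apply the three rules in order: schwa deletion, devoicing, vowel laxing."""
    return lax_closed_e(assimilate_voicing(delete_schwa(phones)))


print(to_phonetic(["m", "e", "d", "@", "s", "e~"]))  # ['m', 'E', 't', 's', 'e~']
```

The rule order matters in this sketch: the schwa must be deleted first so that /d/ and /s/ become adjacent, which in turn creates the context for devoicing and for the laxing of the first vowel.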

Analogously, the prosodic symbols must be processed differently according to their abstraction level. But the problem is more difficult, because there is no general agreement in the phonetic community on a set of prosodic marks that would have universal value, even within the framework of a given language. A notable exception is the ToBI system for the transcription of English [PBH94]. Each synthesis system defines its own repertoire of prosodic entities and symbols, which can be classified into three categories: phonemic durations, accents and pauses.

5.2.2 Prosody Generation

Usually only the accents and pauses, deduced from the text, are transcribed in the most abstract form of the prosodic string. But this abstract form has to be transformed into a flow of parameters in order to control the synthesizer. The parameters to be computed include the fundamental frequency (F0) and the duration of each speech segment, as well as its intensity and timbre. A melodic (or intonational) model and a duration model are needed to implement the prosodic structure computed by the text processing component of the speech synthesizer.

F0 evolution, often considered the main support of prosody, depends, as do the phonemic durations, on phonetic, lexical, syntactic and pragmatic factors. Depending on the language under study, the melodic model is built on different levels, generally the word level (word accent) and the sentence or phrase level (phrase accent). The aim of the melodic model is to compute F0 curves. Three major types of melodic models are currently in use for F0 generation. The first type of melodic model is production-oriented. It aims at representing the commands governing F0 generation. This type of model associates melodic commands with word and phrase accents. The melodic command is either an impulse or a step signal. The F0 contour is obtained as the response of a smoothing filter to these word and phrase commands [FK88]. The second type of melodic model is rooted in perception research [HCC90]. Synthetic F0 contours are derived from stylized natural F0 contours. At the synthesis stage, the F0 curves are obtained by concatenation of melodic movements: rises, falls, and flat movements. Automatic procedures for pitch contour stylization have been developed [dM95]. In the last type of melodic model, F0 curves are implemented as a set of target values, linked by interpolation functions [Pie81].
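As an illustration of the first, production-oriented type of model, the sketch below computes an F0 contour as the sum, in the log-frequency domain, of smoothed responses to phrase impulses and accent steps. The response functions follow the form commonly given for command-response models of this family; the parameter values (alpha, beta, gamma), the command timings and amplitudes, and all function names are illustrative assumptions, not values taken from [FK88].

```python
import math


def phrase_response(t, alpha=3.0):
    """Smoothed response to a phrase impulse issued at t = 0 (zero before onset)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Smoothed response to an accent step issued at t = 0, with ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_hz, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time in `times`.

    phrase_cmds: list of (onset_s, amplitude) impulses.
    accent_cmds: list of (onset_s, offset_s, amplitude) steps.
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_hz)
        for onset, amp in phrase_cmds:
            log_f0 += amp * phrase_response(t - onset)
        for onset, offset, amp in accent_cmds:
            # A step of finite length is the difference of two step responses.
            log_f0 += amp * (accent_response(t - onset) - accent_response(t - offset))
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical utterance: one phrase command at 0 s, one accent from 0.2 s to 0.5 s.
times = [i * 0.05 for i in range(21)]
curve = f0_contour(times, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)])
print([round(f) for f in curve])
```

Working in log F0 makes the commands additive and keeps the resulting contour positive. The target-interpolation type of model would instead store a few (time, F0) pairs and interpolate between them.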

The phonemic durations result from several factors. They are determined in part by the mechanical functioning of the synthesizer when the latter is articulatory in nature, or by the duration of the prerecorded segments in the case of concatenative synthesis. Another part is related to the accent. Another, reflecting the linguistic function of the word in the sentence, is usually related to the syntactic structure. Finally, the last part is related to the situation and to the pseudo-speaker's characteristics (speech rate, dialect, stress, etc.).

Two or three levels of rules are generally present in durational models. The first level represents co-intrinsic duration variations (i.e., the modification of segment durations that are due to their neighbors). The second level is the phrase level: modification of durations that are due to prosodic phrasing. Some systems also take into account a third level, the syllabic level [CI91].
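A layered duration model of this kind can be sketched as follows. The combination formula used here, a minimum duration plus a scaled compressible part, is the one popularized by Klatt's duration rules and is swapped in for illustration; the source does not prescribe it. The inherent and minimum durations, the scaling factors and the phone classes are all hypothetical values chosen for the example.

```python
# Hypothetical inherent and minimum durations in milliseconds.
INHERENT_MS = {"a": 120.0, "i": 100.0, "t": 70.0, "s": 100.0, "d": 60.0}
MINIMUM_MS = {"a": 60.0, "i": 50.0, "t": 40.0, "s": 50.0, "d": 30.0}
VOWELS = {"a", "i"}
UNVOICED = {"t", "s"}


def segment_duration(phone, next_phone=None, phrase_final=False):
    """Duration in ms: minimum + (inherent - minimum) * percent / 100."""
    percent = 100.0
    # Level 1, co-intrinsic: a vowel is shortened before an unvoiced consonant.
    if phone in VOWELS and next_phone in UNVOICED:
        percent *= 0.8
    # Level 2, phrase level: lengthening at the end of a prosodic phrase.
    if phrase_final:
        percent *= 1.4
    minimum = MINIMUM_MS[phone]
    return minimum + (INHERENT_MS[phone] - minimum) * percent / 100.0


def durations(phones):
    """Apply the rules to a phone sequence; the last phone is phrase-final."""
    result = []
    for i, phone in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        result.append(segment_duration(phone, nxt, phrase_final=(i == len(phones) - 1)))
    return result


print(durations(["d", "a", "t", "i"]))
```

Because only the part above the minimum is scaled, successive shortening rules cannot compress a segment below its minimum duration.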

The other prosodic parameters (intensity, timbre) are usually implicitly fixed from the start. However, some research is devoted to voice quality characterization or to differences between male and female voices [KK90].

One of the most difficult problems in speech synthesis to date is prosodic modeling. A large body of problems comes from text analysis (see section gif). But there is also room for improvement in both melodic and durational models. In natural speech the prosodic parameters interact in a way that is still unknown, in order to supply the listener with prosodic information while preserving the impression of fluency. Understanding the interplay of these parameters is today one of the hottest topics in speech synthesis research. For prosodic generation, a move from rule-based modeling to statistical modeling is noticeable, as in many areas of speech and language technology [VS94].

5.2.3 Speech Signal Generation

The last step for speech output is synthesis of the waveform, according to the segmental and prosodic parameters defined at earlier stages of processing.

Speech signal generators (the synthesizers) can be classified into three categories: (1) articulatory synthesizers, (2) formant synthesizers, and (3) concatenative synthesizers. Articulatory synthesizers are physical models based on the detailed description of the physiology of speech production and on the physics of sound generation in the vocal apparatus [PC92]. Typical parameters are the position and kinematics of articulators. The sound radiated at the mouth is then computed according to equations of physics. This type of synthesizer is still rather far from applications and marketing, because of its computational cost and because the underlying theoretical and practical problems remain unsolved.

Formant synthesis is a descriptive acoustic-phonetic approach to synthesis [AHK87]. Speech generation is not performed by solving equations of physics in the vocal apparatus, but by modeling the main acoustic features of the speech signal [Kla80,SB91]. The basic acoustic model is the source/filter model. The filter, described by a small set of formants, represents articulation in speech. It models speech spectra that are representative of the position and movements of articulators. The source represents phonation. It models the glottal flow or noise excitation signals. Both source and filter are controlled by a set of phonetic rules (typically several hundred). High-quality rule-based formant synthesizers, including multilingual systems, have been marketed for many years.
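The source/filter model can be sketched in a few lines: an impulse train stands in for the glottal source, and a cascade of second-order digital resonators, one per formant, stands in for the filter. The resonator difference equation and its coefficients follow the form widely used in cascade formant synthesizers; the formant frequencies and bandwidths for an /a/-like vowel, the sampling rate and the function names are illustrative assumptions. A real rule-based system would vary these parameters over time and use a more realistic glottal waveform.

```python
import math


def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2], unity gain at DC."""
    period = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * period)
    b = 2.0 * math.exp(-math.pi * bw_hz * period) * math.cos(2.0 * math.pi * freq_hz * period)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, fs):
    """Crude voiced source: one unit impulse per pitch period."""
    n = int(duration_s * fs)
    period = int(round(fs / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(f0_hz, formants, duration_s=0.05, fs=16000):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(f0_hz, duration_s, fs)
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz, fs)
    return signal


# Hypothetical formant values (Hz) for an /a/-like vowel.
A_LIKE = [(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)]
samples = synthesize_vowel(110.0, A_LIKE)
print(len(samples))  # 800
```

The separation is visible in the code: changing `f0_hz` alters only the source, while changing the formant list alters only the filter.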

Concatenative synthesis is based on speech signal processing of natural speech databases. The segmental database is built to reflect the major phonological features of a language. For instance, its set of phonemes is described in terms of diphone units, representing the phoneme-to-phoneme junctures. Non-uniform units are also used (diphones, syllables, words, etc.). The synthesizer concatenates (coded) speech segments, and performs some signal processing to smooth unit transitions and to match predefined prosodic schemes. Direct pitch-synchronous waveform processing is one of the simplest and most popular concatenative synthesis algorithms [MC90b]. Other systems are based on multipulse linear prediction [AR82], or harmonic plus noise models [LSM93,DL93,Rd94].
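The smoothing of unit transitions can be illustrated with the simplest possible technique, a linear crossfade over a short overlap between consecutive segments. This is a deliberately reduced stand-in, not the pitch-synchronous method of [MC90b]: it does not align pitch periods and does not modify prosody. The function names and the toy "database" of sample lists are hypothetical.

```python
def crossfade_concat(left, right, overlap):
    """Join two sample lists, linearly crossfading over `overlap` samples."""
    overlap = min(overlap, len(left), len(right))
    if overlap <= 0:
        return list(left) + list(right)
    head = list(left[:-overlap])
    tail = list(right[overlap:])
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the incoming segment
        mixed.append((1.0 - w) * left[len(left) - overlap + i] + w * right[i])
    return head + mixed + tail


def concatenate_units(units, overlap):
    """Chain a sequence of stored segments with a crossfade at each junction."""
    if not units:
        return []
    signal = list(units[0])
    for unit in units[1:]:
        signal = crossfade_concat(signal, unit, overlap)
    return signal


# Toy "database": two constant-valued segments standing in for diphones.
unit_a = [0.0] * 8
unit_b = [1.0] * 8
joined = concatenate_units([unit_a, unit_b], overlap=4)
print(len(joined))  # 12
```

Each junction shortens the output by the overlap length, so a system matching a prescribed duration would have to account for it when selecting or trimming units.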

Several high-quality concatenative synthesizers, including multilingual systems, are marketed today.

Trends in Speech Generation

Perceptual assessment is among the most important aspects of speech synthesis research [VBP90,VS93,KP95]. When working on phonetic rule definition or segment concatenation, a robust and quick assessment methodology is absolutely necessary to improve the system. It is also necessary for comparing systems with each other. As far as speech naturalness is concerned, the problem is still almost untouched. Nobody knows what speech naturalness is or, more generally, what is expected from a synthesis system once its intelligibility is rated sufficiently high. To explore this domain it will be necessary to cooperate with psychologists and human factors specialists.

Although the recent developments of speech synthesis demonstrated the power of the concatenative approach, it seems that there is much room for improvement:

  1. Choice of Non-uniform and Multi-scale Units (see section 5.1.2): What are the best synthesis units? This question is rooted in psycholinguistics, and is a challenging problem for phonology.
  2. Speech Signal Modification: Signal representation for speech is still an open problem, particularly for manipulation of the excitation.
  3. Voice Conversion: What are the parameters, phonetic description, methods for characterization of a particular speaker, and conversion of the voice of a speaker into the voice of another speaker [VMT92]?

Accurate physical modeling of speech production is still not mature for technological applications. Nevertheless, as both basic knowledge on speech production and the power of computers increase, articulatory synthesis will help in improving formant-based methods, take advantage of computational physics (fluid dynamics equations for the vocal apparatus), and better mimic the physiology of human speech production.

Synthesis of the human voice is not limited to speech synthesis. Since the beginning of speech synthesis research, many workers have also paid attention to the musical aspects of voice and to singing [Sun87]. Like TtS, synthesis of singing finds its motivations both in science and technology: on the one hand singing analysis and synthesis is a challenging field for scientific research, and on the other hand it can serve music production (contemporary music, film and disk industries, electronic music industry). As in speech synthesis, two major types of techniques are used for signal generation: descriptive-acoustic methods (rule-based formant synthesis) and signal processing methods (modification/concatenation of pre-recorded singing voices).

Future Directions

Prosodic modeling is probably the domain from which most of the improvements will come. In the long run it may be argued that the main problems to be solved concern the mastery of the linguistic and extra-linguistic phenomena related to prosody, which reflect problems of another kind, related to oral person-to-person and person-to-machine interactions.

Concerning the phonetic-acoustic generation process, it may be foreseen that in the short run concatenative and articulatory synthesis will be boosted by the development of the microcomputer industry. By using off-the-shelf components it is already possible to implement a system using a large number of speech segments, with several variants that take into account contextual and prosodic effects, even for several speakers. This tendency can only be reinforced by the apparently unlimited evolution of computer speed and memory capacity, as well as by the fact that the computer industry provides not only the tools but also the market: speech synthesis must nowadays be considered one of the most attractive aspects of virtual reality; it will benefit from the development of multimedia and information highways.


