Next: 5.4 Spoken Language Generation Up: 5 Spoken Output Technologies Previous: 5.2 Synthetic Speech Generation

5.3 Text Interpretation for TtS Synthesis

Richard Sproat
Bell Labs, Murray Hill, New Jersey, USA

The problem of converting text into speech for some language can naturally be broken down into two subproblems. One subproblem involves the conversion of linguistic parameter specifications (e.g., phoneme sequences, accentual parameters) into parameters (e.g., formant parameters, concatenative unit indices, pitch time/value pairs) that can drive the actual synthesis of speech. The other subproblem involves the computation of these linguistic parameter specifications from input text, which for the present discussion we will assume to be written in the standard orthographic representation for the language in question, and electronically coded in a standard scheme such as ASCII, ISO, JIS, BIG5, GB, and the like, depending upon the language. It is this second problem that is the topic of this section.

In any language, orthography is an imperfect representation of the underlying linguistic form. To illustrate this point, and to introduce some of the issues that we will discuss in this section, consider an English sentence such as Give me a ticket to Dallas or give me back my money; see the figure below.


Figure: Some linguistic structures associated with the analysis of the sentence, "Give me a ticket to Dallas or give me back my money."

One of the first things that an English TtS system would need to do is tokenize the input into words: for English this is not generally difficult though for some other languages it is more complicated. A pronunciation then needs to be computed for each word; in English, given the irregularity of the orthography, this process involves a fair amount of lexical lookup though other processes are involved too. Some of the words in the sentence should be accented; in this particular case, a reasonable accentuation would involve accenting content words like give, ticket, Dallas, back and money, and leaving the other words unaccented. Then we might consider breaking the input into prosodic phrases: in this case, it would be reasonable to intone the sentence as if there were a comma between Dallas and or. Thus, various kinds of linguistic information need to be extracted from the text, but only in the case of word boundaries can this linguistic information be said to be represented directly in the orthography. In this survey I will focus on the topics of tokenization into words; the pronunciation of those words; the assignment of phrasal accentuation; and the assignment of prosodic phrases. An important area about which I will say little is what is often termed text normalization, comprising things like end-of-sentence detection, the expansion of abbreviations, and the treatment of acronyms and numbers.

5.3.1 Tokenization

As noted above, one of the first stages of analysis of the text input is the tokenization of the input into words. For many languages, including English, this problem is fairly easy in that one can to a first approximation assume that word boundaries coincide with whitespace or punctuation in the input text. In contrast, in many Asian languages the situation is not so simple, since spaces are never used in the orthographies of those languages to delimit words. In Chinese, for example, whitespace generally only occurs in running text at paragraph boundaries. The Chinese writing system consists of several thousand distinct elements, usually termed characters. With few exceptions, characters are monosyllabic. More controversially, one can also claim that most characters represent morphemes.

Just as words in English may consist of one or more morphemes so Chinese words may also consist of one or more morphemes. In a TtS system there are various reasons why it is important to segment Chinese text into words (as opposed to having the system read the input character-by-character). Probably the easiest of these to understand is that quite a few characters have more than one possible pronunciation, where the pronunciation chosen depends in many cases upon the particular word in which the character finds itself. A minimal requirement for word segmentation would appear to be an on-line dictionary that enumerates the word forms of the language. Indeed, virtually all Chinese segmenters reported in the literature contain a reasonably large dictionary [CL92,WT93,LCS93,SSGC94]. Given a dictionary, however, one is still faced with the problem of how to use the lexical information to segment an input sentence: it is often the case that a sentence has more than one possible segmentation, so some method has to be employed to decide on the best analysis. Both heuristic (e.g., a greedy algorithm that finds the longest word at any point) and statistical approaches (algorithms that find the most probable sequence of words according to some model) have been applied to this problem.
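The contrast between the heuristic and the statistical approach can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the lexicon is a hypothetical five-entry toy, the probabilities are invented rather than estimated from a corpus, and the statistical model is a plain unigram model, which is just one of the models that might be used.

```python
import math

# Hypothetical toy lexicon with made-up relative frequencies
# (assumptions for illustration, not corpus statistics).
LEXICON = {
    "研究": 0.020,    # "research"
    "研究生": 0.005,  # "graduate student"
    "生命": 0.010,    # "life"
    "命": 0.001,      # "fate"
    "起源": 0.004,    # "origin"
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)
UNKNOWN_PROB = 1e-8  # assumed penalty for a single out-of-lexicon character


def greedy_segment(text):
    """Heuristic: at each position take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a last resort.
            if candidate in LEXICON or length == 1:
                words.append(candidate)
                i += length
                break
    return words


def most_probable_segment(text):
    """Statistical: maximise the product of unigram word probabilities."""
    # best[j] holds (log probability, words) for the best analysis of text[:j].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for j in range(1, len(text) + 1):
        for i in range(max(0, j - MAX_WORD_LEN), j):
            word = text[i:j]
            if word in LEXICON:
                prob = LEXICON[word]
            elif len(word) == 1:
                prob = UNKNOWN_PROB
            else:
                continue
            score = best[i][0] + math.log(prob)
            if score > best[j][0]:
                best[j] = (score, best[i][1] + [word])
    return best[len(text)][1]
```

On the input 研究生命起源 ("research into the origin of life"), the greedy method commits early to the three-character word 研究生 and is left with a poor remainder, whereas the dynamic-programming search over the whole sentence prefers 研究 / 生命 / 起源 under the assumed probabilities.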

While a dictionary is certainly a necessity for doing Chinese segmentation, it is not sufficient since in Chinese, as in English, any given text is likely to contain some words that are not found in the dictionary. Among these are words that are derived via morphologically productive processes, personal names and foreign names in transliteration. For morphologically complex forms, standard techniques for morphological analysis can be applied [Kos83,TL90,KKZ92,Spr92], though some augmentation of these techniques is necessary in the case of statistical methods [SSGC94]. Various statistical and non-statistical methods for handling personal and foreign names have been reported; see, for example, [CCZ92,WLC92,SSGC94].

The period since the late 1980s has seen an explosion of work on the various problems of Chinese word segmentation, due in large measure to the increasing availability of large electronic corpora of Chinese text. Still, there is much work left to be done in this area, both in improving algorithms, and in the development of replicable evaluation criteria, the current lack of which makes fair comparisons of different approaches well-nigh impossible.

5.3.2 Word Pronunciation

Once the input is tokenized into words, the next obvious thing that must be done is to compute a pronunciation (or a set of possible pronunciations) for the words, given the orthographic representation of those words. The simplest approach is to have a set of letter-to-sound rules that simply map sequences of graphemes into sequences of phonemes, along with possible diacritic information, such as stress placement. This approach is naturally best suited to languages like Spanish or Finnish where there is a relatively simple relation between orthography and phonology. For languages like English, however, it has generally been recognized that a highly accurate word pronunciation module must contain a pronouncing dictionary that at the very least records words whose pronunciation could not be predicted on the basis of general rules. Of course, the same problems of coverage as were noted in the Chinese segmentation problem also apply in the case of pronouncing dictionaries: many text words occur that are not to be found in the dictionary, the most important of these being morphological derivatives from known words, or previously unseen personal names.
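The division of labour between a pronouncing dictionary and letter-to-sound rules might be sketched as follows. Everything here is an assumption made for illustration: the exception entries, the rule table, and the informal phoneme symbols are hypothetical and far from a usable rule set for English.

```python
# Hypothetical exception dictionary: words the toy rules below would get wrong.
# Phoneme symbols are informal and illustrative only.
EXCEPTIONS = {
    "colonel": ["k", "er", "n", "ah", "l"],
    "one": ["w", "ah", "n"],
}

# Toy grapheme-to-phoneme rules (assumed; a real rule set would be much
# larger and context-sensitive).
RULES = {
    "sh": ["sh"], "ch": ["ch"], "th": ["th"], "ee": ["iy"], "oo": ["uw"],
    "a": ["ae"], "b": ["b"], "d": ["d"], "e": ["eh"], "i": ["ih"],
    "k": ["k"], "m": ["m"], "n": ["n"], "o": ["aa"], "p": ["p"],
    "s": ["s"], "t": ["t"],
}
MAX_GRAPHEME_LEN = max(len(g) for g in RULES)


def pronounce(word):
    """Look the word up first; fall back to longest-match letter-to-sound rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes, i = [], 0
    while i < len(word):
        for length in range(min(MAX_GRAPHEME_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.extend(RULES[chunk])
                i += length
                break
        else:
            # No rule covers this letter in the toy table.
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phonemes
```

With these tables, pronounce("sheep") is derived by rule, while pronounce("colonel") is taken from the exception dictionary because its spelling gives the rules no chance.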

For morphological derivatives, standard techniques for morphological analysis can be applied to achieve a morphological decomposition for a word; see [AHK87]. The pronunciation of the whole can then in general be computed from the (presumably known) pronunciation of the morphological parts, applying appropriate phonological rules of the language. Morphological analysis is of some use in the prediction of name pronunciation too, since some names are derived from others via fairly productive morphological processes (cf., Robertson and Robert). However, this is not always the case, and one must also rely on other methods. One such method involves computing the pronunciation of a new name by analogy with the pronunciation of a similar name [CCL90,Gol91] (and see also [DN91] for a more general application of analogical reasoning to word pronunciation). For example, if we have the name Califano in our dictionary and know its pronunciation, then we can compute the pronunciation of a hypothetical name Balifano by noting that both names share the final substring alifano: Balifano can then be pronounced on analogy by removing the phoneme /k/, corresponding to the letter C in Califano, and replacing it with the phoneme /b/. Yet another approach to handling proper names involves computing the language of origin of a name, typically by means of n-gram models of letter sequences for the various languages; once the origin of the name is guessed, language-specific pronunciation rules can be invoked to pronounce the name [Chu85,Vit91].
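The Califano/Balifano example can be sketched in a few lines. This is a simplified illustration, not the method of the cited systems: it assumes a hypothetical dictionary in which each known name is stored as a letter-by-letter alignment with informal phoneme symbols, and a hypothetical table of default sounds for the letters that differ.

```python
# Hypothetical dictionary of names, stored as aligned (letter, phoneme) pairs.
# The one-letter-per-pair alignment is a simplifying assumption.
ALIGNED_NAMES = {
    "califano": [("c", "k"), ("a", "ae"), ("l", "l"), ("i", "ih"),
                 ("f", "f"), ("a", "aa"), ("n", "n"), ("o", "ow")],
}

# Assumed default sounds for letters in the unmatched prefix.
LETTER_DEFAULTS = {"b": "b", "c": "k", "d": "d", "m": "m", "s": "s", "t": "t"}


def shared_suffix_len(a, b):
    """Number of letters the two strings have in common at their ends."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def pronounce_by_analogy(name):
    """Reuse the phonemes of the known name sharing the longest final substring."""
    name = name.lower()
    best = max(ALIGNED_NAMES, key=lambda known: shared_suffix_len(name, known))
    n = shared_suffix_len(name, best)
    if n == 0:
        return None  # nothing to draw an analogy from
    suffix_phonemes = [p for _, p in ALIGNED_NAMES[best][len(best) - n:]]
    prefix = name[:len(name) - n]
    if any(ch not in LETTER_DEFAULTS for ch in prefix):
        return None  # the toy table cannot pronounce the differing letters
    return [LETTER_DEFAULTS[ch] for ch in prefix] + suffix_phonemes
```

For Balifano the shared final substring is alifano, so the phonemes for that substring are copied from Califano and /b/ is supplied for the initial letter.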

In many languages there are word forms that are inherently ambiguous in pronunciation, and for which a word pronunciation module as just described can only return a set of possible pronunciations, from which the most reasonable one must then be chosen. For example, the word bass rhymes with lass if it denotes a type of fish, and is homophonous with base if it denotes a musical range. An approach to this problem is discussed in [Yar94] (and see also [SHY92]). The method starts with a training corpus containing tagged examples in context of each pronunciation of a homograph. Significant local evidence (e.g., n-grams containing the homograph in question that are strongly associated with one or another pronunciation) and wide-context evidence (i.e., words that occur anywhere in the same sentence that are strongly associated with one of the pronunciations) are collected into a decision list, wherein each piece of evidence is ordered according to its strength (log likelihood of each pronunciation given the evidence). A novel instance of the homograph is then disambiguated by finding the strongest piece of evidence in the context in which the novel instance occurs, and letting that piece of evidence decide the matter. It is clear that the above-described method can also be applied to other formally similar problems in TtS, such as abbreviation expansion: for example, is St. to be expanded as Saint or Street?
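A decision list of this kind might be sketched as follows for bass. The counts are invented for illustration, the three evidence types ("prev", "next", "sentence") are a simplification of the local and wide-context evidence described above, and the smoothed absolute log ratio used as the strength measure is an assumption about one reasonable way to rank evidence.

```python
import math

# Hypothetical training counts for the homograph "bass":
# (evidence kind, word) -> (count with fish sense, count with music sense)
COUNTS = {
    ("next", "player"): (0, 40),
    ("prev", "sea"): (30, 0),
    ("sentence", "fishing"): (25, 1),
    ("sentence", "guitar"): (1, 35),
    ("sentence", "river"): (12, 2),
}
SMOOTHING = 0.5  # assumed additive smoothing to avoid division by zero


def build_decision_list(counts):
    """Order pieces of evidence by the absolute log ratio of the two senses."""
    rules = []
    for (kind, word), (fish, music) in counts.items():
        ratio = math.log((fish + SMOOTHING) / (music + SMOOTHING))
        label = "fish" if ratio > 0 else "music"
        rules.append((abs(ratio), kind, word, label))
    rules.sort(reverse=True)  # strongest evidence first
    return rules


def disambiguate(tokens, index, rules, default=None):
    """Let the strongest piece of evidence present in the context decide."""
    tokens = [t.lower() for t in tokens]
    prev_word = tokens[index - 1] if index > 0 else None
    next_word = tokens[index + 1] if index + 1 < len(tokens) else None
    others = set(tokens[:index] + tokens[index + 1:])
    for _strength, kind, word, label in rules:
        matched = (
            (kind == "prev" and word == prev_word)
            or (kind == "next" and word == next_word)
            or (kind == "sentence" and word in others)
        )
        if matched:
            return label
    return default
```

In "the bass player went fishing" both a music cue (player) and a fish cue (fishing) are present; under the assumed counts the adjacent word player is the stronger evidence, so it alone decides the pronunciation.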

5.3.3 Accentuation

In many languages various words in a sentence are associated with accents, which are often manifested as upward or downward movements of fundamental frequency. Usually, not every word in the sentence bears an accent, however, and the decision of which words should be accented and which ones should not is one of the problems that must be addressed by a TtS system. More precisely, we will want to distinguish three levels of prominence, two being accented and unaccented, as just described, and the third being cliticized. Cliticized words are unaccented but additionally lack word stress, with the consequence that they tend to be durationally short.

A good first step in assigning accents is to make the accentual determination on the basis of broad lexical categories or parts of speech of words. Content words (nouns, verbs, adjectives and perhaps adverbs) tend in general to be accented; function words, including auxiliary verbs and prepositions, tend to be deaccented; short function words tend to be cliticized. Naturally this presumes some method for assigning parts of speech, and in particular for disambiguating words like can, which can be either content words (in this case, a verb or a noun) or function words (in this case, an auxiliary); fortunately, somewhat robust methods for part-of-speech tagging exist (e.g., [Chu88]). Of course, a finer-grained part-of-speech classification also reveals a finer-grained structure to the accenting problem. For example, the distinction between prepositions (up the spout) and particles (give up) is important in English since prepositions are typically deaccented or cliticized while particles are typically accented [Hir93].
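This first step can be sketched as a lookup on part-of-speech tags. The tag names, the hand-tagged example, and the length threshold used to decide which function words count as "short" are all assumptions made for illustration; a real system would take its tags from a part-of-speech tagger.

```python
# Assumed coarse tag set; particles are grouped with the accented classes.
ACCENTED_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PARTICLE"}
CLITIC_MAX_LEN = 2  # assumed threshold for a "short" function word


def assign_prominence(tagged_words):
    """Map (word, tag) pairs to accented / deaccented / cliticized."""
    result = []
    for word, tag in tagged_words:
        if tag in ACCENTED_TAGS:
            level = "accented"
        elif len(word) <= CLITIC_MAX_LEN:
            level = "cliticized"
        else:
            level = "deaccented"
        result.append((word, level))
    return result


# Hand-tagged version of the example sentence (the tags are an assumption).
EXAMPLE = [
    ("Give", "VERB"), ("me", "PRON"), ("a", "DET"), ("ticket", "NOUN"),
    ("to", "PREP"), ("Dallas", "NOUN"), ("or", "CONJ"), ("give", "VERB"),
    ("me", "PRON"), ("back", "PARTICLE"), ("my", "DET"), ("money", "NOUN"),
]
```

Applied to EXAMPLE, the sketch accents give, ticket, Dallas, back and money, which matches the accentuation suggested for this sentence in the introduction; all the remaining words happen to be short function words and so come out as cliticized.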

But accenting has a wider function than merely communicating lexical category distinctions between words. In English, one important set of constructions where accenting is more complicated than what might be inferred from the above discussion is complex noun phrases: basically, a noun preceded by one or more adjectival or nominal modifiers. In a discourse-neutral context, some constructions are accented on the final word (Madison Avenue), some on the penultimate (Wall Street, kitchen towel rack), and some on an even earlier word (sump pump factory). Accenting in nominals longer than two words is generally predictable given that one can compute the nominal's structure (itself a non-trivial problem), and given that one knows the accentuation pattern of the binary nominals embedded in the larger construction [LP77,LS92,Spr94]. Most linguistic work on nominal accent (e.g., [Fud84,LS92], though see [Lad84]) has concluded that the primary determinants of accenting are semantic, but that within each semantic class there are lexically or semantically determined exceptions. For instance, righthand accent is often found in cases where the lefthand element denotes a location or time for the second element (cf. morning paper), but there are numerous lexical exceptions (morning sickness). Recent computational models (e.g., [Mon90,Spr94]) have been partly successful at modeling the semantic and lexical generalizations; for example, [Spr94] uses a combination of hand-built lexical and semantic rules, as well as a statistical model based on a corpus of nominals hand-tagged with accenting information.

Accenting is not only sensitive to syntactic structure and semantics, but also to properties of the discourse. One straightforward effect is givenness: in a case like my son badly wants a dog, but I am allergic to dogs, the second occurrence of dog (in dogs) would often be deaccented because of the previous mention of dog. (See [Hir93] for a discussion of how to model this and other discourse effects, as well as the syntactic and semantic effects previously mentioned, in a working TtS module.) While humanlike accenting capabilities are possible in many cases, there are still many unsolved problems, a point we return to in the concluding subsection.
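A word-level version of the givenness effect can be sketched as follows. This is a deliberately crude illustration and not the model of [Hir93]: the stemmer is a hypothetical stand-in for real morphological analysis, and the set of content-word stems is supplied by hand.

```python
def crude_stem(token):
    """Very rough stand-in for a real morphological analyser (an assumption)."""
    word = token.lower().strip(".,;:!?")
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def mark_givenness(tokens, content_stems):
    """Label content words as 'new' on first mention and 'given' afterwards."""
    seen = set()
    marked = []
    for token in tokens:
        stem = crude_stem(token)
        if stem not in content_stems:
            status = "function"
        elif stem in seen:
            status = "given"  # candidate for deaccenting
        else:
            status = "new"
            seen.add(stem)
        marked.append((token, status))
    return marked
```

For the sentence above, with dog among the content stems, the first mention is labelled new and the later dogs is labelled given, which is where a deaccenting rule would apply.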

5.3.4 Prosodic Phrasing

The final topic that we address is the problem of chunking a long sentence into prosodic phrases. In reading a long sentence, speakers will normally break the sentence up into several phrases, each of which can be said to stand alone as an intonational unit. If punctuation is used liberally so that there are relatively few words between the commas, semicolons or periods, then a reasonable guess at an appropriate phrasing would be simply to break the sentence at the punctuation marks, though this is not always appropriate [O'S89]. The real problem comes when long stretches occur without punctuation; in such cases, human readers would normally break the string of words into phrases, and the problem then arises of where to place these breaks.

The simplest approach is to have a list of words, typically function words, that are likely indicators of good places to break [Kla87]. One has to use some caution however, since while a particular function word like and may coincide with a plausible phrase break in some cases, in other cases it might coincide with a particularly poor place to break: I was forced to sit through a dog and pony show that lasted most of Wednesday afternoon.
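The word-list approach, and the way it goes wrong on this example, can be shown with a minimal sketch. The list of break-indicating words is a hypothetical one chosen for illustration.

```python
# Assumed list of function words taken to signal a phrase break before them.
BREAK_BEFORE = {"and", "but", "or", "that", "which", "because"}


def naive_phrases(sentence):
    """Start a new prosodic phrase before every word in BREAK_BEFORE."""
    phrases = [[]]
    for word in sentence.split():
        if word.lower() in BREAK_BEFORE and phrases[-1]:
            phrases.append([])
        phrases[-1].append(word)
    return [" ".join(phrase) for phrase in phrases]
```

On the sentence above this produces "I was forced to sit through a dog", "and pony show", "that lasted most of Wednesday afternoon": the break before that is plausible, but the break before and splits the fixed expression dog and pony show.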

An obvious improvement would be to incorporate an accurate syntactic parser and then derive the prosodic phrasing from the syntactic groupings: prosodic phrases usually do not coincide exactly with major syntactic phrases, but the two are typically not totally unrelated either. Prosodic phrasers that incorporate syntactic parsers are discussed in [O'S89,BF90]. O'Shaughnessy's system relies on a small lexicon of (mostly function) words that are reliable indicators of the beginnings of syntactic groups: articles such as a or the clearly indicate the beginnings of noun groups, for example. This lexicon is augmented by suffix-stripping rules that allow for part-of-speech assignment to words where this information can be predicted from the morphology. A bottom-up parser is then used to construct phrases based upon the syntactic-group-indicating words. Bachenko and Fitzpatrick employ a somewhat more sophisticated deterministic syntactic parser (FIDDITCH [Hin83]) to construct a syntactic analysis for a sentence; the syntactic phrases are then transduced into prosodic phrases using a set of heuristics.

But syntactic parsing sensu stricto may not be necessary in order to achieve reasonable predictions of prosodic phrase boundaries. [WH92] report on a corpus-based statistical approach that uses CART [BFOS84,Ril89] to train a decision tree on transcribed speech data. In training, the dependent variable was the human prosodic phrase boundary decision, and the independent variables were generally properties that were computable automatically from the text including: part of speech sequence around the boundary; the location of the edges of long noun phrases (as computable from automatic methods such as [Chu88,Spr94]); distance of the boundary from the edges of the sentence, and so forth.
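The kind of input such a tree is trained on can be sketched as a feature extractor over candidate boundaries. This is only an assumed, reduced feature set loosely following the list above; the feature names are hypothetical, the noun-phrase edges are taken as given, and the training of the tree itself (pairing each feature record with a human boundary decision) is not shown.

```python
def boundary_features(tagged_words, np_edges=frozenset()):
    """Return one feature record per candidate boundary between adjacent words.

    tagged_words: list of (word, part-of-speech tag) pairs.
    np_edges: positions b such that a long noun phrase starts or ends between
              word b-1 and word b (assumed to come from a separate analysis).
    """
    n = len(tagged_words)
    records = []
    for b in range(1, n):  # boundary b lies between word b-1 and word b
        records.append({
            "pos_before": tagged_words[b - 1][1],
            "pos_after": tagged_words[b][1],
            "words_from_start": b,
            "words_to_end": n - b,
            "at_np_edge": b in np_edges,
        })
    return records
```

Each record would serve as the independent variables for one training instance, with the transcribed break or no-break decision at that position as the dependent variable.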

5.3.5 Future Directions

This section has given an overview of a selected set of the problems that arise in the conversion of textual input into a linguistic representation suitable for input to a speech synthesizer, and has outlined a few solutions to these problems. As a result of these solutions, current high-end TtS systems produce speech output that is quite intelligible and in many cases quite natural. For example, in English it is possible to produce TtS output where the vast majority of words in a text are correctly pronounced, where words are mostly accented in a plausible fashion, and where prosodic phrase boundaries are chosen at mostly reasonable places. Nonetheless, even the best systems make mistakes on unrestricted text, and there is much room for improvement in the approaches taken to solving the various problems, though one can of course often improve performance marginally by tweaking existing approaches.

Perhaps the single most important unsolved issue that affects performance on many of the problems discussed in this section is that full machine understanding of unrestricted text is currently not possible, and so TtS systems can fairly be said to not know what they are talking about. This point comes up rather clearly in the treatment of accenting in English, though the point could equally well be made in other areas. As we noted above, previously mentioned items are often deaccented, and this would be appropriate for the second occurrence of dog in the sentence my son badly wants a dog, but I am allergic to dogs. But a moment's reflection will reveal that what is crucial is not the repetition of the word dog, but rather the repetition of the concept dog. That what is relevant is semantic or conceptual categories and not simply words becomes clear when one considers that one also would often deaccent a word if a conceptual supercategory of that word had been previously mentioned: My son wants a labrador, but I'm allergic to dogs. Various solutions involving semantic networks (such as WordNet) might be contemplated, but so far no promising results have been reported.
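The difference between word-level and concept-level givenness can be illustrated with a toy sketch. The IS_A table below is a hand-made, hypothetical stand-in for a semantic network; it is not drawn from WordNet, and nothing here should be read as a reported solution to the problem.

```python
# Hypothetical fragment of a concept hierarchy: child -> parent.
IS_A = {"labrador": "dog", "poodle": "dog", "dog": "animal"}


def ancestors(concept):
    """All supercategories of a concept, nearest first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def conceptually_given(concept, mentioned):
    """True if the concept, or one of its subcategories, was mentioned before."""
    return any(concept == m or concept in ancestors(m) for m in mentioned)
```

With labrador already mentioned, dog counts as given because it is a supercategory of labrador, whereas the reverse does not hold: mentioning dog does not make a later labrador given under this definition.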

Note that message-to-speech systems have an advantage over text-to-speech systems precisely in that message-to-speech systems in some sense know what they are talking about since one can code as much semantic knowledge into the initial message as one desires. But TtS systems must compute everything from orthography which, as we have seen, is not very informative about a large number of linguistic properties of speech.


