Festival is a relatively new text-to-speech system designed to be extensible, so that new voices and languages can be defined. Although Festival does not yet fully support the addition of new voices or languages, most of the existing modules can be parameterized with appropriate values to do so. This document describes the definition of a new Spanish voice in Festival, using both existing modules and new ones. The method we used for voice_abc is the general one described in the Festival system documentation.
This general method consists of setting appropriate values for the parameters of each sub-part, including lexical parameters (the set of phones and the letter-to-sound rules) and prosodic parameters (intonation and duration parameters).
The phoneset is the basic building block, and its definition is completely language dependent. For this reason, we defined a new phoneset for Mexican Spanish, following the method described in the Festival documentation.
Mexican Spanish, and Spanish in general, is a very regular language, lexically speaking. That is, Spanish pronunciation can easily be predicted from its orthography. For this reason, we do not require a large lexicon of pronunciations: most of the work can be done with well-defined letter-to-sound (LTS) rules, which predict the phonetic transcription of the incoming text, and with syllabification rules, which determine stress marks.
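As an illustration of how far orthography alone can go, the following is a minimal Python sketch of ordered, context-sensitive letter-to-sound rules. It is not the rule set used for voice_abc: the rule table, the phone symbols (for example "ch", "rr", "ny", "x") and the function name letter_to_sound are assumptions made for this example, and only a handful of well-known Spanish spelling conventions are covered.

```python
# Illustrative only: the phone symbols and the rule table are assumptions,
# not the phoneset or LTS rules of the voice described in this document.
STRIP_ACCENTS = str.maketrans("áéíóú", "aeiou")

# Each rule: (letters, letters that must follow or None, phones emitted).
# Rules are tried in order, so longer and more specific rules come first.
RULES = [
    ("ch", None, ["ch"]),
    ("ll", None, ["y"]),      # yeismo: "ll" and "y" share one phone
    ("rr", None, ["rr"]),
    ("qu", None, ["k"]),      # the "u" is silent
    ("gu", "ei", ["g"]),      # silent "u" before e/i
    ("g", "ei", ["x"]),
    ("c", "ei", ["s"]),       # seseo: no distinct "th" sound
    ("c", None, ["k"]),
    ("z", None, ["s"]),
    ("j", None, ["x"]),
    ("v", None, ["b"]),
    ("h", None, []),          # silent letter
    ("ñ", None, ["ny"]),
    ("ü", None, ["u"]),       # "gü" keeps the "u" sound
    ("x", None, ["k", "s"]),  # words such as "México" would need a lexicon entry
]

def letter_to_sound(word):
    """Return a list of phone symbols for one Spanish word."""
    w = word.lower().translate(STRIP_ACCENTS)
    phones, i = [], 0
    while i < len(w):
        for letters, right, out in RULES:
            nxt = w[i + len(letters): i + len(letters) + 1]
            context_ok = right is None or (nxt != "" and nxt in right)
            if w.startswith(letters, i) and context_ok:
                phones.extend(out)
                i += len(letters)
                break
        else:
            # No rule matched: the letter maps to itself,
            # except word-final "y", which is treated as the vowel "i".
            is_final_y = w[i] == "y" and i == len(w) - 1
            phones.append("i" if is_final_y else w[i])
            i += 1
    return phones
```

For example, letter_to_sound("guerra") applies the "gu" rule and then the "rr" rule, while letter_to_sound("gente") falls through to the "g before e/i" rule. The small number of exceptions (such as "x" in place names) is what a short lexicon would cover.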
Defining a new LTS rule set is quite simple in Festival, and we created one for Mexican Spanish in just a few minutes. Defining a new syllabification rule system, however, is not a simple task. The existing syllabification module was written for English rather than Spanish and is fairly language dependent, so using it for Spanish considerably affects the resulting intonation.
For this reason, we decided to create a new syllabification function for Mexican Spanish. This function uses lexical rules to syllabify Spanish words and assign stress, solving both syllabification and stress prediction at the word level.
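The kind of rules involved can be sketched in Python as follows. This is a simplified illustration based on standard Spanish orthographic conventions, not the function written for voice_abc; the names syllabify and stressed_index, the ONSETS table, and the treatment of "y" as a consonant are assumptions of this sketch, and it works on spelling rather than on phones.

```python
# Simplified sketch: standard orthographic rules only, names are hypothetical.
STRONG = set("aeoáéó")            # open vowels
WEAK = set("iuü")                 # unaccented close vowels, can form diphthongs
ACCENTED = set("áéíóú")
VOWELS = STRONG | WEAK | set("íú")
# Consonant pairs that stay together at the start of a syllable.
ONSETS = {"pl", "pr", "bl", "br", "tr", "dr", "cl", "cr",
          "gl", "gr", "fl", "fr", "ch", "ll", "rr"}

def syllabify(word):
    """Split a Spanish word (as spelled) into a list of syllables."""
    w = word.lower()
    # 1. Group adjacent vowels into nuclei. Two vowels share a nucleus only
    #    if one of them is an unaccented i/u; accented í/ú force a hiatus.
    nuclei = []
    for i, ch in enumerate(w):
        if ch not in VOWELS:
            continue
        if nuclei and nuclei[-1][1] == i and (w[i - 1] in WEAK or ch in WEAK):
            nuclei[-1][1] = i + 1
        else:
            nuclei.append([i, i + 1])
    if not nuclei:
        return [w]
    # 2. Split each consonant cluster between two nuclei: the next syllable
    #    takes a valid two-letter onset, otherwise a single consonant.
    cuts = []
    for (_, end), (start, _) in zip(nuclei, nuclei[1:]):
        cluster = w[end:start]
        if len(cluster) >= 2 and cluster[-2:] in ONSETS:
            onset = 2
        else:
            onset = min(len(cluster), 1)
        cuts.append(start - onset)
    bounds = [0] + cuts + [len(w)]
    return [w[a:b] for a, b in zip(bounds, bounds[1:])]

def stressed_index(syllables):
    """Return the index of the stressed syllable."""
    for i, syl in enumerate(syllables):
        if any(ch in ACCENTED for ch in syl):
            return i                      # a written accent always wins
    if len(syllables) == 1:
        return 0
    # Words ending in a vowel, "n" or "s" stress the penultimate syllable;
    # all other words stress the last one.
    ends_plain = syllables[-1][-1] in "aeiouns"
    return len(syllables) - 2 if ends_plain else len(syllables) - 1
```

Because stress follows directly from the syllable structure and the written accent, solving both problems in one pass over the word is natural for Spanish, which is the design choice described above.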
Festival supports two mechanisms for phrase break prediction, both built for English: a simple one and a more elaborate one. The first is based on a simple CART tree and performs poorly compared with the second, which uses statistical information obtained from a large labeled database. As in the previous case, we can either use one of these built-in methods or define a new one, depending on how language dependent the phrase break prediction process is.
As expected, the phrasing process differs between English and Spanish, so we had to define a new phrasing function. Since we do not have a labeled database from which to build a probabilistic model, we built a CART tree that predicts simple phrase breaks from punctuation. This CART tree does not differ much from the original CART tree proposed in the Festival System Documentation.
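To make the idea of a punctuation-driven decision tree concrete, here is a minimal Python sketch. It is an assumption-laden illustration rather than the tree used for voice_abc: the tree is encoded as nested tuples, the break labels "BB" (big break), "B" (break) and "NB" (no break) are chosen for this example, and the tokenizer is a plain whitespace split.

```python
# Illustrative sketch: the tree shape, labels and tokenizer are assumptions.
# A node is (test, yes_branch, no_branch); a leaf is a label string.
TREE = (
    lambda tok: tok["punc"] in {".", "?", "!", ":"},
    "BB",
    (
        lambda tok: tok["punc"] in {",", ";"},
        "B",
        # The last word of the utterance always closes a phrase.
        (lambda tok: tok["is_last"], "BB", "NB"),
    ),
)

def predict(tree, token):
    """Walk the tree from the root until a leaf label is reached."""
    while not isinstance(tree, str):
        test, yes, no = tree
        tree = yes if test(token) else no
    return tree

def phrase_breaks(text):
    """Return (word, break_label) pairs for a whitespace-separated text."""
    raws = text.split()
    result = []
    for i, raw in enumerate(raws):
        core = raw.lstrip("¿¡")              # Spanish opening marks
        word = core.rstrip(".,;:?!")
        token = {
            "punc": core[len(word):][:1],   # first trailing mark, or ""
            "is_last": i == len(raws) - 1,
        }
        result.append((word, predict(TREE, token)))
    return result
```

For instance, phrase_breaks("Hola, ¿cómo estás? Bien") marks a break after "Hola", none after "cómo", and big breaks after "estás" and "Bien".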
We have created a new duration prediction function for Spanish that works at the syllable level and differs slightly from the duration prediction modules offered by Festival. This function assumes that the duration of a Spanish word depends on the number of syllables in the word rather than on the number of phonemes. That is, two words with the same number of syllables have similar durations, regardless of how many phonemes they contain. Although this is a simple assumption that does not take contextual information into account, it improves the naturalness of the resulting signal when compared with a simple average duration method.
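One way to read this assumption is sketched below in Python. The fixed per-syllable budget, its value of 0.20 seconds, and the even split among the phones of a syllable are all assumptions made for the example; the document does not state how the actual function distributes time within a syllable.

```python
# Hypothetical sketch: the constant and the even split are assumptions.
SYLLABLE_DURATION = 0.20  # seconds allotted to every syllable (made-up value)

def phone_durations(syllables):
    """syllables: list of phone lists. Return (phone, seconds) pairs.

    Every syllable receives the same time budget, shared equally by its
    phones, so a word's length depends only on its syllable count.
    """
    out = []
    for syl in syllables:
        share = SYLLABLE_DURATION / len(syl)
        out.extend((ph, share) for ph in syl)
    return out

def word_duration(syllables):
    """Total duration of a word given as a list of syllables."""
    return sum(seconds for _, seconds in phone_durations(syllables))
```

Under this sketch, "casa" (ca-sa, four phones) and "transpor" (trans-por, eight phones) both last about 0.40 seconds, whereas a per-phone average duration method would make the second word twice as long.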
Intonation is likewise a language-dependent module that produces odd results when used with a language other than the one for which it was built. Therefore, we had to define a new intonation module. This module resembles the general intonation module offered by Festival in its first two stages: the prediction of accents and the prediction of F0 target values. These two stages work at the syllable and word levels but not at the phrase level, which results in monotonous intonation. For this reason, we created a new stage that modifies the F0 contour at the phrase level.
Accent prediction consists of a simple accent CART tree. It uses the stress information predicted by the syllabification module to determine whether a syllable is accented, and the number of syllables in the word to determine whether the word is monosyllabic. After evaluating these conditions, the CART tree returns one of three possible accent models for each syllable: accented, single or none.
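The two tests can be written as a very small decision function. The Python sketch below is only one plausible reading: the text names the three labels but does not spell out which combination of conditions yields which label, so the mapping used here (stressed monosyllable gives "single", stressed syllable in a longer word gives "accented", everything else gives "none") is an assumption, as is the function name.

```python
# Assumed mapping from the two tests to the three labels; not confirmed
# by the text, which only names the inputs and the possible outputs.
def accent_model(is_stressed, syllables_in_word):
    """Return the accent model for one syllable."""
    if not is_stressed:
        return "none"
    if syllables_in_word == 1:
        return "single"      # stressed monosyllabic word
    return "accented"        # stressed syllable of a longer word

def word_accents(stressed_index, syllable_count):
    """Accent models for every syllable of a word, in order."""
    return [accent_model(i == stressed_index, syllable_count)
            for i in range(syllable_count)]
```

With this mapping, a three-syllable word stressed on its second syllable yields none, accented, none.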
The stage that predicts F0 target values returns a list of target points for each syllable, depending on the accent model predicted by the previous step. These values were fixed heuristically for each accent model and have not been exhaustively studied.
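A lookup of this kind might look like the following Python sketch. Every number in it is invented for illustration, and the representation of a target as a (relative position in the syllable, F0 in Hz) pair is an assumption; the actual heuristic values are not given in this document.

```python
# All values are made up for illustration; they are not the heuristic
# values used by the voice. A target is (relative position, F0 in Hz).
TARGETS = {
    "accented": [(0.0, 110.0), (0.5, 130.0), (1.0, 110.0)],
    "single":   [(0.0, 115.0), (0.5, 125.0), (1.0, 110.0)],
    "none":     [(0.5, 105.0)],
}

def syllable_targets(start, end, model):
    """Place the model's targets between the syllable's start and end times.

    Returns a list of (time in seconds, F0 in Hz) pairs.
    """
    length = end - start
    return [(start + rel * length, f0) for rel, f0 in TARGETS[model]]
```

An accented syllable thus contributes a small rise and fall, while an unaccented one contributes a single mid-syllable point.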
At the highest level of the intonation module is a new stage that incorporates different phrase intonation models. This stage takes a whole phrase and its related target points and fits them to an F0 contour. At the moment there are four possible contours: normal, comma, question and exclamation. These models are implemented as a single list of ordered pairs in Lisp. The addition of this step notably improves the naturalness of the resulting speech.
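The following Python sketch shows one way such a fitting step could work. It is not a transcription of the Lisp implementation: here each contour is assumed to be a list of (relative position in the phrase, scaling factor) pairs, the factors are invented, and the fitting is assumed to be linear interpolation followed by multiplication of each target's F0.

```python
# Hypothetical contours: (relative position in phrase, F0 scaling factor).
# The pair format, the numbers and the interpolation are all assumptions.
CONTOURS = {
    "normal":      [(0.0, 1.10), (1.0, 0.85)],               # gradual fall
    "comma":       [(0.0, 1.05), (0.8, 0.95), (1.0, 1.05)],  # slight final rise
    "question":    [(0.0, 1.00), (0.7, 0.95), (1.0, 1.30)],  # strong final rise
    "exclamation": [(0.0, 1.25), (1.0, 0.90)],               # high start, fall
}

def scale_at(contour, x):
    """Linearly interpolate the contour's factor at position x in [0, 1]."""
    for (x0, y0), (x1, y1) in zip(contour, contour[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("position outside the contour")

def apply_contour(targets, kind):
    """Scale a phrase's (time, F0) targets, sorted by time, by a contour."""
    t0, t1 = targets[0][0], targets[-1][0]
    span = (t1 - t0) or 1.0          # avoid dividing by zero for one target
    contour = CONTOURS[kind]
    return [(t, f0 * scale_at(contour, (t - t0) / span)) for t, f0 in targets]
```

Applied to a flat sequence of targets, the "normal" contour produces a falling line and the "question" contour ends higher than it starts, which is the kind of phrase-level variation the syllable and word level stages cannot provide on their own.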
At the moment, the components of voice_abc are quite basic in comparison with those of the English voices in Festival. Even so, using them makes a notable difference in the quality of the resulting speech. On the other hand, most of the modules were written as experimental work and have not been exhaustively explored. We hope to improve all the modules by defining a new phrasing function for Spanish and by using more contextual information, as well as new intonation models, in the prosodic analysis.