FSJ Issue #3:

Experiments with A Spoken Dialogue System for Taking the U.S. Census



Introduction

Every ten years, the U.S. Bureau of the Census (hereafter Census) attempts to collect information about each person in the United States of America. Most responses are collected via a printed questionnaire, which is mailed to each of the approximately 88 million (in 1990) residences in the U.S. The content of the questionnaire is mandated by the U.S. Congress. About 65 percent of the forms are filled out and returned. When forms are not returned, Census workers go to those residences to collect the information for each person living at each residence.

Between each decennial census, the Census Bureau investigates new procedures for taking the census. For example, the mailout/mailback form was first tested in 1960 and deployed in 1970; before that, census takers went ``door to door'' conducting interviews at each residence. In 1994 and 1995, spoken language technology was evaluated for possible use in the Year 2000 Census. This article reports the results of that study.

There are a number of excellent reasons to offer an automated spoken language system to the American public as an option for providing census information. According to a recent study, about 40 percent of the people living in the United States are functionally illiterate; they are unable to read well enough to understand instructions and complete a form [National Center for Education Statistics, 1993]. By comparison, almost everyone can understand and speak a language, and spoken language systems can be developed for different languages. Spoken language systems might therefore increase responsiveness.

Because speech is the most natural and efficient form of communication, a large segment of the population may prefer to give census information over the telephone. Moreover, spoken language systems are inexpensive relative to human interviewers, which could decrease the cost of taking the census, estimated at three to four billion dollars in the year 2000. Spoken language systems also provide the ability to offer immediate on-line help, as the caller can interrupt the automated interview if a question or problem occurs simply by saying ``operator,'' at which point the call can be transferred to a person.

The census task is ideally suited to the application of spoken language technology. In the census task, a highly structured yet natural dialogue can be designed to obtain the desired information. Through careful design of the wording of questions, or prompts, the system can constrain the caller to produce an acceptable range of responses. In highly constrained tasks with relatively small vocabularies, spoken language systems can produce acceptable performance today [Cole et al., 1995a]. Moreover, recognition technology is improving at a steady rate, and very low error rates can be expected by the year 2000, especially if large amounts of training data are used to capture the many sources of variability in the signal. For many reasons then--increased responsiveness, easy access, high accuracy, on-line help, and potential savings to the American taxpayer--spoken language systems present an excellent choice for the census task.

In this paper, we describe a study of the issues involved in developing a spoken language system for the census task. The goals of the study were to develop, deploy and evaluate an experimental prototype spoken language system that would interact with a person over the telephone to capture the information required for the short form for the Year 2000 Census. The system was designed to conduct a complete interview, to transfer the call (and all information about the interview thus far) to a human operator at the caller's request, and to enable operators to review and edit calls using a graphical user interface.

We describe three phases of the study. Our report of Phase 1 describes research leading to an initial system that demonstrated the feasibility of the approach. Our report of Phase 2 describes the system developed for the 1995 Census Test, and the results of the test. Phase 3 involves a final field test of an improved version of the system. All three phases used the same recognition architecture, which is described first. We conclude with a discussion of future activities that are needed for the decennial census to benefit from spoken language systems.

Speech Recognition with Neural Networks

Correct recognition of the spoken responses to the Census questions was critical to the success of the system. All the recognizers deployed were based on the use of neural networks. The system used two classes of recognizers: isolated word recognizers (with word spotting) for the short answers to most questions, and a name retrieval recognizer based on spelled letters.

This section describes both classes of recognizer. The training data came from a corpus collected during phase 1, described in Section 3, and from existing speech data.

In all cases, the incoming speech (8-bit mu-law encoded digital samples at a rate of 8 kHz) is converted to a representation suitable for recognition. We use perceptual linear predictive (PLP) analysis [Hermansky, 1990], which is based on linear predictive coding and takes into account some of the properties of human hearing. The seventh-order PLP coefficients and the energy in a window are computed every 10 msec and form a frame of speech.

Isolated Word Recognition.

Based on the results of our research, described below, we can expect the callers to provide concise responses to all questions. These responses will be either a single word (like ``male'') or a short utterance which can be treated as a single long word (like ``nineteen forty seven''). Each word in the recognition vocabulary is represented as a sequence of phonemes, with alternate sequences allowed. These pronunciations were derived from the manual phonetic labels from the phase 1 corpus. Recognition consists of two major steps: (1) computing phoneme probabilities for each frame, and (2) using a Viterbi search to find the most likely response given those probabilities.

Phoneme Probability Estimation.

We have achieved better recognition results by modeling speech at a finer level than the phoneme. We train neural networks to recognize--i.e., provide probabilities for--context-dependent phoneme parts. Each phoneme is divided into three states: an initial part that depends on the preceding phoneme, a middle part that does not depend on context, and a final part that depends on the next phoneme. For example, the word ``yes'' with phonetic representation j E s will be represented by nine network outputs .pau<j <j> j>E j<E <E> E>s E<s <s> s>.pau, where .pau represents silence before and after the utterance and the notation a<b represents the first part of the phoneme b following phoneme a; <b> represents the middle part of phoneme b; and b>c represents the last part of phoneme b preceding phoneme c. Similarly, the word ``no'' has six phoneme parts. The yes/no net, therefore, has 15 total phoneme parts plus five parts to model silence (.pau>j, .pau>n, s<.pau, oU<.pau and <.pau>) for 20 total outputs of the neural network. Other questions with more possible responses need considerably more. In the phase 2 protocol, every yes/no question also had to expect ``operator'' as a response, which increased the vocabulary and network size.
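The part inventory for the yes/no network can be reproduced with a short sketch. The function names below are illustrative and not taken from the original system; the notation and the phoneme strings follow the description above.

```python
SILENCE = ".pau"

def phoneme_parts(phonemes):
    """Return the context-dependent parts of one word in a<b, <b>, b>c notation."""
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    parts = []
    for i in range(1, len(padded) - 1):
        prev, cur, nxt = padded[i - 1], padded[i], padded[i + 1]
        parts.append(f"{prev}<{cur}")  # initial part, depends on the preceding phoneme
        parts.append(f"<{cur}>")       # middle part, independent of context
        parts.append(f"{cur}>{nxt}")   # final part, depends on the next phoneme
    return parts

def silence_parts(words):
    """Return the silence outputs needed before and after every word."""
    parts = {f"<{SILENCE}>"}
    for phonemes in words:
        parts.add(f"{SILENCE}>{phonemes[0]}")   # end of the leading silence
        parts.add(f"{phonemes[-1]}<{SILENCE}")  # start of the trailing silence
    return parts

YES = ["j", "E", "s"]
NO = ["n", "oU"]
yes_parts = phoneme_parts(YES)
no_parts = phoneme_parts(NO)
outputs = set(yes_parts) | set(no_parts) | silence_parts([YES, NO])

print(yes_parts)     # the nine parts of "yes"
print(len(outputs))  # 20 network outputs in total
```

Adding ``operator'' to the vocabulary, as in the phase 2 protocol, would simply extend the word list passed to these functions.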

All the networks had between 25 and 45 hidden units; the performance was not sensitive to the exact number. The networks were trained on a combination of two sources: the hand-transcribed portion of the phase 1 corpus, with automatically located phoneme boundaries, and a phonetically hand-labeled corpus of telephone speech [Cole et al., 1994c]. In the case of hand-labeled data, each labeled phoneme is cut into equal thirds and relabeled according to context as described above. The frames so labeled provide the training data for the network.

Using automatically located phoneme boundaries is important because of the large amount of time it takes to label phonemes manually. This process depends on the prior existence of a recognizer with the appropriate phoneme inventory (trained on a smaller set of hand-labeled data or on other data with the same phonemes). The recognizer is constrained to find the word said (known because the data is transcribed) and it automatically finds the highest-scoring placement of boundaries. These are then used to provide training frames for the new network.

In a further refinement of this process, the score from the automatic alignment was used to detect calls on which the recognizer had trouble; the boundaries for only these calls were manually checked and adjusted. This paid great dividends in performance.

The input to the networks is a selection of frame features for seven frames as shown in Figure 1. Each frame is represented by seven PLP coefficients plus energy, so there are 7 x 8 = 56 input features. We did not have the computational resources to train on every available frame of training data. Instead, we sub-sampled the data so that 1000 frames of each category were selected. The network is a three-layered perceptron, trained with a combination of gradient descent and conjugate-gradient optimization [Barnard and Cole, 1989], using a mean-squared criterion function (cf. [Barnard et al., 1995], [Fontaine et al., 1996], [Hutter, 1995], [Richard and Lippmann, 1991], [Bishop, 1995], [Konig et al., 1996] for the advantages and disadvantages of neural networks versus multivariate Gaussians).

Figure 1: PLP coefficients in a 160 msec window provide the input features for a network which computes the phoneme-part probabilities for the frame in the center.
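As a rough illustration of how the 56-element input vector might be assembled, the sketch below stacks seven frames around a center frame. The frame offsets and the handling of utterance edges are assumptions made for this example; the text states only that seven frames are taken from a 160 msec window.

```python
# Hypothetical offsets (in 10 msec frames) relative to the center frame.
OFFSETS = (-7, -4, -2, 0, 2, 4, 7)
FEATURES_PER_FRAME = 8  # seven PLP coefficients plus energy

def network_input(frames, center, offsets=OFFSETS):
    """Concatenate the selected frames into one input vector.

    Offsets that fall outside the utterance reuse the nearest edge frame,
    which is an assumption of this sketch.
    """
    vector = []
    for off in offsets:
        idx = min(max(center + off, 0), len(frames) - 1)
        vector.extend(frames[idx])
    return vector

# Toy utterance: 30 frames, each holding 8 placeholder feature values.
frames = [[float(t)] * FEATURES_PER_FRAME for t in range(30)]
x = network_input(frames, center=10)
print(len(x))  # 56
```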

Viterbi Search.

The phonetic classification produces an estimate of the probability that each phoneme part is present at each 10 msec time frame. A Viterbi search then combines this matrix of phoneme classification scores over time to decide which word was spoken. The probability of a word is simply the product of the individual frame probabilities of its constituent phoneme parts. Thus each word is expressed in terms of allowed sequences of phonemes. These sequences, which allow for alternate pronunciations, were derived from the 4,000-call corpus.

Because the system recognizes only isolated words (or phrases treated as a single word), it is computationally inexpensive to perform N-best search. This means the system finds not only the best response, but also remembers the N top-scoring words. This approach is used for confidence estimation as described below.
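The search can be sketched as a left-to-right alignment in the log domain, followed by a sort to obtain the N-best list. This is a minimal sketch under stated assumptions: there are no transition probabilities, every phoneme part must cover at least one frame, and the toy vocabulary and scores are invented for illustration.

```python
import math

NEG_INF = float("-inf")

def word_score(frames, states):
    """Best log score for aligning all frames to `states`, visited in order."""
    best = [NEG_INF] * len(states)
    best[0] = frames[0][states[0]]
    for frame in frames[1:]:
        new = [NEG_INF] * len(states)
        for s, name in enumerate(states):
            # Either stay in the same part or advance from the previous one.
            prev = best[s] if s == 0 else max(best[s], best[s - 1])
            if prev > NEG_INF:
                new[s] = prev + frame[name]
        best = new
    return best[-1]

def n_best(frames, vocabulary, n):
    """Return the n top-scoring (log score, word) pairs, best first."""
    scored = [(word_score(frames, states), word) for word, states in vocabulary.items()]
    scored.sort(reverse=True)
    return scored[:n]

# Toy example with four made-up phoneme parts and four frames.
probs = [
    {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1},
    {"a": 0.6, "b": 0.2, "c": 0.1, "d": 0.1},
    {"a": 0.1, "b": 0.7, "c": 0.1, "d": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.05, "d": 0.05},
]
frames = [{k: math.log(v) for k, v in f.items()} for f in probs]
vocabulary = {"ab": ["a", "b"], "cd": ["c", "d"]}
ranking = n_best(frames, vocabulary, n=2)
print(ranking[0][1])  # ab
```

Because each vocabulary entry is scored independently, keeping the top N words costs only a sort over the vocabulary.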

Word Spotting.

While the vast majority of the responses are succinct, there are enough responses with extraneous speech that this problem needed to be dealt with. In addition, there is a great deal of background noise in many of the calls. To overcome these problems, we implemented a simple word-spotting approach in which all words and sounds not in the target set match a single garbage model. We use the approach described in [Boite et al., 1993], in which the output score for the garbage word is computed as the median value of the top N phoneme scores for each frame, where N varies with the task and is set empirically. The grammar for the responses is [garbage/silence] [target word] [garbage/silence], where garbage and silence are optional. For example, if ``yes'' or ``no'' is the expected response but the uttered phrase is ``yes I am'', then the phoneme sequence making up ``I am'' is not expected and may not even be represented on the network output. The garbage word will therefore score higher than any of the keywords on these frames, and ``yes garbage'' will be detected.
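The per-frame garbage score can be sketched in a few lines. The frame scores and the value of N below are invented for illustration; only the rule itself (median of the top N phoneme scores) comes from the text.

```python
from statistics import median

def garbage_score(frame_scores, n):
    """Median of the n highest phoneme scores in one frame."""
    top = sorted(frame_scores, reverse=True)[:n]
    return median(top)

# One frame of made-up network outputs.
frame = [0.6, 0.2, 0.1, 0.05, 0.05]
g = garbage_score(frame, n=3)
print(g)  # 0.2
```

On a frame like this one, a target phoneme part scoring 0.6 beats the garbage model, while a target part scoring 0.05 loses to it, which is how out-of-vocabulary stretches end up assigned to garbage.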

Confidence.

The recognizer always returns a target response. This means that if the system is trying to recognize ``male'' or ``female'' and the caller just coughs or asks ``Do I press a button or what?'' then either ``male'' or ``female'' will be recognized. A confidence score is assigned to these matches so they can be rejected. Having a continuous score rather than a binary decision gives the dialogue module more information so it can make an appropriate response. An ideal confidence score will be low for all incorrect responses and high for all correct responses.

The confidence score is computed as the difference in average frame scores for the two top-scoring target words, which is compared to a threshold that depends on the two words. Thus for each recognizer we have a table of empirically determined thresholds; if the difference in scores falls below the appropriate threshold, then the recognition result is rejected. If the number of keywords is large (as in the case of the year and day recognizers), a single threshold is used.
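A minimal sketch of this rejection rule follows. The threshold table, the default threshold and the scores are hypothetical values chosen for the example; the input is assumed to be a list of (average frame score, word) pairs sorted best first.

```python
def confidence(n_best):
    """Difference in average frame score between the two top-scoring words."""
    (s1, w1), (s2, w2) = n_best[0], n_best[1]
    return s1 - s2, w1, w2

def accept(n_best, thresholds, default_threshold):
    """Accept the top word unless its margin falls below the pair's threshold."""
    margin, w1, w2 = confidence(n_best)
    threshold = thresholds.get((w1, w2), default_threshold)
    return margin >= threshold

# Hypothetical per-pair thresholds for the sex question.
THRESHOLDS = {("male", "female"): 1.0, ("female", "male"): 0.75}

clear = [(-0.5, "male"), (-2.0, "female")]
doubtful = [(-0.5, "male"), (-0.9, "female")]
print(accept(clear, THRESHOLDS, default_threshold=0.5))     # True
print(accept(doubtful, THRESHOLDS, default_threshold=0.5))  # False
```

For large vocabularies such as the year recognizer, the table would be left empty so that the single default threshold always applies.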

Name Retrieval.

The name retrieval algorithm is more complex than the other recognition tasks. It uses the OGI alphabet recognizer [Cole et al., 1992] together with a database of first and last names. The alphabet recognizer has two stages. The first is essentially similar to the isolated word recognizer described above, where the words are the letters of the alphabet and any number of letters is allowed, separated by pauses. A second pass reclassifies the letters found in the first pass using a neural network. Unlike the frame-based networks, the features for this second pass come from the whole word and are designed to highlight acoustic features which are important cues for recognizing the English alphabet. For example, spectral features are heavily sampled just after the vowel onset because formant motion here can distinguish the preceding consonant. The network for the second pass has 212 input features, 32 hidden units and 27 outputs--one for each letter of the alphabet and one for noise. The noise output is used for all the sounds which the first pass recognized as a letter but which were not in fact a letter.

The second pass results in a vector of letter probabilities for every letter found by the first pass segmenter. This is analogous to the vector of phoneme probabilities found by the frame-based neural network in the first pass, and retrieval of names from the database is done using a Viterbi search. The database is organized as a tree in which all names with the same prefix share nodes in the tree. This results in a significant savings in retrieval time. The tree rapidly becomes bushy but low-scoring branches are pruned from the search before this happens.

The name-retrieval search assigns a probability to every name in the database for which the search path stayed above the pruning threshold. These acoustic probabilities are further multiplied by the a priori probability of the names as derived from an online copy of the Seattle white pages plus data provided by Census. The idea is that if two names are acoustically similar and one is much more common, it is preferred. The relative weight of the acoustic and prior probabilities was set empirically using a development corpus.

For example, the two names ``Ned'' and ``Nat'' would require five nodes in the name tree. They would share the N, from which there would be two branches for A and E. The N-A node would have a branch for T and the N-E node would have a branch for D. If the first letter had a score of .5 for the letter N, then the N node would have a score of .5. If the second letter had a score of .1 for A and .7 for E, then the N-A node would have a score of .05 and the N-E node would have a score of .35. Finally, if the third letter had a score of .1 for D and .4 for T, then the N-E-D node would have a score of .035 and the N-A-T node would have a score of .02. If this were the last letter, then the score would be retrieved for all names and ``Ned'' would win. However, if ``Nat'' is a sufficiently more common name, then the order could be reversed.
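The ``Ned''/``Nat'' example can be traced with a small prefix tree. This is a sketch, not the original implementation: the data structure, the function names and the prior values are assumptions, and the relative weighting of acoustic and prior scores is omitted.

```python
END = "$"  # key marking the end of a complete name

def build_tree(names):
    """Build a prefix tree in which names with a common prefix share nodes."""
    root = {}
    for name in names:
        node = root
        for letter in name:
            node = node.setdefault(letter, {})
        node[END] = name
    return root

def count_nodes(node):
    """Count letter nodes below `node`."""
    return sum(1 + count_nodes(child) for key, child in node.items() if key != END)

def retrieve(tree, letter_probs, prune=0.0):
    """Score names whose length matches the number of spelled letters."""
    frontier = [(1.0, tree)]
    for probs in letter_probs:
        expanded = []
        for score, node in frontier:
            for letter, child in node.items():
                if letter == END:
                    continue
                s = score * probs.get(letter, 0.0)
                if s > prune:  # drop low-scoring branches
                    expanded.append((s, child))
        frontier = expanded
    return {node[END]: score for score, node in frontier if END in node}

tree = build_tree(["NED", "NAT"])
letter_probs = [{"N": 0.5}, {"A": 0.1, "E": 0.7}, {"D": 0.1, "T": 0.4}]
acoustic = retrieve(tree, letter_probs)

# Hypothetical prior probabilities; "Nat" is assumed to be more common.
priors = {"NED": 0.001, "NAT": 0.004}
rescored = {name: score * priors[name] for name, score in acoustic.items()}

print(count_nodes(tree))                   # 5
print(max(acoustic, key=acoustic.get))     # NED
print(max(rescored, key=rescored.get))     # NAT
```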

Phase 1: Demonstrating Feasibility

The goal of the Phase 1 research was to demonstrate the feasibility of using a spoken language system to conduct a census interview. The research performed and the system developed during this phase of the project are described elsewhere [Cole et al., 1993, Cole et al., 1994b, Cole et al., 1994a, Barnard et al., 1995] and are reviewed here only briefly.

For the feasibility study, the system was designed (a) to engage a cooperative speaker in a structured yet natural dialogue to obtain specific information about the speaker; (b) to review this information so that the speaker could identify responses that were recognized correctly and incorrectly by the system; and (c) to provide a graphical user interface for easy review and editing of the callers' responses.

The system was required to capture and recognize the following information automatically: (1) full name, (2) sex, (3) birth date, (4) marital status (now married, widowed, divorced, separated, or never married), (5) Hispanic origin (yes or no; if yes: Mexican, Mexican-American, Chicano, Puerto Rican, Cuban or other (specify)), (6) race (White, Black or Negro, American Indian (specify tribe), Eskimo, Aleut, Chinese, Japanese, Filipino, Asian Indian, Hawaiian, Samoan, Korean, Guamanian, Vietnamese or other (specify)).

The Phase 1 study consisted of (a) designing a dialogue that would be natural to use, yet constrain the speaker to produce concise and informative responses, (b) collecting a corpus of speech data using the system prompts, (c) using the corpus both to evaluate the effectiveness of the prompts, and to train recognizers for the expected words, (d) building a working system, and (e) evaluating the system to determine feasibility. We evaluated the recognition performance of the system, while our colleagues at the Census Bureau evaluated user satisfaction.

Dialogue Design and Evaluation

The goal of this research, described in detail in [Sutton et al., 1995], was to refine the selection and wording of the system prompts and to design a natural dialogue allowing conversational repair and review of the recognized information. It also resulted in a training corpus for the neural networks used by the recognizer.

Data Collection and Coding

System prompts were evaluated by collecting, transcribing and analyzing speech data from about 4,000 callers from twelve cities. Callers were solicited by the Census Bureau; respondents were typically Census employees or their friends and relatives. All responses to the prompts were transcribed at the word level (time-aligned to the waveform). A behavioral code was assigned to each prompt, indicating if the response fit classifications such as ``informative'' or ``concise.'' The behavioral coding scheme is described in [Sutton et al., 1995]. Data collection and coding were labor intensive, requiring the full-time efforts of four people for over six months.

Evaluation of the Prompts

Analysis of the behavioral codes was undertaken for at least 1,200 callers for each prompt. Our analyses of the protocols focused on three questions: (1) What percentage of callers completed the protocol? (2) What percentage of responses contained the desired information? (3) What percentage of responses could be recognized with a small vocabulary key-word spotter?

Completion. We eliminated spurious data such as wrong number and crank calls and then totaled the number of callers who completed the protocol. Only 2.2 percent failed to complete the protocol.

Desired Information. Next we analyzed how many responses provided the requested information. Table 1 shows the percentage of responses that contain the desired information for each prompt. The percentage of informative responses ranged from 99.1 to 99.9 percent.

Conciseness. A detailed analysis was made of the distribution of informative responses. It was found that about 97 percent of the responses contained the desired word, either by itself (``Male'') or in a common phrase (``I'm male''). About 3 percent of the informative responses did not contain the exact word or phrase, but did provide the desired information. Subsequent analysis of this category revealed that, in fact, well over half of the responses were concise, and could be recognized without natural language processing. For example, instead of the target word ``White'' the caller may have said ``Caucasian.''

Table 1: Percentage of responses which are informative. (Note: aa1, aa2 and aa3 denote the following behavioral codes: (1) aa1: respondent provided the desired word (e.g., ``male'' in response to the gender question); (2) aa2: desired word was provided in a common phrase (e.g., ``I'm male''); (3) aa3: the response did not contain the exact word or phrase but did provide the desired information.)

System Overview

Figure 2 depicts the structure of the phase 1 system. The caller answers questions generated by the dialogue manager. The caller's recorded utterance is stored for later playback, and recognized according to the grammar specified by the dialogue manager. The recognized words and their confidence are passed to the dialogue manager, which stores the information in a database and generates the next prompt. An operator can access and modify the information database after listening to the responses.

The recognition was described in Section 2. The dialogue manager and operator monitoring are described in this section.

Figure 2: Overview of the system

Dialogue Manager

Even with cooperative users, unexpected dialogue situations may arise. Some responses will inevitably fall outside the preferred response set. The ability to cope with unexpected dialogue situations is essential to achieving good system robustness. The Phase 1 system included three strategies for dialogue repair:

  1. Detecting breakdowns. The system is capable of identifying certain difficulties when they arise (e.g., low confidence for a recognized response).
  2. Recovering from breakdowns. Repair strategies supported by the Phase 1 system include repeating the question if confidence is low, confirming the response if confidence is medium, and taking the best guess and continuing with the next question if the system fails to recognize a response on a second attempt.
  3. Review, followed by confirmation or identification of errors. The dialogue concluded with a summary of the information recognized by the system. The caller was asked if the system's information was correct; if not, the caller was asked to identify the incorrect categories. A human operator could resolve errors either during or following the call, as described in the next section.
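The first two strategies can be sketched as a per-question loop. The confidence bands and the behavior after a denied confirmation (asking again) are assumptions of this example; `recognize` and `confirm` are hypothetical callbacks standing in for the recognizer and the yes/no confirmation prompt.

```python
LOW, HIGH = 0.3, 0.7  # hypothetical confidence bands

def handle_question(recognize, confirm, max_attempts=2):
    """Ask one question; recognize() -> (word, confidence), confirm(word) -> bool."""
    word = None
    for _attempt in range(max_attempts):
        word, conf = recognize()
        if conf >= HIGH:
            return word, "accepted"
        if conf >= LOW and confirm(word):
            return word, "confirmed"
        # Low confidence (or a denied confirmation): repeat the question.
    # Second failure: keep the best guess and move on to the next question.
    return word, "best_guess"

# Scripted recognizer outputs for three callers.
high = iter([("male", 0.9)])
medium = iter([("female", 0.5)])
low = iter([("male", 0.1), ("male", 0.2)])

print(handle_question(lambda: next(high), lambda w: True))    # ('male', 'accepted')
print(handle_question(lambda: next(medium), lambda w: True))  # ('female', 'confirmed')
print(handle_question(lambda: next(low), lambda w: True))     # ('male', 'best_guess')
```

Answers that end as a best guess are natural candidates for the low-confidence color coding in the operator interface.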

Operator Monitoring

The system included a graphical user interface that enabled a human operator to monitor active calls or review calls at a later time, since all utterances and system responses were saved. Each recognition response was displayed as text, color coded to indicate the system's confidence. The operator could listen to the caller's responses and change the record to correct recognition errors.

System Evaluation

Recognition Performance

Word Recognition

We developed task-dependent recognizers for each question. They were trained on the hand-transcribed portion of the corpus, using automatically located phoneme boundaries. Table 2 shows the system performance for the transcribed portion of the test set. Only responses containing a target word were run. The data collected for this task is noisier than other corpora we have collected. It is also regionally very diverse.

Table 2: System performance on test calls which contain one of the target words.

Name Recognition

The OGI name retrieval algorithm is designed for spelling with pauses between the letters [Fanty et al., 1992]. However, callers were not asked to pause during the collection of this corpus in order to create a data set on which to develop fluent letter recognition. Other than retraining the classifier, the system architecture has not yet changed; in particular, alternate segmentations are not considered. This has lowered the percentage correct significantly.

The name retrieval algorithm was modified to use prior probabilities estimated from name counts in the Seattle white pages, augmented with a list of 50,000 last names. For last names, there were 236 responses in the test set (the remainder were not yet transcribed). Of these, 190 were in the list derived from the Seattle phone book. The system correctly identified 135/190 for 71 percent.

For first names, there were 259 responses in the test set. Of these, 238 were in the list derived from the Seattle phone book. The system correctly identified 192/238 for 81 percent.

Census Focus Group

The spoken language system was ported to the Census Bureau, where a test of the system was conducted using 40 subjects. Half of the subjects used the spoken language system to conduct the census interview, and half used a human operator. The results of this study revealed that the subjects generally liked the system: ``Overall, respondents reacted favorably towards the computerized questionnaire. About 90 percent of the respondents in both the pre- and the posttest said that they were willing to answer the census by computer... In general, respondents were impressed with the system, ranking it just below a comparable interviewer-administered questionnaire [Jenkins and Kemper, 1994].''

Summary of Feasibility Study

Phase 1 of the OGI Census project demonstrated that a voice questionnaire could be designed to capture information on the census form with high reliability. The results of the Phase 1 research showed that careful structuring of the task dialogue can produce concise and informative responses on the census task. About 97 percent of all responses contained the desired word or phrase. Another 2 percent contained the appropriate information (but in a form the system could not recognize) and would require processing by human operators; presumably, these responses would be flagged automatically by the system. The system's recognition accuracy was also judged to be acceptable for most response categories. The Census Bureau's focus study, while short on subjects, produced positive user experiences. (The study also produced comments by a few subjects that motivated Census to alter the protocol.) On the basis of the Phase 1 results, Census judged that feasibility was demonstrated, and Phase 2 of the project was initiated.

Phase 2: The 1995 Census Trial

The 1995 Census Trial was held between March 2nd and April 16th, 1995. The OGI spoken language system was brought on line at the Census facility at Jeffersonville, Indiana on March 8 and received its first valid call on March 14. The official test of the system was concluded on April 10.

Unfortunately, only 17 calls were received during the Census Trial. Several factors combined to create an effective barrier to the spoken language system. Participants in the 1995 census test were given no prior knowledge that a spoken language system was available as an option, and the user interface for reaching the system provided many obstacles to obtaining this information. Only those participants who called a Census operator were told about the system and they had the option of doing a live interview on the spot. The 1995 Census test included 400,000 households. Only 172 did phone interviews, and only 17 of these chose to try the spoken language system. Due to the low volume of calls, an additional two-day test was conducted with Census employees on April 11 and 12.

The system developed for the Census trial was designed to capture all of the information on the census short form for all members of a household. In preparation for the trial, we focused on four activities: (a) redesigning the protocol; (b) creating a Graphical User Interface (GUI) for operators to review and edit calls, and to complete interviews handed off by the system; (c) developing a system architecture to handle multiple calls and forward calls to operators when needed; and (d) developing new recognizers for the spoken dialogue system.

System Development and Deployment

Protocol Development

The protocol used in the 1995 Census test was designed to include all of the instructions and capture all of the information on the printed census short form used in the trial. General purpose questions prompted the caller for (a) the census number (an identification number on the printed form), (b) name and phone number, (c) address, and (d) home ownership and financing. In this portion of the interview, a list of the members in the household was created. The completeness of the list was verified with a coverage question, reviewing who should and who should not be reported as members of a household. For each person in the household, the system prompted the caller for the following information: (a) first and last name; (b) gender; (c) date of birth; (d) age; (e) whether they are of Spanish or Hispanic origin; (f) the relationship of each person in the household to the first person; and (g) race.

In addition to the changes in the protocol necessitated by the additional questions and the inclusion of all household members, the protocol was changed in two fundamental ways. First, the system attempted to verify its recognition of each utterance produced by the caller (except for yes/no responses). If the system was confident of its recognition response, it repeated the recognized word or phrase and asked the caller to verify it with a yes/no response. If the system was not confident, it repeated the prompt and then attempted to verify the recognized response. If the confidence was still low on the second attempt, the system proceeded to the next question.

This change in the protocol was required by Census, based on feedback from subjects in their focus group. The Phase 1 system reviewed the information that was recognized all at once; e.g., ``Here is a summary of the information recorded about you: First name: John; Last name: Smith; Sex: male; Date of birth: May 2, 1966, Age: 29; Origin: Non-Hispanic; Race: White; Property ownership: Loaned or mortgaged. Is this information correct? Please say yes or no. If no then: A human operator can correct information the system has wrong. Please list all of the following items that were incorrect: name, sex, date of birth, origin, race or property ownership.'' Some subjects in the focus study reported difficulty remembering all of the information, so the protocol was changed to verify each response.

A second change was that the caller was able to say ``operator'' at any time a response was requested (i.e., after hearing the tone indicating their turn to speak). If the system recognized ``operator,'' it confirmed that the caller wanted to speak to an operator with a yes/no question, then transferred the call. The system would also transfer the caller to an operator if two consecutive questions were skipped (i.e., when the system obtained low confidence on four consecutive responses).
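The Phase 2 control flow (per-response verification, the ``operator'' keyword, and transfer after two consecutive skipped questions) might be sketched as follows. The confidence threshold, the status labels and the callback names are hypothetical, and flagging a response the caller rejects is an assumption about what happens after a ``no''.

```python
def run_interview(questions, recognize, caller_says_yes, threshold=0.5):
    """recognize(q) -> (word, confidence); caller_says_yes(q, word) -> bool."""
    record = {}
    consecutive_skips = 0
    for q in questions:
        word, status = None, "skipped"
        for _attempt in range(2):  # original prompt, then one repeat
            word, conf = recognize(q)
            if word == "operator":
                if caller_says_yes(q, "operator"):
                    return record, "transferred: caller request"
                continue
            if conf >= threshold:
                # Repeat the recognized answer and ask for yes/no verification.
                status = "verified" if caller_says_yes(q, word) else "flagged"
                break
        record[q] = (word, status)
        consecutive_skips = consecutive_skips + 1 if status == "skipped" else 0
        if consecutive_skips == 2:  # four low-confidence responses in a row
            return record, "transferred: two skipped questions"
    return record, "completed"

# Scripted recognizer results for a toy call.
script = {
    "sex": [("male", 0.9)],
    "year": [("1947", 0.2), ("1947", 0.3)],
    "race": [("white", 0.1), ("white", 0.1)],
}
record, outcome = run_interview(
    ["sex", "year", "race", "age"],
    recognize=lambda q: script[q].pop(0),
    caller_says_yes=lambda q, word: True,
)
print(outcome)  # transferred: two skipped questions
```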

Graphical User Interface

The graphical user interface (GUI) enabled Census operators to review and edit calls, and to conduct interviews that were passed to the operator by the system. The GUI was designed in collaboration with Census operators from Jeffersonville. Using the GUI, operators could select, listen to and edit any response provided by the caller. All responses that the caller indicated to be incorrect, as well as low-confidence responses (i.e., skipped questions), were highlighted in red. Responses verified by the caller as correct were indicated in blue. Figure 3 shows a call record as displayed on the GUI.

When a caller wished to ``bail out'' to an operator, the system flashed a notice on all the operator stations. Operators could signal their readiness to take a call by clicking on the flashing notice. If an operator did not respond to a bail-out within five seconds, an auditory alarm was used to summon an operator to the station; operators were always nearby, so that unattended stations could be staffed smoothly and rapidly.

Figure 3: The operator interface

System Architecture and Implementation

The system was implemented using a distributed architecture of seven computers connected on a local area network (LAN) in the configuration shown in Figure 4.

Figure 4: Distributed architecture of the deployed Census system

The interface to the public telephone network was a set of standard telephony boards inside an Intel 486 personal computer running the Solaris operating system. A T1 line from the telephone network connected to a DianaTel EA24 channel bank and the resulting 24 voice channels then passed through a DianaTel SS96 crossbar switch. Each operator had a headset telephone connected to the public telephone network. The crossbar switch permitted dynamic routing of incoming calls from users and outgoing calls to the operators for editing and hand off of live calls.

Each voice channel then entered one of 24 AT&T 32C DSPs on three Linkon FC3000 boards. The system was implemented on a set of Digital Alpha workstations running the OSF/1 operating system and connected to the telephony PC on a TCP/IP LAN.

The system architecture is shown in Figure 5 and is most clearly explained in terms of an incoming call. A separate process on the telephony PC monitors each incoming line. This controlling process detects ring and answers the call by playing the introductory welcome and instructions. Then the dialogue is started as programmed by a script for this controlling process. A typical interaction consists of a prompt, followed by a recording that is shipped to the database server and the recognition server. The recognizer returns a result that determines the next action in the script. When the call terminates due to respondent bail-out, the switch is configured to reroute the call to an operator and the process is reset to wait for the next call.

While the prompt is playing, the controlling process interacts with a multi-threaded load-balancing server, which returns the information necessary to connect to a recognition server. This load-balancing server assigns machines in round-robin fashion and monitors the various recognition servers, starting and stopping them as needed.
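
As an illustration of round-robin assignment, the following minimal Python sketch hands out recognition servers in turn and skips any that have been marked unavailable. The class, its method names and the host names in the usage note are hypothetical; the deployed load-balancing server also started and stopped recognizers, which is not modeled here.

```python
import itertools
import threading


class RoundRobinBalancer:
    """Assigns recognition servers in round-robin order (illustrative only)."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._available = set(self._servers)
        self._cycle = itertools.cycle(self._servers)
        # A lock is used because several line-controlling processes
        # may request a server at the same time.
        self._lock = threading.Lock()

    def mark_down(self, server):
        """Exclude a server from assignment, e.g. after a failed health check."""
        with self._lock:
            self._available.discard(server)

    def mark_up(self, server):
        """Make a known server eligible for assignment again."""
        with self._lock:
            if server in self._servers:
                self._available.add(server)

    def next_server(self):
        """Return the next available server, or None if none is available."""
        with self._lock:
            # At most one full pass over the server list.
            for _ in range(len(self._servers)):
                candidate = next(self._cycle)
                if candidate in self._available:
                    return candidate
            return None
```

With three hypothetical hosts, `RoundRobinBalancer(["alpha1", "alpha2", "alpha3"])` returns them in that order on successive calls to `next_server()` and then starts again from the first.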

The system maintains a centralized database; access to the database is governed by a database process that handles locking and keeps a history of changes. Both the telephone-line-controlling processes and the operator-editing processes interact with this multi-threaded server.
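
A database process of this kind can be approximated, for illustration, by an in-memory store that serializes access with a lock and appends every change to a history log. The sketch below is an assumption about the general shape of such a component, not a description of the actual implementation; all names and the record layout are hypothetical.

```python
import threading
import time


class CallDatabase:
    """In-memory stand-in for the centralized database process (illustrative)."""

    def __init__(self):
        self._records = {}   # (call_id, field) -> current value
        self._history = []   # append-only log of every change
        self._lock = threading.Lock()

    def update(self, call_id, field, value, source):
        """Store a new value and log what it replaced and who changed it."""
        with self._lock:
            key = (call_id, field)
            old = self._records.get(key)
            self._records[key] = value
            self._history.append({
                "time": time.time(),
                "call_id": call_id,
                "field": field,
                "old": old,
                "new": value,
                "source": source,  # e.g. "recognizer" or "operator"
            })

    def get(self, call_id, field):
        """Return the current value of a field, or None if it is unset."""
        with self._lock:
            return self._records.get((call_id, field))

    def history(self, call_id):
        """Return all logged changes for one call, oldest first."""
        with self._lock:
            return [h for h in self._history if h["call_id"] == call_id]
```

In this sketch both the line-controlling code and the operator-editing code would call `update`, so an operator correction leaves the original recognizer output visible in the history.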

Figure 5: Logical components of the deployed Census system

System Deployment

Deploying the spoken language system at the Jeffersonville site turned out to be one of the major challenges of the project. Problems were encountered in connecting the system to the telephone network, and in routing calls to operators when the caller desired to bail out of the interview, but were eventually overcome.

System Development

The spoken dialogue system was modified to accommodate the expanded protocol, verification of each response, and operator hand-off. The changes involved (a) reprogramming the dialogue manager and (b) developing new recognizers for each prompt. New recognizers were required to handle the recognition vocabulary corresponding to each new prompt, and to accommodate changes in the recognition vocabulary caused by changes in the wording of certain prompts requested by Census (e.g., in the race question, ``Black or Negro'' was changed to ``White, Black, African American or Negro''). Moreover, because the caller could bail out to an operator after any prompt, it was necessary to incorporate the word ``operator'' into the recognition vocabulary for each response category. Thus, the neural network previously trained with phonetic output categories sufficient to discriminate ``male'' vs. ``female'' had to be retrained with output categories sufficient to discriminate ``male,'' ``female,'' and ``operator.'' No new data were collected to train the recognizers for the Census test. The recognizers were trained on OGI's phonetically labeled speech corpora [Cole et al., 1995b].

The Two-Day Census Load Test

Because so few calls were received during the actual test, a two-day load test was performed to assess the ability of the system to handle multiple simultaneous calls. For this test, Census workers in six divisions at Census headquarters were sent a memo asking them to test the system. The memo asked each person to call the system twice, once to report data for one person, and once to report data for two people.

System Performance. An analysis of the system performance is given in Tables 3 and 4. A large percentage of the spelled first- and last-name responses were not in our respective first- and last-name dictionaries but were still considered useful (Table 3, last column) because they were legitimate but infrequent names. We believe that the relatively large number of out-of-vocabulary names in the load test was probably caused by the Bureau's instructions to callers to provide ``unusual responses.''

Table 3: Analysis of performance of system in eliciting information from the user during the two-day Census load test.

Table 4: Recognition performance during the two-day Census load test.

Caller Impressions. After completing the questionnaire, each caller was asked if they would prefer to give the census using a written form or a spoken questionnaire. Of the 98 responses, 67 preferred using a written questionnaire to the system. It is not clear how this number should be interpreted, as the subjects were Census employees following a directive to call the system.

Operator Impressions. Census operators interacted with the system during the census test and load test to edit calls and to complete interviews. Three operators were interviewed after the tests. The operators were unanimous in their praise of the GUI; they found it effective and easy to use.

Summary of Census Field Tests

During the Census trial and the load test, the system was online about 85 percent of the time, and after some initial problems stability was achieved during the two-day load test for multiple simultaneous calls. The system received high marks from the operators and poor marks from callers. The lengthy instructions and need to verify each response made the system frustrating to use.

Phase 3: OGI Field Test of an Improved System

The goal of the Phase 3 effort, now underway, is to determine if an improved version of the spoken language system is the preferred option for taking census information for many Americans. Our working hypothesis is that an automated spoken questionnaire--if done well--will be the preferred option for most people, and can produce superior responsiveness and accuracy at a significant reduction in the cost of taking the census.

As a first step to demonstrate the viability of an automated voice questionnaire, we created a new system that is easier to access, more usable and takes less time to complete. The system was then tested on 133 subjects.

Protocol Design

The principal changes to the protocol included:

Solicitation of Callers

Letters were mailed to 250 households inviting people to try the system. The mailing list consisted of people who had responded previously to an OGI data collection effort and agreed to be put on a mailing list to participate in future projects, for which they would receive an award coupon. Potential callers were asked to have information about the name, sex, date of birth, origin, race and relationship of each person living at their address on May 1, 1995.

Clearly the sample for this field test is biased in favor of a voice questionnaire, because participants are selected from those who are willing to participate in experiments involving computer speech recognition. The results should therefore be viewed as a ``best case'' scenario for evaluating the current system.

Results of the OGI Field Test

The system was called by 133 people, of whom 112 (84 percent) completed the protocol. Of the 112 who completed the protocol, 87 (78 percent) indicated that they would prefer to give census information with a spoken language questionnaire rather than a written questionnaire.

The results of the field test are displayed in Tables 5, 6 and 7.

Table 5: Analysis of performance of system in eliciting information from the user.

Table 6: Average time to complete a call as a function of household size

Table 7: Recognition performance.

Table 6 presents a histogram of the number of persons for whom information was provided per household, and the average time per call for each number. Table 5 shows the number and percentage of responses that contained the desired information. The number of desired responses to the relation1 query was particularly low. This question was phrased as follows: ``Is this person your grandparent, parent, spouse, sibling, natural or adopted child, stepchild, sibling, grandchild, or other?'' The vast majority of undesirable responses occurred when the caller responded with a ``yes'' to this question.

Table 7 displays the system's recognition performance for responses that contained the desired information. Again the recognition performance of the relation2 query is low. This was partly due to a set of easily confused keywords and the fact that no data could be collected in Phase 1 of the study, since this query was introduced at a later date. A proper protocol design and data collection would have led to a better phrasing of the question, a set of less confusable keywords and the ability to use a specialized recognizer with acceptable performance.

Discussion

Taking a census consists of a number of subtasks: establishing contact with the population, asking them questions, recording and codifying the answers, entering the information into a database, analyzing and summarizing the information, and presenting and distributing the information.

This study used a spoken language system to ask the questions, to record and codify the answers, and to flag those answers that could not be recognized with certainty. A large data collection effort and two field tests showed that, when questions are asked correctly, the answers contain information within the desired response categories about 99 percent of the time. This result contradicts the common belief among many speech researchers that responses to computer generated prompts are highly variable. It appears that callers respond appropriately to questions if they have a clear idea of the task and its goals, and if the dialogue and prompts are well designed. We have demonstrated, at least for the subject populations in these experiments, that census data are captured very well using a spoken language system.

We also evaluated our system's ability to recognize spoken responses, in order to automate the process of codifying the information for database entry. About 97 percent of the responses to questions contained words in the system's recognition vocabulary, thereby defining the upper limit of performance of our word-spotting system. For response categories with small vocabularies--such as questions requiring yes/no answers, gender and month born--error rates were less than 2 percent. If about 5 percent of the responses are selected for operator review based on a measure of recognition confidence, the error rate for these categories falls to less than 1 percent. For recognition of numbers (day of the month and year born), which requires discriminating more acoustically similar words (e.g., ``fifteenth'' vs. ``sixteenth''), error rates were as high as 25 percent. We can expect that both recognition performance and measures of confidence will improve steadily as we collect more training data, analyze errors, and find new ways to improve the technology through research. For example, additional work on recognition of year responses has improved recognition accuracy to 94 percent.
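
The effect of routing low-confidence responses to an operator can be made concrete with a short Python sketch. It assumes each response carries a numeric confidence score and a correctness label; the function name and the data in the usage note are hypothetical and are not results from this study.

```python
def error_rate_after_review(responses, review_fraction):
    """Estimate the residual error rate after operator review.

    responses: list of (confidence, is_correct) pairs.
    review_fraction: share of responses, lowest confidence first,
        that is sent to an operator.

    Returns (number_reviewed, error_rate_among_unreviewed).
    """
    ranked = sorted(responses, key=lambda r: r[0])
    n_review = int(round(review_fraction * len(ranked)))
    kept = ranked[n_review:]          # responses accepted automatically
    if not kept:
        return n_review, 0.0
    errors = sum(1 for _, is_correct in kept if not is_correct)
    return n_review, errors / len(kept)
```

For a synthetic set of 100 responses in which the two misrecognized ones also have the lowest confidence, reviewing 5 percent removes both errors from the automatically accepted set. In practice the benefit depends on how well the confidence measure separates correct from incorrect recognitions.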

One of the most interesting results of the current study was the dramatic influence that changes in the protocol had upon user satisfaction. In Phase 1, all responses were verified once at the end. This was done to minimize distraction during the protocol. We were especially concerned that a response which was not correctly recognized, even on the second attempt (and therefore needed operator review), would upset and distract the caller and affect all subsequent responses. However, the Phase 1 study showed that reviewing all the information at once in a voice system was confusing for some callers.

In Phase 2, therefore, verification was done for each response. Callers did not like this and gave an unfavorable rating to the system. The system that was most favorably received had no verification; it simply asked each question, repeated it once if recognition was uncertain, and proceeded to the next question. It appears that the Phase 3 system may provide a viable means of taking census information, and this system can serve as a basis for future research.
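
The no-verification policy can be summarized as a small control loop. The Python sketch below is one possible reading of that policy: `recognize` is a hypothetical callable returning an answer and a confidence score, and the threshold value is an arbitrary assumption rather than a figure from the study.

```python
def ask_question(prompt, recognize, threshold=0.5):
    """Ask once, repeat once if recognition is uncertain, then move on.

    recognize(prompt) is a hypothetical callable that plays the prompt and
    returns (answer, confidence). The caller is never asked to verify.

    Returns (answer, needs_review), where needs_review marks a response
    that should be flagged for an operator.
    """
    answer, confidence = recognize(prompt)
    if confidence >= threshold:
        return answer, False
    # Single repetition of the question.
    answer, confidence = recognize(prompt)
    return answer, confidence < threshold


def run_interview(prompts, recognize, threshold=0.5):
    """Run every prompt in order and collect the results per prompt."""
    return {p: ask_question(p, recognize, threshold) for p in prompts}
```

The design choice illustrated here is that uncertainty is handled by flagging a response for later review instead of interrupting the caller with a confirmation step.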

We showed that a good graphical user interface provides an excellent means of reviewing and editing information. Since each response was saved, an operator was able to listen to any response and verify if it was recognized correctly. In the Census test, operators were also able to use the GUI to complete the interview.

We learned first hand how much work is involved in deploying a research system for a real task--even a field trial such as this. Unexpected difficulties, such as how to pass a call from the system to an operator's terminal over a T-1 link, proved very time consuming.

The neural-network recognizers did moderately well, except on the numbers task (year and day). There was a tremendous amount of effort involved in collecting and labeling the training corpus. This emphasizes the need to automate the process: we need to work on general-purpose recognizers, speaker adaptation, automatic means of deriving confidence thresholds and tools to facilitate all this. Work on context-dependent modeling is progressing at OGI, with significant progress due to new training techniques and context clustering.

Taken together, the results of this project showed that the most important component of a spoken dialogue system is the dialogue. A successful system gives instructions efficiently, establishes expectations for the user, asks questions that constrain the possible responses, and proceeds in a straightforward manner to complete the interview. In form filling tasks, such as taking a census, in which operators can correct errors, it is more important to correctly identify uncertain responses than to have the user verify them.

Perspective

As we have noted, spoken language systems represent a remarkable opportunity for taking a national census. Current procedures are inefficient and expensive. Printed forms are sent to many illiterate persons, and four of ten households do not return the forms. Clearly, alternative procedures, such as spoken language systems, provide a viable alternative to filling out printed forms for many people. Using spoken language systems should increase responsiveness to the census and decrease cost.

The 1995 Census test was a test of the printed questionnaire. It was designed to test new forms and new procedures for handling them. Spoken language technology was never regarded as a viable option and was never properly tested.

To evaluate the value of spoken language technology for census taking, it is necessary to envisage scenarios in which the technology can be deployed most effectively, and then design experiments to test these scenarios. We hope that the encouraging results of the Phase 1 and Phase 3 studies will motivate additional work on this exciting project.

Acknowledgments

This research was supported by funds from the U.S. Bureau of the Census, the National Science Foundation and the U.S. Office of Naval Research. We are grateful to Digital Equipment Corporation, which donated part of the equipment needed to run the field trial. We thank Karen Van Vactor for her valuable insights on the GUI design and are grateful to the personnel of the Census Bureau's center in Jeffersonville for their hospitality and support. We especially appreciate the enthusiastic support of Martin Appel and Larry Malakhoff throughout this project.

References

Barnard et al., 1995
Barnard, E., Cole, R., Fanty, M., and Vermeulen, P. (1995). Real-world speech recognition with neural networks. In Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications 2, pages 186-193. Lawrence Erlbaum Associates, Hillsdale, NJ.

Barnard and Cole, 1989
Barnard, E. and Cole, R. A. (1989). A neural-net training program based on conjugate-gradient optimization. Technical Report CSE 89-014, Oregon Graduate Institute, 20000 N.W. Walker Rd., Beaverton, OR.

Bishop, 1995
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Boite et al., 1993
Boite, J., Bourlard, H., D'Hoore, B., and Haesen, M. (1993). A new approach toward keyword spotting. In Proceedings of the 3rd European Conference on Speech Communication and Technology, pages 1273-1276, Berlin.

Cole et al., 1995a
Cole, R., Hirschman, L., Atlas, L., Beckman, M., Bierman, A., Bush, M., Cohen, J., Garcia, O., Hanson, B., Hermansky, H., Levinson, S., McKeown, K., Morgan, N., Novick, D., Ostendorf, M., Oviatt, S., Price, P., Silverman, H., Spitz, J., Waibel, A., Weinstein, C., Zahorian, S., and Zue, V. (1995a). The challenge of spoken language systems: Research directions for the nineties. IEEE Transactions on Speech and Audio Processing, 3(1):1-21.

Cole et al., 1994a
Cole, R., Novick, D., Burnett, D., Hansen, B., Sutton, S., and Fanty, M. (1994a). Towards automatic collection of the U.S. census. In Proceedings of ICASSP'94, volume I, pages 93-96, Adelaide, Australia.

Cole et al., 1993
Cole, R., Novick, D., Fanty, M., Sutton, S., Hansen, B., and Burnett, D. (1993). Rapid prototyping of spoken-language systems: The Year 2000 Census project. In Proceedings of the Conference on Spoken Language Systems, pages 19-23, Tokyo, Japan.

Cole et al., 1994b
Cole, R., Novick, D., Fanty, M., Vermeulen, P., Sutton, S., and Burnett, D. (1994b). A prototype voice-response questionnaire for the U.S. Census. In Proceedings of ICSLP-94, pages 683-686.

Cole et al., 1994c
Cole, R. A., Noel, M., Burnett, D. C., Fanty, M., Lander, T., Oshika, B., and Sutton, S. (1994c). Corpus development activities at the Center for Spoken Language Understanding. In Proceedings of the ARPA Workshop on Human Language Technology.

Cole et al., 1995b
Cole, R. A., Noel, M., Lander, T., and Durham, T. (1995b). New telephone speech corpora at CSLU. In Proceedings of Eurospeech95, Madrid, Spain.

Cole et al., 1992
Cole, R. A., Roginski, K., and Fanty, M. (to appear, 1992). English alphabet recognition with telephone speech. In Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Fanty et al., 1992
Fanty, M., Cole, R. A., and Roginski, K. (1992). English alphabet recognition with telephone speech. In Moody, J. E., Hanson, S. J., and Lippman, R. P., editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.

Fontaine et al., 1996
Fontaine, V., Ris, C., Leich, H., Vantieghen, J., Accaino, S., and Compernolle, D. (1996). Comparison between two hybrid HMM/MLP approaches in speech recognition. In Proceedings 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing.

Hermansky, 1990
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of Acoustical Society of America, 87(4):1738-1752.

Hutter, 1995
Hutter, H. (1995). Comparison of a new hybrid connectionist-SCHMM approach with other hybrid approaches for speech recognition. In Proceedings 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3311-3314.

Jenkins and Kemper, 1994
Jenkins, C. and Kemper, J. (1994). Report on respondents' attitudes towards a computer administered voice-recognition census short form. Internal report, Statistical Research Division, U.S. Bureau of the Census, October 28, 1994.

Konig et al., 1996
Konig, Y., Bourlard, H., and Morgan, N. (1996). Remap - experiments with speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 3350-3353.

National Center for Education Statistics, 1993
National Center for Education Statistics (1993). Adult literacy in America. Technical Report GPO 065-000-00588-3, U.S. Department of Education, Washington, DC.

Richard and Lippmann, 1991
Richard, M. D. and Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3:461-483.

Sutton et al., 1995
Sutton, S., Hansen, B., Lander, T., Novick, D., and Cole, R. (1995). Evaluating the effectiveness of dialogue for an automated spoken questionnaire. Technical Report CSE95-12, Department of Computer Science and Engineering, Oregon Graduate Institute.


Stephen Sutton
Mon Feb 24 14:45:12 PST 1997