
6.3 Dialogue Modeling

Phil Cohen
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA

6.3.1 Research Goals

Two related, but at times conflicting, research goals are often adopted by researchers of dialogue. First is the goal of developing a theory of dialogue, including, at least, a theory of cooperative task-oriented dialogue, in which the participants are communicating in service of the accomplishment of some goal-directed task. The often unstated objectives of such theorizing have generally been to determine:

A second research goal is to develop algorithms and procedures to support a computer's participation in a cooperative dialogue. Often, the dialogue behavior being supported may bear only a passing resemblance to human dialogue. For example, database question-answering [ARP93] and frame-filling dialogues [Bil91,BGS90,BPUG77] are simplifications of human dialogue behavior in that the former consists primarily of the user asking questions and the system providing answers, whereas the latter involve the system prompting the user for information (e.g., a flight departure time). Human-human dialogues exhibit much more varied behavior, including clarifications, confirmations, and other communicative actions. Some researchers have argued that because humans interact differently with computers than they do with people [DJ92,FG91], the goal of developing a system that emulates real human dialogue behavior is neither an appropriate nor an attainable target [DJ92,Shn80]. In contrast, others have argued that the usability of current natural language systems, especially voice-interactive systems in a telecommunications setting, could benefit greatly from techniques that allow the human to engage in behavior found in their typical spoken conversations [KD91]. In general, no consensus exists on the appropriate research goals, methodologies, and evaluation procedures for modeling dialogue.

Three approaches to modeling dialogue (dialogue grammars, plan-based models of dialogue, and joint action theories of dialogue) will be discussed from both theoretical and practical perspectives.

6.3.2 Dialogue Grammars

One approach with a relatively long history has been that of developing a dialogue grammar [PS84,Rei81,SC75]. This approach is based on the observation that there exist a number of sequencing regularities in dialogue, termed adjacency pairs [SSJ78], describing such facts as that questions are generally followed by answers, proposals by acceptances, etc. Theorists have proposed that dialogues are a collection of such act sequences, with embedded sequences for digressions and repairs [Jef72]. For some theorists, the importance of these sequences derives from the expectations that arise in the conversants for the occurrence of the remainder of the sequence, given the observation of an initial portion. For instance, on hearing a question, one expects to hear an answer. People can be seen to react to behavior that violates these expectations.

Based on these observations about conversations, theorists have proposed using phrase-structure grammar rules, following the Chomsky hierarchy, or equivalently, various kinds of state machines. The rules state sequential and hierarchical constraints on acceptable dialogues, just as syntactic grammar rules state constraints on grammatically acceptable strings. The terminal elements of these rules are typically illocutionary act names [Aus62,Sea69], such as request, reply, offer, question, answer, propose, accept, and reject. The non-terminals describe various stages of the specific type of dialogue being modeled [SC75], such as initiating, reacting, and evaluating. For example, the SUNDIAL system [ABC90,And92,Bil91,BGS90,GS88] uses a 4-level dialogue grammar to engage in spoken dialogues about travel reservations. Just as syntactic grammar rules can be used in parsing sentences, it is often thought that dialogue grammar rules can be used in parsing the structure of dialogues. With a bottom-up parser and top-down prediction, it is expected that such dialogue grammar rules can predict the set of possible next elements in the sequence, given a prior sequence [GWF90]. Moreover, if the grammar is context-free, parsing can be accomplished in polynomial time.
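To make the idea of a dialogue grammar concrete, the following is a minimal sketch of a recognizer for a toy context-free grammar over act names, assuming the rules DIALOGUE -> EXCHANGE+ and EXCHANGE -> opener EXCHANGE* closer. The grammar, the PAIRS table, and the function names are illustrative assumptions and are not taken from SUNDIAL or any other cited system.

```python
# Hypothetical adjacency pairs: each initiating act maps to the acts that may close it.
PAIRS = {
    "question": {"answer"},
    "propose": {"accept", "reject"},
}


def parse_exchange(acts, i):
    """Parse one exchange starting at acts[i]; return the index just past it, or None."""
    if i >= len(acts) or acts[i] not in PAIRS:
        return None
    closers = PAIRS[acts[i]]
    i += 1
    # Zero or more embedded exchanges (e.g., a clarification sub-dialogue).
    while i < len(acts) and acts[i] in PAIRS:
        i = parse_exchange(acts, i)
        if i is None:
            return None
    # The exchange is complete only if a matching closing act follows.
    if i < len(acts) and acts[i] in closers:
        return i + 1
    return None


def is_dialogue(acts):
    """True if acts is a non-empty sequence of complete exchanges."""
    i = 0
    while i is not None and i < len(acts):
        i = parse_exchange(acts, i)
    return bool(acts) and i is not None


# A question answered only after an embedded clarification question is accepted.
print(is_dialogue(["question", "question", "answer", "answer"]))
# A question closed by an acceptance violates the pairing and is rejected.
print(is_dialogue(["question", "accept"]))
```

Because opening and closing acts are disjoint in this toy grammar, a single left-to-right pass suffices; a realistic dialogue grammar would need a general context-free parser.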

From the perspective of a state machine, the speech acts become the state transition labels. When the state machine variant of a dialogue grammar is used as a control mechanism for a dialogue system, the system first recognizes the user's speech act from the utterance, makes the appropriate transition, and then chooses one of the outgoing arcs to determine the appropriate response to supply. When the system performs an action, it makes the relevant transition, and uses the outgoing arcs from the resulting state to predict the type of response to expect from the user [DJ92].
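The control regime just described can be sketched as a small transition table, assuming speech acts are labeled by speaker. The state names, act labels, and helper functions below are hypothetical and serve only to illustrate how outgoing arcs supply both candidate system responses and predictions about the user's next act.

```python
# Hypothetical transition table: state -> {speech act label -> next state}.
TRANSITIONS = {
    "idle": {"user:question": "owes_answer"},
    "owes_answer": {
        "system:answer": "awaiting_user",
        "system:question": "awaiting_reply",  # system asks for clarification
    },
    "awaiting_reply": {"user:answer": "owes_answer"},
    "awaiting_user": {"user:question": "owes_answer", "user:accept": "idle"},
}


def step(state, act):
    """Follow the arc labeled with the recognized or performed speech act."""
    try:
        return TRANSITIONS[state][act]
    except KeyError:
        raise ValueError(f"act {act!r} is not expected in state {state!r}")


def outgoing(state):
    """Arc labels leaving a state: candidate responses or expected user acts."""
    return sorted(TRANSITIONS.get(state, {}))


state = step("idle", "user:question")   # the user's act has been recognized
print(outgoing(state))                  # arcs the system may choose among
state = step(state, "system:question")  # the system asks a clarification question
print(outgoing(state))                  # acts now expected from the user
```

Note that the table only enumerates the available arcs. It does not say which arc the system should choose.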

Arguments against the use of dialogue grammars as a general theory of dialogue have been raised before, notably by [Lev81].

First, dialogue grammars require that the communicative action(s) being performed by the speaker in issuing an utterance be identified. In the past, this has been a difficult problem for people and machines, for which prior solutions have required plan recognition [AP80,Car90,Kau90,PA80]. Second, the model typically assumes that only one state results from a transition. However, utterances are multifunctional. An utterance can be, for example, both a rejection and an assertion, and a speaker may expect the response to address more than one interpretation. The dialogue grammar subsystem would thus need to be in multiple states simultaneously, a property typically not allowed. Third, dialogues contain many instances of speakers using multiple utterances to perform a single illocutionary act (e.g., a request). To analyze and respond to such dialogue contributions using a dialogue grammar, a calculus of speech acts needs to be developed that can determine when two speech acts combine to constitute another. Currently, no such calculus exists. Finally, and most importantly, the model does not say how a system should choose among the next moves, i.e., the states currently reachable, in order to play its role as a cooperative conversant. Some analogue of planning is thus required.

In summary, dialogue grammars are a potentially useful computational tool to express simple regularities of dialogue behavior. However, they need to function in concert with more powerful plan-based approaches (described below) in order to provide the input data, and to choose a cooperative system response. As a theory, dialogue grammars are unsatisfying as they provide no explanation of the behavior they describe, i.e., why the actions occur where they do, why they fit together into a unit, etc.

6.3.3 Plan-based Models of Dialogue

Plan-based models are founded on the observation that utterances are not simply strings of words, but rather are the observable performance of communicative actions, or speech acts [Sea69], such as requesting, informing, warning, suggesting, and confirming. Moreover, humans do not just perform actions randomly, but rather they plan their actions to achieve various goals, and in the case of communicative actions, those goals include changes to the mental states of listeners. For example, speakers' requests are planned to alter the intentions of their addressees. Plan-based theories of communicative action and dialogue [AP80,App85,Car90,CL90,CP79,PA80,Sad91,SI81] assume that the speaker's speech acts are part of a plan, and the listener's job is to uncover and respond appropriately to the underlying plan, rather than just to the utterance. For example, in response to a customer's question of "Where are the steaks you advertised?", a butcher's reply of "How many do you want?" is appropriate because the butcher has discovered that the customer's plan of getting steaks himself is going to fail. Being cooperative, he attempts to execute a plan to achieve the customer's higher-level goal of having steaks. Current research on this model is attempting to incorporate more complex dialogue phenomena, such as clarifications [LA90,YI91,LA87], and to model dialogue more as a joint enterprise, something the participants are doing together [CWG86,CL91b,GS90,GK93].

The major accomplishment of plan-based theories of dialogue is to offer a generalization in which dialogue can be treated as a special case of other rational noncommunicative behavior. The primary elements are accounts of planning and plan-recognition, which employ various inference rules, action definitions, models of the mental states of the participants, and expectations of likely goals and actions in the context. The set of actions may include speech acts, whose execution affects the beliefs, goals, commitments, and intentions of the conversants. Importantly, this model of cooperative dialogue solves problems of indirect speech acts as a side-effect [PA80]. Namely, when inferring the purpose of an utterance, it may be determined that not only are the speaker's intentions those indicated by the form of the utterance, but there may be other intentions the speaker wants to convey. For example, in responding to the utterance "There is a little yellow piece of rubber", the addressee's plan recognition process should determine that not only does the speaker want the addressee to believe such an object exists, but the speaker also wants the addressee to find the object and pick it up. Thus, the same plan-recognition process could analyze the utterance as an informative utterance and also as a request both to find the object and to pick it up.
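The butcher example can be used to illustrate the flavor of plan recognition by chaining from the effects of an observed act to the actions those effects enable. The following is a minimal sketch under strong simplifying assumptions: actions are propositional, chaining is greedy and linear, and all action names, preconditions, and effects are hypothetical encodings of the example rather than the formalism of any cited system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: tuple
    effects: tuple


# Hypothetical action definitions for the butcher example.
ACTIONS = [
    Action("ask_location", (), ("customer_knows_location",)),
    Action("self_service", ("customer_knows_location",), ("customer_has_steaks",)),
    Action("ask_quantity", (), ("butcher_knows_quantity",)),
    Action("butcher_serves", ("butcher_knows_quantity",), ("customer_has_steaks",)),
]


def infer_plan(observed, actions):
    """Chain upward: an effect of the current act enables the next act in the plan."""
    by_name = {a.name: a for a in actions}
    current = by_name[observed]
    chain, seen = [observed], {observed}
    while True:
        nxt = next(
            (a for a in actions
             if a.name not in seen
             and any(e in a.preconditions for e in current.effects)),
            None,
        )
        if nxt is None:
            return chain, current.effects
        chain.append(nxt.name)
        seen.add(nxt.name)
        current = nxt


def cooperative_response(goal, blocked, actions):
    """Pick an unblocked act achieving the goal; return an act enabling it if one is needed."""
    for act in actions:
        if goal in act.effects and act.name not in blocked:
            for pre in act.preconditions:
                for enabler in actions:
                    if pre in enabler.effects:
                        return enabler.name
            return act.name
    return None


chain, goals = infer_plan("ask_location", ACTIONS)
print(chain, goals)
# The customer's own plan step is assumed to fail, so the butcher plans around it.
print(cooperative_response(goals[0], {"self_service"}, ACTIONS))
```

In this encoding, the question about location is traced to the goal of having steaks, and the cooperative reply corresponds to asking how many steaks are wanted.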

Drawbacks of the Plan-based Approach

A number of theoretical and practical limitations have been identified for this class of models.

Illocutionary Act Recognition is Redundant:
Plan-based theories and algorithms have been tied tightly to illocutionary act recognition. In order to infer the speaker's plan, and determine a cooperative response, the listener (or system) had to recognize what single illocutionary act was being performed with each utterance [PA80], even for indirect utterances. However, illocutionary act recognition in the Allen and Perrault model [AP80,PA80] was shown to be redundant [CL80]; other inferences in the scheme provided the same results. Instead, it was argued that illocutionary acts could more properly be handled as complex action expressions, defined over patterns of utterance events and properties of the context, including the mental states of the participants [CL90]. Importantly, using this analysis, a theorist can show how multiple acts were being performed by a given utterance, or how multiple utterances together constituted the performance of a given type of illocutionary act. Conversational participants, however, are not required to make these classifications. Rather, they need only infer what are the speaker's intentions.

Discourse versus Domain Plans:
Although the model is capable of solving problems of utterance interpretation using nonlinguistic methods (e.g., plan-recognition), it does so at the expense of distinctions between task-related speech acts and those used to control the dialogue, such as clarifications [GS86,LA87,LA90]. To handle these prevalent features of dialogue, multilevel plan structures have been proposed, in which a new class of discourse plans is posited, which take task-level (or other discourse-level) plans as arguments [LA87,LA90,YI91]. These are not higher level plans in an inclusion hierarchy, but rather are metaplans, which capture the set of ways in which a single plan structure can be manipulated. Rather than infer directly how utterances further various task plans, as single-level algorithms do, various multilevel algorithms first map utterances to a discourse plan, and determine how the discourse plan operates on an existing or new task plan. Just as with dialogue grammars, multi-level plan recognizers can be used to generate expectations for future actions and utterances, thereby assisting the interpretation of utterance fragments [All79,AP80,Car85,Car90,Sid85], and even providing constraints to speech recognizers [And92,YI91,YHW89].
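The distinction between task plans and the discourse plans that take them as arguments can be sketched as follows. This is an illustrative assumption about one possible encoding, not the representation used in the cited multilevel algorithms; the names TaskPlan, introduce_plan, identify_parameter, and interpret are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPlan:
    """A task-level plan with named parameters; None marks an unfilled parameter."""
    goal: str
    params: dict = field(default_factory=dict)


def unresolved(plan):
    """Names of the parameters that the dialogue has not yet fixed."""
    return [name for name, value in plan.params.items() if value is None]


# Discourse plans (metaplans): each one operates on a task plan, not on the domain.
def introduce_plan(goal, param_names):
    return TaskPlan(goal, {name: None for name in param_names})


def identify_parameter(plan, name, value):
    if name not in plan.params:
        raise KeyError(name)
    # Return a new plan structure rather than mutating the existing one.
    return TaskPlan(plan.goal, {**plan.params, name: value})


DISCOURSE_PLANS = {"identify": identify_parameter}


def interpret(plan, discourse_act, *args):
    """Map an utterance's discourse act to a discourse plan applied to the task plan."""
    return DISCOURSE_PLANS[discourse_act](plan, *args)


plan = introduce_plan("book_flight", ["destination", "departure_time"])
print(unresolved(plan))
plan = interpret(plan, "identify", "departure_time", "09:00")
print(unresolved(plan))
```

The unresolved parameters give a simple source of expectations: a fragmentary reply such as a bare time can be interpreted as filling one of them.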

Complexity of Inference:
The processes of plan-recognition and planning are combinatorially intractable in the worst case, and in some cases, are undecidable [Byl91,Cha87,Kau90]. The complexity arises in the evaluation of conditions, and in chaining from preconditions to actions they enable. Restricted planning problems in appropriate settings may still be reasonably well-behaved, but practical systems cannot be based entirely on the kind of first-principles reasoning typical of general-purpose planning and plan-recognition systems.

Lack of a Theoretical Base:
Although the plan-based approach has much to recommend it as a computational model, and certainly has stimulated much informative research in dialogue understanding, it still lacks a crisp theoretical base. For example, it is difficult to express precisely what the various constructs (plans, goals, intentions, etc.) are, what the consequences are of ascribing those theoretical constructs to the user's mental state, and what kinds of dialogue phenomena and properties the framework can handle. Because of the procedural nature of the model, it is difficult to determine what analysis will be given, and whether it is correct, as there is no independently stated notion of correctness. In other words, what is missing is a specification of what the system should do. Section 6.3.4 discusses such an approach.

6.3.4 Future Directions

Plan-based approaches that model dialogue simply as a product of the interaction of plan generators and recognizers working in synchrony and harmony do not explain why addressees ask clarification questions, why they confirm, or even why they do not simply walk away during a conversation. A new theory of conversation is emerging in which dialogue is regarded as a joint activity, something that agents do together [CWG86,CL91b,GS90,GK93,Loc94,Sch81,Suc87]. The joint action model claims that both parties to a dialogue are responsible for sustaining it. Participating in a dialogue requires the conversants to have at least a joint commitment to understand one another, and these commitments motivate the clarifications and confirmations so frequent in ordinary conversation.

Typical areas in which such models are distinguished from individual plan-based models are reference and confirmation. Clark and colleagues [CWG86,Cla89] have argued that actual referring behavior cannot be adequately modeled by the simple notion that speakers provide noun phrases and listeners identify the referents. Rather, both parties offer noun phrases, refine previous ones, correct misidentifications, etc. They claim that people appear to be following the strategy of minimizing the joint effort involved in successfully referring. Computer models of referring based on this analysis are beginning to be developed [HH92,Edm93]. Theoretical models of joint action [CL91b,CL91a] have been shown to minimize the overall team effort in dynamic, uncertain worlds [JM92]. Thus, if a more general theory of joint action can be applied to dialogue as a special case, an explanation for numerous dialogue phenomena (such as collaboration on reference, confirmations, etc.) will be derivable. Furthermore, such a theory offers the possibility of providing a specification of what dialogue participants should do, which could be used to guide and evaluate dialogue management components for spoken language systems. Finally, future work in this area can also form the basis for protocols for communication among intelligent software agents.


