Two related, but at times conflicting, research goals are often adopted by researchers of dialogue. First is the goal of developing a theory of dialogue, including, at least, a theory of cooperative task-oriented dialogue, in which the participants are communicating in service of the accomplishment of some goal-directed task. The often unstated objectives of such theorizing have generally been to determine:
A second research goal is to develop algorithms and procedures to support a computer's participation in a cooperative dialogue. Often, the dialogue behavior being supported may bear only a passing resemblance to human dialogue. For example, database question-answering [ARP93] and frame-filling dialogues [Bil91,BGS90,BPUG77] are simplifications of human dialogue behavior in that the former consists primarily of the user asking questions and the system providing answers, whereas the latter involve the system prompting the user for information (e.g., a flight departure time). Human-human dialogues exhibit much more varied behavior, including clarifications, confirmations, and other communicative actions. Some researchers have argued that because humans interact differently with computers than they do with people [DJ92,FG91], the goal of developing a system that emulates real human dialogue behavior is neither an appropriate nor an attainable target [DJ92,Shn80]. In contrast, others have argued that the usability of current natural language systems, especially voice-interactive systems in a telecommunications setting, could benefit greatly from techniques that allow the human to engage in behavior found in their typical spoken conversations [KD91]. In general, no consensus exists on the appropriate research goals, methodologies, and evaluation procedures for modeling dialogue.
Three approaches to modeling dialogue will be discussed from both theoretical and practical perspectives: dialogue grammars, plan-based models of dialogue, and joint action theories of dialogue.
One approach with a relatively long history has been that of developing a dialogue grammar [PS84,Rei81,SC75]. This approach is based on the observation that there exist a number of sequencing regularities in dialogue, termed adjacency pairs [SSJ78], describing such facts as that questions are generally followed by answers, proposals by acceptances, etc. Theorists have proposed that dialogues are a collection of such act sequences, with embedded sequences for digressions and repairs [Jef72]. For some theorists, the importance of these sequences derives from the expectations that arise in the conversants for the occurrence of the remainder of the sequence, given the observation of an initial portion. For instance, on hearing a question, one expects to hear an answer. People can be seen to react to behavior that violates these expectations.
Based on these observations about conversations, theorists have
proposed using phrase-structure grammar rules, following the
Chomsky hierarchy, or equivalently, various kinds of state
machines. The rules state sequential and hierarchical constraints on
acceptable dialogues, just as syntactic grammar rules state
constraints on grammatically acceptable strings. The terminal
elements of these rules are typically illocutionary act names
[Aus62,Sea69], such as request, reply, offer, question,
answer, propose, accept, reject, etc. The non-terminals describe
various stages of the specific type of dialogue being modeled
[SC75], such as initiating, reacting, and
evaluating. For example, the SUNDIAL system
[ABC 90,And92,Bil91,BGS90,GS88]
uses a 4-level dialogue grammar to engage in spoken dialogues about
travel reservations. Just as syntactic grammar rules can be used in
parsing sentences, it is often thought that dialogue grammar rules can
be used in parsing the structure of dialogues. With a bottom-up
parser and top-down prediction, it is expected that such dialogue
grammar rules can predict the set of possible next elements in the
sequence, given a prior sequence [GWF90]. Moreover,
if the grammar is context-free, parsing can be accomplished in
polynomial time.
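The prediction step described above can be illustrated with a minimal sketch. The grammar below is a hypothetical toy (its rule names and act inventory are assumptions, not the rules of any cited system), and the recognizer is a naive top-down one that tracks every live derivation rather than the polynomial-time chart parser the text alludes to. It shows only how grammar rules yield a set of possible next acts for a given prefix.

```python
# Hypothetical toy dialogue grammar (an assumption for illustration only).
# Lower-case symbols are terminals (illocutionary act names);
# upper-case symbols are non-terminals (stages of the dialogue).
GRAMMAR = {
    "DIALOGUE": [("EXCHANGE",), ("EXCHANGE", "DIALOGUE")],
    "EXCHANGE": [("question", "REPLY"), ("propose", "RESPONSE")],
    # A reply may be preceded by an embedded exchange (e.g. a clarification).
    "REPLY": [("answer",), ("EXCHANGE", "answer")],
    "RESPONSE": [("accept",), ("reject",)],
}


def expand(stack):
    """Yield every stack reachable by rewriting the leftmost non-terminal
    until the first symbol is a terminal (or the stack is empty).
    Terminates because the toy grammar has no left recursion."""
    if not stack or stack[0] not in GRAMMAR:
        yield stack
        return
    head, rest = stack[0], stack[1:]
    for rhs in GRAMMAR[head]:
        yield from expand(rhs + rest)


def predict_next(acts):
    """Return (set of acts that may come next, whether the prefix is
    already a complete dialogue) for a sequence of observed acts."""
    stacks = set(expand(("DIALOGUE",)))
    for act in acts:
        advanced = set()
        for stack in stacks:
            if stack and stack[0] == act:
                advanced.update(expand(stack[1:]))
        stacks = advanced
    complete = () in stacks
    return {stack[0] for stack in stacks if stack}, complete


if __name__ == "__main__":
    print(predict_next(["question"]))
    print(predict_next(["question", "question", "answer"]))
```

After a single question, the sketch predicts an answer or the start of an embedded exchange; after an embedded question has been answered, only the outstanding answer remains predicted.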
From the perspective of a state machine, the speech acts become the state transition labels. When the state machine variant of a dialogue grammar is used as a control mechanism for a dialogue system, the system first recognizes the user's speech act from the utterance, makes the appropriate transition, and then chooses one of the outgoing arcs to determine the appropriate response to supply. When the system performs an action, it makes the relevant transition, and uses the outgoing arcs from the resulting state to predict the type of response to expect from the user [DJ92].
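The control loop just described can be sketched as follows. The states, the act names, and the `DialogueController` class are hypothetical and chosen only for illustration. The sketch assumes the user's speech act has already been recognized, and its response selection simply takes the first outgoing arc as a placeholder.

```python
# Hypothetical transition table: (state, speech act) -> next state.
TRANSITIONS = {
    ("start", "question"): "awaiting_answer",
    ("awaiting_answer", "answer"): "start",
    ("start", "propose"): "awaiting_response",
    ("awaiting_response", "accept"): "start",
    ("awaiting_response", "reject"): "start",
}


def outgoing(state):
    """Speech acts labelling the arcs that leave the given state."""
    return sorted(act for (source, act) in TRANSITIONS if source == state)


class DialogueController:
    """Illustrative controller driven by a state-machine dialogue grammar."""

    def __init__(self, state="start"):
        self.state = state

    def _advance(self, act):
        key = (self.state, act)
        if key not in TRANSITIONS:
            raise ValueError(f"act {act!r} not allowed in state {self.state!r}")
        self.state = TRANSITIONS[key]
        return outgoing(self.state)

    def observe_user(self, act):
        """Follow the arc for the user's (already recognized) act and
        return the candidate system responses."""
        return self._advance(act)

    def perform(self, act):
        """Follow the arc for the system's own act and return the
        user acts predicted to come next."""
        return self._advance(act)

    def choose_response(self, candidates):
        """Placeholder choice: the grammar itself gives no criterion,
        so this sketch arbitrarily takes the first candidate."""
        return candidates[0]


if __name__ == "__main__":
    controller = DialogueController()
    candidates = controller.observe_user("question")
    reply = controller.choose_response(candidates)
    print(reply, controller.perform(reply))
```

The arbitrary rule in `choose_response` is deliberate: the state machine enumerates the permissible moves but supplies no basis for selecting among them.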
Arguments against the use of dialogue grammars as a general theory of dialogue have been raised before, notably by [Lev81].
First, dialogue grammars require that the communicative action(s) being performed by the speaker in issuing an utterance be identified. In the past, this has been a difficult problem for people and machines, for which prior solutions have required plan recognition [AP80,Car90,Kau90,PA80]. Second, the model typically assumes that only one state results from a transition. However, utterances are multifunctional. An utterance can be, for example, both a rejection and an assertion, and a speaker may expect the response to address more than one interpretation. The dialogue grammar subsystem would thus need to be in multiple states simultaneously, a property typically not allowed. Third, dialogues contain many instances of speakers' using multiple utterances to perform a single illocutionary act (e.g., a request). To analyze and respond to such dialogue contributions using a dialogue grammar, a calculus of speech acts needs to be developed that can determine when two speech acts combine to constitute another. Currently, no such calculus exists. Finally, and most importantly, the model does not say how a system should choose among the next moves, i.e., the states currently reachable, in order to play its role as a cooperative conversant. Some analogue of planning is thus required.
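To make the second objection concrete, the following sketch tracks a set of active states instead of a single one, in the manner of simulating a nondeterministic automaton. The transition table and act names are hypothetical, and the sketch is not drawn from any of the cited systems. It shows only what "being in multiple states simultaneously" would amount to.

```python
# Hypothetical transition table: (state, speech act) -> next state.
TRANSITIONS = {
    ("start", "propose"): "awaiting_response",
    ("awaiting_response", "accept"): "start",
    ("awaiting_response", "reject"): "start",
    ("start", "assert"): "awaiting_acknowledgement",
    ("awaiting_response", "assert"): "awaiting_acknowledgement",
    ("awaiting_acknowledgement", "acknowledge"): "start",
}


def step(states, acts):
    """Advance every active state on every act the utterance performs.
    A multifunctional utterance is represented as a set of acts."""
    return {
        TRANSITIONS[(state, act)]
        for state in states
        for act in acts
        if (state, act) in TRANSITIONS
    }


def expected(states):
    """Acts that would be acceptable next from any active state."""
    return {act for (source, act) in TRANSITIONS if source in states}


if __name__ == "__main__":
    states = step({"start"}, {"propose"})
    # An utterance that is both a rejection and an assertion.
    states = step(states, {"reject", "assert"})
    print(sorted(states), sorted(expected(states)))
```

Even with this extension, the set of expected acts only widens; nothing in the table indicates which interpretation the response should address.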
In summary, dialogue grammars are a potentially useful computational tool to express simple regularities of dialogue behavior. However, they need to function in concert with more powerful plan-based approaches (described below) in order to provide the input data, and to choose a cooperative system response. As a theory, dialogue grammars are unsatisfying as they provide no explanation of the behavior they describe, i.e., why the actions occur where they do, why they fit together into a unit, etc.
Plan-based models are founded on the observation that utterances are not simply strings of words, but rather are the observable performance of communicative actions, or speech acts [Sea69], such as requesting, informing, warning, suggesting, and confirming. Moreover, humans do not just perform actions randomly, but rather they plan their actions to achieve various goals, and in the case of communicative actions, those goals include changes to the mental states of listeners. For example, speakers' requests are planned to alter the intentions of their addressees. Plan-based theories of communicative action and dialogue [AP80,App85,Car90,CL90,CP79,PA80,Sad91,SI81] assume that the speaker's speech acts are part of a plan, and the listener's job is to uncover and respond appropriately to the underlying plan, rather than just to the utterance. For example, in response to a customer's question of "Where are the steaks you advertised?", a butcher's reply of "How many do you want?" is appropriate because the butcher has discovered that the customer's plan of getting steaks himself is going to fail. Being cooperative, he attempts to execute a plan to achieve the customer's higher-level goal of having steaks. Current research on this model is attempting to incorporate more complex dialogue phenomena, such as clarifications [LA90,YI91,LA87], and to model dialogue more as a joint enterprise, something the participants are doing together [CWG86,CL91b,GS90,GK93].
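The butcher example can be rendered as a deliberately tiny sketch of plan recognition followed by obstacle detection. The plan library, the precondition table, the predicate strings, and the function names are all hypothetical and stand in for the much richer inference machinery of the cited theories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    goal: str      # the higher-level goal the plan serves
    steps: tuple   # actions the speaker intends to carry out


# Hypothetical plan library: observed speech act -> plan it likely belongs to.
PLAN_LIBRARY = {
    "ask_location(steaks)": Plan(
        goal="have(customer, steaks)",
        steps=("know_location(steaks)", "pick_up(steaks)", "pay"),
    ),
}

# Hypothetical preconditions that must hold for a step to succeed.
PRECONDITIONS = {"pick_up(steaks)": "self_service(steaks)"}


def find_obstacle(plan, world):
    """Return the first step whose precondition fails in the world, if any."""
    for step in plan.steps:
        condition = PRECONDITIONS.get(step)
        if condition is not None and not world.get(condition, False):
            return step
    return None


def respond(speech_act, world):
    """Respond to the recognized plan rather than only to the utterance."""
    plan = PLAN_LIBRARY.get(speech_act)
    if plan is None or find_obstacle(plan, world) is None:
        return "answer_literally"
    # The speaker's own plan will fail, so address the higher-level goal.
    return f"offer_to_achieve({plan.goal})"


if __name__ == "__main__":
    print(respond("ask_location(steaks)", {"self_service(steaks)": False}))
```

When steaks are not self-service, the sketch produces a response aimed at the customer's goal of having steaks, which corresponds to the butcher's "How many do you want?".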
The major accomplishment of plan-based theories of dialogue is to offer a generalization in which dialogue can be treated as a special case of other rational noncommunicative behavior. The primary elements are accounts of planning and plan-recognition, which employ various inference rules, action definitions, models of the mental states of the participants, and expectations of likely goals and actions in the context. The set of actions may include speech acts, whose execution affects the beliefs, goals, commitments, and intentions of the conversants. Importantly, this model of cooperative dialogue solves problems of indirect speech acts as a side-effect [PA80]. Namely, when inferring the purpose of an utterance, it may be determined that not only are the speaker's intentions those indicated by the form of the utterance, but there may be other intentions the speaker wants to convey. For example, in responding to the utterance "There is a little yellow piece of rubber", the addressee's plan recognition process should determine that not only does the speaker want the addressee to believe such an object exists, the speaker also wants the addressee to find the object and pick it up. Thus, the same plan-recognition process could analyze the utterance as an informative utterance, as a request to find the object, and as a request to pick it up.
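The inference chain in the yellow-rubber example can be sketched as repeated application of plan-inference rules, starting from the literal effect of the speech act. The two rule tables and the predicate strings below are hypothetical placeholders. A real plan recognizer would derive such steps from action definitions and contextual expectations rather than from a fixed lookup.

```python
# Hypothetical literal effect of each speech act on the hearer.
LITERAL_EFFECT = {
    "inform(exists(yellow_rubber))": "believe(hearer, exists(yellow_rubber))",
}

# Hypothetical plan-inference rules: if the speaker wants the key,
# the speaker plausibly also wants the value.
INFERENCE_RULES = {
    "believe(hearer, exists(yellow_rubber))": "find(hearer, yellow_rubber)",
    "find(hearer, yellow_rubber)": "pick_up(hearer, yellow_rubber)",
}


def infer_intentions(speech_act):
    """Return the chain of intentions attributed to the speaker, starting
    with the one indicated by the form of the utterance."""
    chain = []
    goal = LITERAL_EFFECT.get(speech_act)
    while goal is not None and goal not in chain:  # guard against cycles
        chain.append(goal)
        goal = INFERENCE_RULES.get(goal)
    return chain


if __name__ == "__main__":
    for intention in infer_intentions("inform(exists(yellow_rubber))"):
        print(intention)
```

The first element of the chain corresponds to the informative reading, and the later elements correspond to the indirect requests to find the object and to pick it up.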
A number of theoretical and practical limitations have been identified for this class of models.
Plan-based approaches that model dialogue simply as a product of the interaction of plan generators and recognizers working in synchrony and harmony do not explain why addressees ask clarification questions, why they confirm, or even why they do not simply walk away during a conversation. A new theory of conversation is emerging in which dialogue is regarded as a joint activity, something that agents do together [CWG86,CL91b,GS90,GK93,Loc94,Sch81,Suc87]. The joint action model claims that both parties to a dialogue are responsible for sustaining it. Participating in a dialogue requires the conversants to have at least a joint commitment to understand one another, and these commitments motivate the clarifications and confirmations so frequent in ordinary conversation.
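One simplified reading of how a joint commitment to mutual understanding could drive clarifications and confirmations is sketched below. The `UnderstandingState` record, its two fields, and the act names are assumptions made for illustration. The cited theories state such commitments in terms of beliefs and intentions rather than boolean flags.

```python
from dataclasses import dataclass


@dataclass
class UnderstandingState:
    """Hypothetical summary of where a hearer stands on one utterance."""
    hearer_understands: bool  # hearer privately believes it has understood
    mutually_believed: bool   # both parties believe understanding was reached


def hearer_next_act(state):
    """Choose the hearer's next act, assuming it is committed to reaching
    mutual understanding and not merely private understanding."""
    if not state.hearer_understands:
        # The joint goal is not yet achieved: work towards it.
        return "request_clarification"
    if not state.mutually_believed:
        # Privately achieved, but the speaker does not know that yet.
        return "confirm"
    return "continue_task"


if __name__ == "__main__":
    print(hearer_next_act(UnderstandingState(False, False)))
    print(hearer_next_act(UnderstandingState(True, False)))
    print(hearer_next_act(UnderstandingState(True, True)))
```

On this reading, a hearer who has understood privately still owes the speaker a confirmation, because the commitment concerns mutual belief and not only private belief.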
Typical areas in which such models are distinguished from individual plan-based models are the handling of reference and of confirmations. Clark and colleagues [CWG86,Cla89] have argued that actual referring behavior cannot be adequately modeled by the simple notion that speakers provide noun phrases and listeners identify the referents. Rather, both parties offer noun phrases, refine previous ones, correct misidentifications, etc. They claim that people appear to be following the strategy of minimizing the joint effort involved in successfully referring. Computer models of referring based on this analysis are beginning to be developed [HH92,Edm93]. Theoretical models of joint action [CL91b,CL91a] have been shown to minimize the overall team effort in dynamic, uncertain worlds [JM92]. Thus, if a more general theory of joint action can be applied to dialogue as a special case, an explanation for numerous dialogue phenomena (such as collaboration on reference, confirmations, etc.) will be derivable. Furthermore, such a theory offers the possibility of providing a specification of what dialogue participants should do, which could be used to guide and evaluate dialogue management components for spoken language systems. Finally, future work in this area can also form the basis for protocols for communication among intelligent software agents.