Ronald M. Kaplan
Xerox Palo Alto Research Center, Palo Alto, California, USA
A formal language is a set of strings (sometimes called sentences) made up by concatenating together symbols (characters or words) drawn from a finite alphabet or vocabulary. If a language has only a finite number of sentences, then a complete characterization of the set can be given simply by presenting a finite list of all the sentences. But if the language contains an infinite number of sentences (as all interesting languages do), then some sort of recursive or iterative description must be provided to characterize the sentences. This description is sometimes given in the form of a grammar, a set of pattern-matching rules that can be applied either to produce the sentences in the language one after another or else to recognize whether a given string belongs to the set. The description may also be provided by specifying an automaton, a mechanistic device that also operates either to produce or recognize the sentences of the language.
Languages have been categorized according to the complexity of the patterns that their sentences must satisfy, and the basic classifications are presented in all the standard textbooks on formal language theory, e.g., [HU79]. The sentences of a regular language, for example, have the property that what appears at one position in a string can depend only on a bounded amount of information about the symbols at earlier positions. Consider the language over the alphabet a, b, c whose sentences end with a c and contain an a at a given position only if there is an earlier b. The strings cccc and bac belong to this language but abc does not. This is a regular language since it only requires a single bit to record whether or not a b has previous appeared. On the other hand, the language whose sentences consist of some number of a's followed by exactly the same number of b's is not a regular language since there is no upper bound on the number of a's that the allowable number of b's depends on. This set of strings belongs instead to the mathematically and computationally more complex class of context-free languages.
A regular language can be described by
grammars in various notations that are known to be equivalent in
their expressive power---they each can be used to describe all and
only the regular languages. The most common way of specifying a
regular language is by means of a regular
expression, a formula that indicates the
order in which symbols can be concatenated, whether there are
alternative possibilities at each position, and whether substrings
can be arbitrarily repeated. The regular expression
denotes the regular language
described above. In the notation used here,
concatenation is represented by sequence in the
formula, alternatives are enclosed in braces and
separated by vertical bars, asterisks (often called
the Kleene closure operator) indicate
that strings satisfying the previous subexpressions can be freely
repeated, and
denotes the empty string, the string
containing no elements.
The regular languages are also exactly those languages that can be
accepted by a particular kind of automaton, a finite-state
machine. A finite-state
machine (fsm) consists of a finite number of states and a function
that determines transitions from one state to another as symbols are
read from an input tape. The machine starts at a distinguished
initial state with the tape positioned at the first symbol of a
particular string. The machine transitions from state to state as it
reads the tape, eventually coming to the end of the string. At that
point, if the machine is in one of a designated set of final
states, we say that the machine has accepted the string or
that the string belongs to the language that the machine
characterizes. An fsm is often depicted in a state-transition
diagram where circles representing the states are connected by arcs
that denote the transitions. An arrow points to the initial state
and final states are marked with a double circle. The fsm,
shown in Figure
, accepts the language
:
Figure: Finite-state machine diagram.
Because of their mathematical and computational simplicity, regular languages and finite-state machines have been applied in many information processing tasks. Regular expressions are often used to specify global search patterns in word-processors and in operating-system utilities such as Unix' Grep. The lexical analysis component of most modern programming language compilers is defined as a finite-state machine that recognizes identifier classes, punctuation, numbers, etc. [AU78]. But until relatively recently, finite-state techniques have not been widely used in natural language processing. This is in large measure the result of Chomsky's argument [Cho57] that natural language sentences are so rich and complicated that they cannot be accurately characterized by means of regular or finite-state descriptions. One consequence of this was that a generation of linguists and computational linguists came to believe that finite-state techniques were of little or no interest.
It may well be true that complete and accurate natural language syntactic and semantic dependencies lie beyond the power of finite-state description, but work in the last twenty years (and particularly in the last five years) has identified a number of important problems for which very efficient and effective finite-state solutions can be found.
One set of solutions relies on the observation that finite-state
descriptions can provide an approximation to the proper grammatical
description of a language, an approximation that is often good enough
for practical purposes. The information
extraction problem, for example,
requires that documents and passages be identified that are likely to
contain information relevant to a given user's needs. Full-blown
syntactic and semantic analyses of documents would certainly help to
solve this problem. But such analyses may provide much more
information than this task actually requires. Indeed,
[AHB
93] have constructed a finite-state solution that
is extremely efficient compared to more powerful but more complex
extraction systems and also has very favorable recall and precision
scores. Their finite-state pattern specifications can only
approximate a complete analysis of a text, but the approximation
seems close enough for this particular purpose.
There has also been growing interest in using finite-state machines for storing and accessing natural language dictionaries. [AJ88] observed that the words in a large lexicon can be very compactly represented in a state-transition graph. This is because the graph can be transformed using determinization and minimization techniques that are well-known from finite-state theory, with the result that prefixes and suffixes common to many words are collapsed into a single set of transitions. [LK93] discuss access methods for finite-state dictionary representations that permit efficient retrieval of translation and synonymy information associated with particular words.
A third set of problems require that strings be systematically transformed into other strings. For example, the negative prefix in in the abstract phonological representation such as in+practical must be realized with an assimilated nasal as in impractical, or the inflected form stopping must be mapped either to its stem stop (for use in information retrieval indexing) or to its morphological analysis stop+PresentParticiple as an initial step in further processing. One formalism for describing these kinds of string transformations are context-sensitive rewriting rules as discussed for example by [CH68]. [Joh72] and later [KK81,KK94] showed that for each rule of this type there is a finite-state transducer that maps input strings to exactly the same outputs that the rule prescribes. A transducer is a simple extension of a basic finite-state machine: as it reads its input tape and makes a transition from one state to another, it can also write a corresponding symbol on a second tape. When it reaches a final state and the end of the input, the string produced on the second tape is taken to be the output that the input string is mapped to.
Mathematically, the collection of input-output string-pairs for a given transducer (corresponding perhaps to a particular rule) constitutes a regular relation. Regular relations share many (but not all) of the formal properties of the regular languages (for example, closure under union and concatenation) and also enjoy certain other properties. In particular, the regular relations are closed under composition, and Kaplan and Kay use this fact to show that the effect of an entire collection of rewriting rules making up a phonological or morphological grammar can be implemented as a single finite-state transducer. This device is much more efficient than any scheme using the rules directly for performing the string transformations that the grammar describes. [Kos83] proposed a different rule notation, called two-level rules, for characterizing phonological and morphological variations. These rules also denote only regular relations and can be transformed to equivalent transducers.
In the years since Chomsky's original criticism of finite-state techniques used for syntactic description, our understanding of their mathematical and computational properties has increased substantially. Of most importance, however, is our increased awareness of the wide range of natural language processing problems that they can fruitfully be applied to. Problems such as dictionary access and morphological analysis seem to be inherently finite-state in character, and finite-state solutions for these problems are complete as well as efficient. For applications such as information extraction, finite-state approaches appear to provide extremely useful approximate solutions. An exciting body of current research is exploiting more sophisticated finite-state techniques to improve the accuracy of approximate syntactic analyses without sacrificing their processing efficiency (e.g., [Kos90,VT93]). Given these improvements, we expect to see finite-state techniques incorporated into a growing number of practical language processing systems.