Department of Computer Science
Mississippi State University
The project combines programs developed by the Chemical Abstracts Service with the KUDZU system, a knowledge-based natural language processing system designed to extract relations from technical text. The performance of the system has been tested on 116 articles randomly selected from a set of 2,000 articles from the Journal of Physical Chemistry. Because document analysts were known to focus on selected sections of articles, experiments examined the effects of analyzing only certain segments of the articles, principally abstracts, introductions, and/or conclusions. For example, when the system processes only the title, abstract, and conclusion of each article, it produces 84% of the indexes produced by the document analysts (a recall rate of 84%) with a precision rate of 62%. That is, 38% of the indexes produced by the system were not produced by the human document analysts; this latter figure is not an error rate, since many of the "overgenerated" indexes were actually relevant. The goal at the outset of the research was 80% recall.
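The recall and precision figures above follow the usual set-based definitions. The sketch below shows the calculation on a toy example; the index terms are hypothetical and are not taken from the study, and the function name is ours.

```python
def precision_recall(system_indexes, analyst_indexes):
    """Return (precision, recall) of system indexes measured against analyst indexes."""
    system = set(system_indexes)
    analyst = set(analyst_indexes)
    matched = system & analyst  # indexes produced by both the system and the analysts
    precision = len(matched) / len(system) if system else 0.0
    recall = len(matched) / len(analyst) if analyst else 0.0
    return precision, recall


# Hypothetical toy data, for illustration only.
analyst_terms = {"heat capacity", "benzene", "Raman spectra", "diffusion"}
system_terms = {"heat capacity", "benzene", "Raman spectra", "solvent", "kinetics"}

precision, recall = precision_recall(system_terms, analyst_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Here the two system terms that the analysts did not produce lower precision, but as the text notes, such terms are not necessarily wrong.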
An important obstacle to improved performance of the system is residual error in determining the part of speech of the words in the text. Part-of-speech taggers trained on large bodies of text can achieve error rates of 3 to 5 percent. Although this project has a relatively modest training set, only about 2.8% of the text is given an incorrect part of speech that interferes with later stages of processing (parsing and knowledge extraction). Yet that 2.8% represents, on average, a serious tagging error for every two sentences.
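The step from 2.8% to "one serious error for every two sentences" can be checked with a little arithmetic. The sketch below assumes the 2.8% is measured per word token; the average sentence length it derives is implied by the two figures in the text, not stated there.

```python
def tokens_between_errors(error_rate):
    """Average number of tokens per error, given a per-token error rate."""
    return 1.0 / error_rate


serious_error_rate = 0.028   # from the text: about 2.8% of tokens
sentences_per_error = 2      # from the text: one serious error per two sentences

tokens_per_error = tokens_between_errors(serious_error_rate)
implied_sentence_length = tokens_per_error / sentences_per_error

print(round(tokens_per_error, 1))         # 35.7 tokens between serious errors
print(round(implied_sentence_length, 1))  # 17.9 tokens per sentence (implied)
```

An average of roughly 18 tokens per sentence would make the two statements consistent.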
The project distinguishes between these serious errors and other errors which do not interfere with information extraction, and is pursuing means of reducing serious tagging errors. Two underutilized sources of information concerning the part of speech of an unknown word in text are (1) a larger context than is typically used, and (2) internal cues in the word itself. In exploring these resources, we combine rule-based tagging with neural networks trained to examine a much larger context window than is typical. Moreover, our recent experiments strongly suggest that the internal characteristics of an unknown word are more powerful predictors of part of speech than was suspected. The neural networks which examine these data are very large, and experiments include architecture issues (connectivity and recurrence) as well as training versus evolution issues.
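To make the two information sources concrete, the sketch below extracts a wide context window and a few word-internal cues for a single token. It is only an illustration: the suffix list, the feature names, and the window width are hypothetical choices, not the features the project's networks use.

```python
# Hypothetical suffix inventory, chosen for illustration only.
SUFFIXES = ("ation", "ized", "ize", "ing", "ous", "ity", "ed", "ly", "ic", "s")
PAD = "<pad>"


def context_window(tokens, position, width):
    """Return the tokens from position - width to position + width, padded at the edges."""
    padded = [PAD] * width + list(tokens) + [PAD] * width
    return padded[position : position + 2 * width + 1]


def internal_features(word):
    """Return simple cues taken from the word's own spelling."""
    lower = word.lower()
    suffix = None
    # Try longer suffixes first so that "ation" wins over "s" or "ed".
    for candidate in sorted(SUFFIXES, key=len, reverse=True):
        if lower.endswith(candidate) and len(lower) > len(candidate) + 1:
            suffix = candidate
            break
    return {
        "suffix": suffix,
        "capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "length": len(word),
    }


sentence = "The protonated species undergoes rapid oxidation".split()
print(context_window(sentence, 1, width=3))
print(internal_features("protonated"))
```

Feature dictionaries of this kind could be encoded as inputs to a classifier; the encoding and the network itself are outside the scope of this sketch.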
At present, we are employing tools of chaos theory to examine the fractal and chaotic properties of language, with the goal of developing one or more iterated function systems having attractors corresponding to parts of speech.
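For readers unfamiliar with iterated function systems, the sketch below runs the standard "chaos game" on three contraction maps, each of which moves a point halfway toward one vertex of a triangle. It illustrates only what an attractor of such a system is. How maps of this kind would be tied to parts of speech is not specified in the text and is not attempted here.

```python
import random

# Three contraction maps: each pulls the current point halfway toward one vertex.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]


def chaos_game(n_points, seed=0):
    """Iterate randomly chosen contraction maps and return the visited points."""
    rng = random.Random(seed)
    x, y = 0.25, 0.25  # arbitrary starting point inside the triangle
    points = []
    for _ in range(n_points):
        vx, vy = rng.choice(VERTICES)
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


points = chaos_game(5000)
print(len(points))
```

Plotting the returned points shows them settling onto a self-similar set (the Sierpinski triangle), which is the attractor of this particular system.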
Lois Boggess and Julia Hodges. 1994. A knowledge-based approach to indexing scientific text. Proceedings of the Human Language Technology Workshop. Morgan Kaufmann. 458.
Rajeev Agarwal. 1994. (Almost) automatic semantic feature extraction from technical text. Proceedings of the Human Language Technology Workshop. Morgan Kaufmann. 378-383.
Lois Boggess, Julia Hodges, and Jose Cordova. 1995. Automated knowledge derivation: Domain-independent techniques for domain-restricted text sources. International Journal of Intelligent Systems. 10 (10) 871-893.
Rajeev Agarwal. 1995. Semantic feature extraction from technical texts with limited human intervention. Ph.D. dissertation. Mississippi State University. http://www.cs.msstate.edu/PUBLICATIONS/theses_and_dissertations.html#D1995
Sonal Kulkarni. 1995. Indexer: A tool to access index information from an object-oriented knowledge base. Master's project. Mississippi State.
Julia Hodges, Shiyun Yie, Ray Reighart, and Lois Boggess. 1996. An automated system that assists in the generation of document indexes. Natural Language Engineering. 2 (2): 127-160.
Lois Boggess and Lynellen D. S. P. Smith. 1996. But "propeller" is a verb! Automatic tagging and noun/verb confusions. FLAIRS-96: Proceedings of the Ninth Florida Artificial Intelligence Research Symposium. 511-515.
Julia Hodges, Shiyun Yie, Sonal Kulkarni, and Ray Reighart. 1997. Generation and evaluation of indexes for chemistry articles. Journal of Intelligent Information Systems. 8 (1): 57-76.
Lois Boggess and Lynellen D. S. Perry. 1997. Real world auto-tagging of scientific text. FLAIRS-97: Proceedings of the Tenth International Florida Artificial Intelligence Research Symposium. 253-257.
This research project utilizes many methods of reasoning from evidence, where the reasoning is not hard-coded into programs from the outset. Rather, the algorithms employed allow the system to tune itself toward optimal performance on the basis of exposure to many examples. Such algorithms "learn" from this exposure. Our research has used and continues to use a range of such algorithms, from classification using Bayesian probability to rule-based machine learning, neural networks, genetic algorithms, and evidential reasoning. Much of the evidence, both at the surface level and at highly conceptual levels, involves features which are not only not independent, but strongly context dependent as well. To take a very simple example of context dependency, one's expectations of the subject matter of a paper on the web which contains the word "chair" change if "chair" is immediately preceded by "endowed" versus "Louis XIV" or "could".
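The "chair" example can be made concrete with a small count-based estimate of the topic given the preceding word. The observations and topic labels below are hypothetical, and the add-one smoothing is an assumption made for the sketch; it is not presented as the project's method.

```python
from collections import Counter

# Hypothetical observations: (word preceding "chair", topic of the page).
OBSERVATIONS = [
    ("endowed", "academia"), ("endowed", "academia"), ("endowed", "academia"),
    ("XIV", "furniture"), ("XIV", "furniture"), ("XIV", "academia"),
    ("could", "meetings"), ("could", "meetings"), ("could", "academia"),
]
TOPICS = sorted({topic for _, topic in OBSERVATIONS})


def topic_given_context(previous_word, alpha=1.0):
    """Estimate P(topic | previous word) from counts, with add-alpha smoothing."""
    counts = Counter(topic for prev, topic in OBSERVATIONS if prev == previous_word)
    total = sum(counts.values()) + alpha * len(TOPICS)
    return {topic: (counts[topic] + alpha) / total for topic in TOPICS}


for word in ("endowed", "XIV", "could"):
    estimate = topic_given_context(word)
    print(word, max(estimate, key=estimate.get))
```

The same word "chair" leads to a different most likely topic for each preceding word, which is the context dependency the paragraph describes.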
The challenge is to find the best predictors possible in a problem space which contains a very large set of features interrelated in complex but meaningful patterns. More properly, since the problem is far too large to search for a provably optimal solution, the challenge is to find a "good enough" solution with a reasonable investment of time.
Algorithms based on Bayesian probability have the advantage of being founded on a model which has been thoroughly studied and is well understood. The learning techniques mentioned above share the characteristic of searching problem spaces by directing their exploration for solutions according to functions which compare the relative goodness of states of the system. The symbolic machine learning techniques are used for those parts of our system which benefit from the ability of humans to inspect, revise, or otherwise interact with the system. Genetic algorithms are especially known for efficient parallel searches for solutions in large problem spaces, and neural networks can discover and utilize complex interrelationships of multiple features.
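As an illustration of search guided by a function that compares the goodness of states, the sketch below is a minimal genetic algorithm over bit strings, where each bit says whether a feature is included. The fitness function, the set of "useful" features, and all parameter values are hypothetical; this is not the project's implementation.

```python
import random


def evolve(fitness, n_bits, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    """Return the fittest bit string found by a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by fitness and keep the better half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Hypothetical fitness: reward selecting "useful" features, penalize the rest.
USEFUL = {0, 3, 4}


def toy_fitness(mask):
    return sum(1 if i in USEFUL else -1 for i, bit in enumerate(mask) if bit)


best = evolve(toy_fitness, n_bits=8)
print(best, toy_fitness(best))
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.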
Language almost never repeats itself on a macroscopic level, and yet it is self-similar, replete with subpatterns of tantalizing similarity and endless variety. Accordingly, we are exploring the application of the tools of dynamical systems to text. We also anticipate extending some of our present tools, especially our neural networks, to deal with the fractal and chaotic properties of language.
Special issue on Natural Language Processing in Communications of the ACM. January 1996, 39 (1).
James Allen. 1995. Natural Language Understanding (2nd ed). Benjamin/Cummings.
Eugene Charniak. 1993. Statistical Language Learning. MIT Press.
Kenneth Church and Robert Mercer. 1993. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics 19.
Adaptive Human Interfaces, Intelligent Interactive Systems for Persons with Disabilities