SRI International developed FASTUS, an information extraction system for application to general information extraction tasks. The name is a permuted acronym standing for ``Finite State Automata-based Text Understanding System.'' The choice of acronym is somewhat misleading, however, because FASTUS is an information extraction system, not a text understanding system. Information extraction is the much simpler and more tractable problem. It is characterised by a relatively straightforward specification of the information to be extracted, one that changes slowly over time, if at all. Only a fraction of the text is relevant to the extraction task, and the author's underlying goals and nuances of meaning are of little interest. In contrast, the task in text understanding is to recover all of the information in a text, including that which is only implicit in what is actually written. All the richness of natural language becomes fair game, including metaphor, metonymy, discourse structure, and the recognition of the author's underlying intentions, and the full interplay between language and world knowledge becomes central to the task.
Text understanding is extremely difficult and presents a number of research problems that have not yet been adequately solved. The relative simplicity of the information extraction task, on the other hand, means that the full complexity of natural language need not be confronted head-on. Much simpler mechanisms can solve the more constrained problem, and can do so in a computationally efficient and conceptually elegant way. This insight led to the development of the FASTUS system, which was applied to the task of extracting information from articles about terrorism in Latin America for the MUC-4 evaluation [Hobbs et al., 1992; Appelt et al., 1993].
In contrast to NL-processing systems designed for text understanding applications, FASTUS does not do a complete syntactic and semantic analysis of each sentence. Instead, sentences are processed by a sequence of nondeterministic finite-state transducers. The output of each level of transducers becomes the input to the next level. Each level of processing produces some new linguistic structure, and perhaps discards some information that is irrelevant to the information extraction task. The nondeterminism of the transducers makes it possible to produce local analyses of fragments of the input and to combine them into a complete analysis. The complete structure of each sentence therefore need not be determined when the effort of producing it has little payoff for the task at hand. The nondeterminism can also be exploited to produce competing analyses of portions of the text. These analyses can be compared and the best one selected for processing at subsequent levels, which reduces the combinatorial complexity of those levels.
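The cascade described above can be illustrated with a small sketch. It is not the FASTUS implementation: the three levels, the toy lexicon, the tag names, and the slot names (`perpetrator`, `action`, `target`) are all assumptions chosen for illustration. Nondeterminism is approximated by enumerating every candidate noun group at a position and keeping the longest.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (label, text)

# Toy lexicon; the words and tags are illustrative assumptions.
LEXICON = {
    "the": "DET", "a": "DET",
    "armed": "ADJ", "local": "ADJ",
    "guerrillas": "N", "mayor": "N", "embassy": "N",
    "attacked": "V", "kidnapped": "V", "bombed": "V",
    "yesterday": "ADV",
}


def tag_words(sentence: str) -> List[Token]:
    """Level 1: map each word to a (tag, word) pair; unknown words get 'UNK'."""
    return [(LEXICON.get(w.lower(), "UNK"), w) for w in sentence.split()]


def noun_group_ends(tokens: List[Token], start: int) -> List[int]:
    """Return every index at which a DET? ADJ* N+ group starting at `start` could end."""
    i = start
    if i < len(tokens) and tokens[i][0] == "DET":
        i += 1
    while i < len(tokens) and tokens[i][0] == "ADJ":
        i += 1
    ends = []
    while i < len(tokens) and tokens[i][0] == "N":
        i += 1
        ends.append(i)  # each additional noun yields a competing analysis
    return ends


def group_phrases(tokens: List[Token]) -> List[Token]:
    """Level 2: build noun groups (NG) and verb groups (VG); drop everything else."""
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        candidates = noun_group_ends(tokens, i)
        if candidates:
            end = max(candidates)  # keep the longest competing analysis
            out.append(("NG", " ".join(w for _, w in tokens[i:end])))
            i = end
        elif tokens[i][0] == "V":
            out.append(("VG", tokens[i][1]))
            i += 1
        else:
            i += 1  # discard material irrelevant to the extraction task
    return out


def match_events(phrases: List[Token]) -> List[Dict[str, str]]:
    """Level 3: turn each NG VG NG sequence into a raw template."""
    templates = []
    for i in range(len(phrases) - 2):
        (t1, subj), (t2, verb), (t3, obj) = phrases[i:i + 3]
        if (t1, t2, t3) == ("NG", "VG", "NG"):
            templates.append({"perpetrator": subj, "action": verb, "target": obj})
    return templates


def run_cascade(sentence: str) -> List[Dict[str, str]]:
    """Feed the output of each level to the next level."""
    data = sentence
    for level in (tag_words, group_phrases, match_events):
        data = level(data)
    return data


print(run_cascade("The armed guerrillas attacked the mayor yesterday"))
```

Note that no level builds a full parse of the sentence: the adverb is simply dropped at the second level, and the third level sees only the three phrase groups it needs.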
When the transducer for the final level enters a final state, the result is a ``raw template'' that is unified with other raw templates from the current and previous sentences. Finally, a post-processor transforms the raw templates into the form required by the specifications of the task.
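The unification step can be pictured with the following minimal sketch. The text above does not spell out the criteria under which two raw templates unify, so the sketch assumes the simplest rule: two templates unify when no slot holds conflicting values, and the result carries the union of their filled slots. The slot names and the helper names `unify` and `merge_templates` are hypothetical.

```python
from typing import Dict, List, Optional

Template = Dict[str, Optional[str]]  # slot name -> value, None if unfilled


def unify(a: Template, b: Template) -> Optional[Template]:
    """Return the merged template, or None if any slot conflicts.

    Assumption: slots conflict only when both are filled with different values.
    """
    merged = dict(a)
    for slot, value in b.items():
        if value is None:
            merged.setdefault(slot, None)
            continue
        existing = merged.get(slot)
        if existing is not None and existing != value:
            return None  # incompatible fillers: the templates describe different events
        merged[slot] = value
    return merged


def merge_templates(raw: List[Template]) -> List[Template]:
    """Unify each raw template with the first compatible earlier one, in text order."""
    merged: List[Template] = []
    for template in raw:
        for i, earlier in enumerate(merged):
            combined = unify(earlier, template)
            if combined is not None:
                merged[i] = combined
                break
        else:
            merged.append(dict(template))  # nothing compatible: start a new template
    return merged


raw_templates = [
    {"action": "bombing", "target": "embassy", "perpetrator": None},
    {"action": "bombing", "perpetrator": "guerrillas"},
    {"action": "kidnapping", "target": "mayor"},
]
for template in merge_templates(raw_templates):
    print(template)
```

Here the first two raw templates combine into a single bombing template with both target and perpetrator filled, while the kidnapping template conflicts on the `action` slot and is kept separate.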
The basic architecture of the MUC-5 system has evolved in only relatively minor ways from the MUC-4 system. The primary difference between the MUC-4 FASTUS system and JV-FASTUS, our MUC-5 system, is the addition of a user interface to facilitate the rapid development of the system in a new domain. When we developed the MUC-4 FASTUS system, we had extensive experience working in the terrorism domain, since we had adapted the TACITUS system to work in that domain for MUC-3. Before this year, two questions remained open: whether the FASTUS system provided the basic tools necessary to develop a new information extraction system from scratch in a very short period of time, and whether the FASTUS approach would be successful with languages significantly different from English. We believe that our MUC-5 experience enables us to answer both of these questions with a confident ``yes.''