One of the key ideas in this technology is to separate processing into several stages, in ``cascaded finite-state transducers''. A finite-state automaton reads one element at a time of a sequence of elements; each element transitions the automaton into a new state, based on the type of element it is, e.g., the part of speech of a word. Some states are designated as final, and a final state is reached when the sequence of elements matches a valid pattern. In a finite-state transducer, an output entity is constructed when final states are reached, e.g., a representation of the information in a phrase. In a cascaded finite-state transducer, there are different finite-state transducers at different stages. Earlier stages will package a string of elements into something the next stage will view as a single element.
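As a minimal sketch of the automaton and transducer just described, the following Python fragment reads one (word, part of speech) pair at a time, follows a transition table, and constructs an output entity only if it ends in a final state. The state names, the tag set, and the output format are illustrative assumptions and are not taken from FASTUS.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part of speech)

# Hypothetical transition table for a tiny noun-group recognizer:
# (current state, part of speech) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("start", "DET"): "det",
    ("start", "ADJ"): "adj",
    ("start", "NOUN"): "noun",
    ("det", "ADJ"): "adj",
    ("det", "NOUN"): "noun",
    ("adj", "ADJ"): "adj",
    ("adj", "NOUN"): "noun",
    ("noun", "NOUN"): "noun",
}
FINAL_STATES = {"noun"}


def transduce(tokens: List[Token]) -> Optional[dict]:
    """Run the automaton over the whole sequence.

    Returns an output entity if the sequence ends in a final state,
    and None if a transition is missing or the end state is not final.
    """
    state = "start"
    for _, pos in tokens:
        next_state = TRANSITIONS.get((state, pos))
        if next_state is None:
            return None  # no valid transition for this element
        state = next_state
    if state not in FINAL_STATES:
        return None
    # The output packages the matched elements as a single element
    # that a later stage could consume.
    return {
        "type": "NounGroup",
        "head": tokens[-1][0],
        "text": " ".join(word for word, _ in tokens),
    }
```

For example, `transduce([("the", "DET"), ("first", "ADJ"), ("enzyme", "NOUN")])` yields a single `NounGroup` element whose head is `enzyme`, whereas a bare determiner yields `None`.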
In the approach implemented in SRI International's system called FASTUS (a slightly altered acronym of Finite-State Automaton Text Understanding System) (Hobbs et al., 1997), the earlier stages recognize smaller linguistic objects and work in a largely domain-independent fashion. They use purely linguistic knowledge to recognize that portion of the syntactic structure of the sentence that linguistic methods can determine reliably, requiring relatively little modification or augmentation as the system is moved from domain to domain. The later stages take these linguistic objects as input and find domain-dependent patterns among them.
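Typically there are five levels of processing: Complex Words, Basic Phrases, Complex Phrases, Clause-Level Domain Patterns, and Merging Structures.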
As we progress through the five levels, larger segments of text are analyzed and structured. In each of stages 2 through 4, the input to the finite-state transducer is the sequence of chunks constructed in the previous stage.
This decomposition of the natural-language problem into levels is essential to the approach. Many systems have been built to do pattern matching on strings of words. The advances in information extraction have depended crucially on dividing that process into separate levels for recognizing phrases and recognizing patterns among the phrases. Phrases can be recognized reliably with purely syntactic information, and they provide precisely the elements that are required for stating the patterns of interest.
I will illustrate the levels of processing by describing what is done on the following sentences, from a biomedical abstract.
gamma-Glutamyl kinase, the first enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
In this example, we will assume we are mapping the information into a complex database of pathways, reactions, and chemical compounds, such as the EcoCyc database developed by Karp and his colleagues at SRI International (Karp et al., 19??). In this database there are Reaction objects with the attributes ID, Pathway, and Enzyme, among others, and Enzyme objects with the attributes ID, Name, Molecular-Weight, Subunit-Component, and Subunit-Number.
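The target objects can be pictured as simple records. The sketch below is only an illustration of the attributes named above; it is not the actual EcoCyc schema, the Python field names are assumptions, and the attributes covered by ``among others'' are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enzyme:
    """Enzyme object with the attributes listed in the text."""
    id: str
    name: Optional[str] = None
    molecular_weight: Optional[int] = None
    subunit_component: Optional[str] = None  # ID of another Enzyme
    subunit_number: Optional[int] = None


@dataclass
class Reaction:
    """Reaction object; only the attributes named in the text are modeled."""
    id: str
    pathway: Optional[str] = None
    enzyme: Optional[str] = None  # ID of an Enzyme
```

Unfilled attributes default to `None`, which plays the role of the ``-'' entries in the structures shown later.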
The five phases are as follows:
1. Complex Words: This level of processing identifies multiwords such as ``gamma-Glutamyl kinase'', ``Escherichia coli'', ``3,4-dehydroproline'', and ``molecular weight''.
Languages in general are very productive in the construction of short, multiword fixed phrases and proper names employing specialized microgrammars. This is the level at which they are recognized. The biomedical language is especially rich in this regard; this in fact may be the biggest barrier to information extraction research in biological domains. On the other hand, medical informatics has been at the forefront of human language technology in building up terminological resources, and there is much good recent work in automating the building of the lexicons and in the techniques for recognizing biomedical terms (e.g., Ananiadou et al., 2002).
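One simple way to recognize fixed multiword phrases is a greedy longest match against a lexicon. The sketch below assumes a tiny hand-built lexicon (the `MULTIWORDS` set is hypothetical); a real system would draw on the terminological resources mentioned above and would need microgrammars, not just a word list, for productive forms such as chemical names.

```python
from typing import List, Set, Tuple

# Hypothetical multiword lexicon, stored in lower case.
MULTIWORDS: Set[Tuple[str, ...]] = {
    ("gamma-glutamyl", "kinase"),
    ("escherichia", "coli"),
    ("molecular", "weight"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORDS)


def group_multiwords(tokens: List[str]) -> List[str]:
    """Join adjacent tokens that form a lexicon entry, longest match first."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORDS:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

Applied to `"from an Escherichia coli strain".split()`, this returns four elements, with ``Escherichia coli'' packaged as one.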
2. Basic Phrases: At Level 2 the first example sentence is segmented into the following phrases:
| Enzyme Name: | gamma-Glutamyl kinase |
| Noun Group: | the first enzyme |
| Preposition: | of |
| Noun Group: | the proline biosynthetic pathway |
| Verb Group: | was purified |
| Preposition: | to |
| Noun Group: | homogeneity |
| Preposition: | from |
| Noun Group: | an Escherichia coli strain |
| Adjective Group: | resistant |
| Preposition: | to |
| Noun Group: | the proline analog |
| Noun Group: | 3,4-dehydroproline |
Noun groups are noun phrases up through the head noun but not including the right modifiers like prepositional phrases and relative clauses. Verb groups are head verbs with their auxiliaries. Adjective phrases are predicate adjectives together with their copulas, if present.
The noun group and verb group grammars that were implemented in FASTUS were essentially those given in the grammar of Sager (1981), converted into regular expressions.
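To give a feel for what a phrase grammar expressed as regular expressions can look like, the sketch below encodes each part of speech as one character and runs ordinary regular expressions over the resulting tag string. The tag set and the two patterns are deliberately tiny assumptions made for illustration; they are not the grammar of Sager (1981) and not the FASTUS rules.

```python
import re
from typing import List, Tuple

# One character per part of speech, so a regular expression can run
# over the tag sequence. Tag set and patterns are illustrative only.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "PREP": "P", "VERB": "V", "AUX": "X"}
NOUN_GROUP = re.compile(r"D?[AN]*N")  # optional determiner, modifiers, head noun
VERB_GROUP = re.compile(r"X*V")       # auxiliaries followed by the head verb


def chunk(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Segment (word, tag) pairs into labeled basic phrases."""
    codes = "".join(TAG_CODES[pos] for _, pos in tagged)
    chunks: List[Tuple[str, str]] = []
    i = 0
    while i < len(codes):
        for label, pattern in (("Noun Group", NOUN_GROUP), ("Verb Group", VERB_GROUP)):
            m = pattern.match(codes, i)  # anchored at position i
            if m:
                words = " ".join(word for word, _ in tagged[i:m.end()])
                chunks.append((label, words))
                i = m.end()
                break
        else:
            # Anything not covered by a group pattern passes through alone.
            label = "Preposition" if codes[i] == "P" else "Other"
            chunks.append((label, tagged[i][0]))
            i += 1
    return chunks
```

Because the patterns are greedy, ``the proline biosynthetic pathway'' comes out as a single noun group that extends through the head noun.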
This breakdown of phrases into nominals, verbals, and particles is a linguistic universal. Whereas the precise parts of speech that occur in any language can vary widely, every language has elements that are fundamentally nominal in character, elements that are fundamentally verbal or predicative, and particles or inflectional affixes that encode relations among the other elements.
3. Complex Phrases: At Level 3, complex noun groups and verb groups that can be recognized reliably on the basis of domain-independent, syntactic information are recognized. This includes the attachment of appositives to their head noun group,
the proline analog 3,4-dehydroproline
and the attachment of ``of'' prepositional phrases to their head noun groups,
the first enzyme of the proline biosynthetic pathway.
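The two attachments can be sketched as a further pass whose input elements are the phrases produced at Level 2, which is what makes the stages a cascade. The rules below (attach an ``of'' phrase, attach an immediately following noun group as an appositive) are simplified assumptions; the Enzyme Name row and punctuation are left out of the hypothetical `LEVEL2` input for brevity.

```python
from typing import List, Tuple

Phrase = Tuple[str, str]  # (label, text)

# Level 2 output for part of the first sample sentence.
LEVEL2: List[Phrase] = [
    ("Noun Group", "the first enzyme"),
    ("Preposition", "of"),
    ("Noun Group", "the proline biosynthetic pathway"),
    ("Verb Group", "was purified"),
    ("Preposition", "to"),
    ("Noun Group", "homogeneity"),
    ("Preposition", "from"),
    ("Noun Group", "an Escherichia coli strain"),
    ("Adjective Group", "resistant"),
    ("Preposition", "to"),
    ("Noun Group", "the proline analog"),
    ("Noun Group", "3,4-dehydroproline"),
]


def attach(phrases: List[Phrase]) -> List[Phrase]:
    """Fold ``of'' phrases and appositives into the preceding noun group."""
    out: List[Phrase] = []
    i = 0
    while i < len(phrases):
        label, text = phrases[i]
        if label == "Noun Group":
            while True:
                rest = phrases[i + 1:i + 3]
                if (len(rest) == 2 and rest[0] == ("Preposition", "of")
                        and rest[1][0] == "Noun Group"):
                    text = f"{text} of {rest[1][1]}"  # "of" attachment
                    i += 2
                elif rest and rest[0][0] == "Noun Group":
                    text = f"{text} {rest[0][1]}"     # appositive attachment
                    i += 1
                else:
                    break
        out.append((label, text))
        i += 1
    return out
```

Other prepositions, such as ``to'' and ``from'', are deliberately left unattached, since their attachment cannot be decided reliably on syntactic grounds alone.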
In the course of recognizing basic and complex phrases, entities and events of domain interest are often recognized, and the structures for these are constructed. In the sample text, an Enzyme structure is constructed for gamma-Glutamyl kinase. Corresponding to the complex noun group ``gamma-Glutamyl kinase, the first enzyme of the proline biosynthetic pathway,'' the following structures are built:
| Reaction: | |
| ID: | R1 |
| Pathway: | proline |
| Enzyme: | E1 |
| Enzyme: | |
| ID: | E1 |
| Name: | gamma-Glutamyl kinase |
| Molecular-Weight: | - |
| Subunit-Component: | - |
| Subunit-Number: | - |
In many languages some adjuncts are more tightly bound to their head nouns than others. ``Of'' prepositional phrases are in this category, as are phrases headed by prepositions that the head noun subcategorizes for. The basic noun group together with these adjuncts constitutes the complex noun group. Complex verb groups are also motivated by considerations of linguistic universality. Many languages have quite elaborate mechanisms for constructing complex verbs. One example in English is the use of control verbs; ``to conduct an experiment'' means the same as ``to experiment''. Another example is the verb-particle constructions such as ``set up''.
4. Clause-Level Domain Patterns: In the sample text, the domain patterns
Compound have Measure of values
Compound comprised of Compound
are instantiated in the second sentence. These patterns result in the following Enzyme structures being built:
| Enzyme: | |
| ID: | E2 |
| Name: | - |
| Molecular-Weight: | 236,000 |
| Subunit-Component: | - |
| Subunit-Number: | - |
| Enzyme: | |
| ID: | E3 |
| Name: | - |
| Molecular-Weight: | - |
| Subunit-Component: | E4 |
| Subunit-Number: | 6 |
| Enzyme: | |
| ID: | E4 |
| Name: | - |
| Molecular-Weight: | 40,000 |
| Subunit-Component: | - |
| Subunit-Number: | - |
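A rough sketch of how the two domain patterns might be matched against the phrase sequence of the second sentence follows. The typed phrase representation in `SENTENCE2`, the lemmatized verbs, and the helper names are all assumptions made for illustration. In particular, the second pattern is matched without its subject, since the subject of ``was comprised'' is shared with the first clause; that is one plausible reading of why E3 is built as a separate structure.

```python
from itertools import count
from typing import Dict, Iterator, List, Optional, Tuple

Phrase = Tuple[str, object]  # (semantic or syntactic type, value)

# Hypothetical Level 3 output for the second sample sentence.
SENTENCE2: List[Phrase] = [
    ("Compound", {"text": "the enzyme"}),
    ("Verb", "have"),
    ("Measure", "molecular weight"),
    ("Prep", "of"),
    ("Value", 236000),
    ("Conj", "and"),
    ("Verb", "comprise"),
    ("Prep", "of"),
    ("Compound", {"text": "six identical 40,000-dalton subunits",
                  "count": 6, "weight": 40000}),
]

# A pattern element is (type, required value or None for "any value").
HAVE_MEASURE = [("Compound", None), ("Verb", "have"),
                ("Measure", "molecular weight"), ("Prep", "of"), ("Value", None)]
COMPRISED_OF = [("Verb", "comprise"), ("Prep", "of"), ("Compound", None)]


def new_enzyme(ids: Iterator[int], attrs: Optional[Dict] = None) -> Dict:
    """Create an Enzyme structure with a fresh ID and empty attributes."""
    record = {"ID": f"E{next(ids)}", "Name": None, "Molecular-Weight": None,
              "Subunit-Component": None, "Subunit-Number": None}
    record.update(attrs or {})
    return record


def match_at(phrases: List[Phrase], i: int, pattern) -> Optional[List[Phrase]]:
    """Return the matched slice if the pattern fits at position i, else None."""
    window = phrases[i:i + len(pattern)]
    if len(window) < len(pattern):
        return None
    for (ptype, pvalue), (wtype, wvalue) in zip(pattern, window):
        if ptype != wtype or (pvalue is not None and pvalue != wvalue):
            return None
    return window


def extract(phrases: List[Phrase], first_id: int = 2) -> List[Dict]:
    """Instantiate both domain patterns wherever they match."""
    ids = count(first_id)
    records: List[Dict] = []
    for i in range(len(phrases)):
        m = match_at(phrases, i, HAVE_MEASURE)
        if m:
            records.append(new_enzyme(ids, {"Molecular-Weight": m[4][1]}))
        m = match_at(phrases, i, COMPRISED_OF)
        if m:
            part = m[2][1]
            whole = new_enzyme(ids)
            sub = new_enzyme(ids, {"Molecular-Weight": part["weight"]})
            whole["Subunit-Component"] = sub["ID"]
            whole["Subunit-Number"] = part["count"]
            records += [whole, sub]
    return records
```

With `first_id=2`, `extract(SENTENCE2)` produces three structures whose IDs and filled attributes correspond to E2, E3, and E4 above, with the weights held as integers.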
This level corresponds to the basic clause level that characterizes all languages, the level at which in English Subject-Verb-Object (S-V-O) triples occur, and thus again corresponds to a linguistic universal. This is the level at which predicate-argument relations between verbal and nominal elements are expressed in their most basic form.
5. Merging Structures: The first four levels of processing all operate within the bounds of single sentences. The final level of processing operates over the whole discourse. Its task is to see that all the information collected about a single entity or relationship is combined into a unified whole. This is where the problem of coreference is dealt with in this approach.
The three criteria that are taken into account in determining whether two structures can be merged are the internal structure of the noun groups, nearness along some metric, and the consistency, or more generally, the compatibility of the two structures.
In the analysis of the sample text, we have produced four enzyme structures. Three of them are consistent with each other. Hence, they are merged, yielding
| Enzyme: | |
| ID: | E1 |
| Name: | gamma-Glutamyl kinase |
| Molecular-Weight: | 236,000 |
| Subunit-Component: | E4 |
| Subunit-Number: | 6 |
The fourth is inconsistent because of the differing molecular weights and the subunit relation, and hence is not merged with the others.
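The consistency criterion alone can be sketched as follows. This is a simplified, greedy illustration under stated assumptions: it ignores the other two criteria (the internal structure of the noun groups and nearness), treats two structures as compatible when no attribute has conflicting filled values, and refuses to merge a structure with its own subunit. The function names and the `SAMPLE` data layout are hypothetical.

```python
from typing import Dict, List

Record = Dict[str, object]
FIELDS = ("Name", "Molecular-Weight", "Subunit-Component", "Subunit-Number")

# The four Enzyme structures produced for the sample text.
SAMPLE: List[Record] = [
    {"ID": "E1", "Name": "gamma-Glutamyl kinase", "Molecular-Weight": None,
     "Subunit-Component": None, "Subunit-Number": None},
    {"ID": "E2", "Name": None, "Molecular-Weight": 236000,
     "Subunit-Component": None, "Subunit-Number": None},
    {"ID": "E3", "Name": None, "Molecular-Weight": None,
     "Subunit-Component": "E4", "Subunit-Number": 6},
    {"ID": "E4", "Name": None, "Molecular-Weight": 40000,
     "Subunit-Component": None, "Subunit-Number": None},
]


def compatible(a: Record, b: Record) -> bool:
    """True if the two structures could describe the same entity."""
    # An entity cannot be identical to its own subunit.
    if a["Subunit-Component"] == b["ID"] or b["Subunit-Component"] == a["ID"]:
        return False
    # No attribute may carry two different filled values.
    return all(a[f] is None or b[f] is None or a[f] == b[f] for f in FIELDS)


def merge(a: Record, b: Record) -> Record:
    """Combine two compatible structures, keeping the ID of the first."""
    merged = dict(a)
    for f in FIELDS:
        if merged[f] is None:
            merged[f] = b[f]
    return merged


def merge_all(records: List[Record]) -> List[Record]:
    """Greedily fold each structure into the first compatible earlier one."""
    merged: List[Record] = []
    for rec in records:
        for i, existing in enumerate(merged):
            if compatible(existing, rec):
                merged[i] = merge(existing, rec)
                break
        else:
            merged.append(rec)
    return merged
```

On `SAMPLE`, the first three structures collapse into a single E1 record and E4 remains separate, since it is E1's subunit and its molecular weight conflicts with E1's.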
The finite-state technology has sometimes been characterized as ad hoc and as mere pattern-matching. However, the approach of using a cascade of finite-state machines, where each level corresponds to a linguistic natural kind, reflects important universals about language. It was inspired by the remarkable fact that very diverse languages all show the same nominal element - verbal element - particle distinction and the basic phrase - complex phrase distinction. Organizing a system in this way leads to greater portability among domains and to the possibility of easier acquisition of new patterns.