Methodology for Knowledge Acquisition

Jerry R. Hobbs
SRI International
Menlo Park, California

and

Costanza Navarretta
Center for Sprogteknologi
Københavns Universitet
Copenhagen, Denmark

Many proposed methodologies for knowledge acquisition can be usefully viewed as consisting of a bottom-up and a top-down phase. In the bottom-up phase, various sorts of data are examined to determine the knowledge that must be encoded. In the top-down phase, the knowledge structures that already exist are used to constrain how that knowledge is to be expressed.

Three kinds of data have been used for the bottom-up phase. The first is citations in a corpus, that is, the uses of words in particular contexts. In order to use this data, one must choose the corpus and choose the words. This is done differently by different researchers. Fillmore and Atkins chose the single word "risk", although this is seen as illustrative of a large class of words. They examined all the uses of this word in a large, mixed, general corpus. Hobbs advocates choosing a corpus associated with some particular domain or application, or perhaps a set of corpora associated with a set of domains. A single text or a small number of texts are selected from the corpora to serve as target texts. The words chosen are then just the words occurring in the target texts. The citations that are examined are all the uses of these words in all of the corpora. Lenat also chooses a target text from a corpus, in his case, one article from a small desk encyclopedia. He examines the uses of the words in this target text only in the target text itself, not in the corpus as a whole.

The citations are then used to generate, or at least suggest, the knowledge to be encoded. Fillmore does this by classifying the citations into those exemplifying similar patterns. Hobbs asks what knowledge underlies or justifies the linguistic contexts the words occur in, in particular, the predicate-argument relations and the coreference relations. Lenat asks more intuitively what knowledge is presupposed by the text. For Fillmore and Hobbs, the task is to uncover precisely the right knowledge. For Lenat, it is to specify as much knowledge as possible.

The second kind of data used in the bottom-up phase is dictionary definitions. One is better off if one uses a good dictionary, such as the Cobuild Dictionary. A dictionary definition is generally derived from a set of citations. It can be viewed as a lexicographer's abstract characterization of these citations. In a sense, then, it reduces to the first type of data.

A third type of data has been used only in Dahlgren's methodology. People are asked to list all of the facts that they know or believe about some concept. This knowledge is then encoded into the knowledge base. This data can also be viewed as data of the first type: people are generating citations for the word for the occasion.

In the top-down phase of knowledge acquisition, the existing knowledge structures are used to constrain how the knowledge is expressed. The user is presented with a set of common patterns with which one can define words by specifying parameters, or in analogy with which one can define the words. This helps to enforce consistency in the knowledge base. These should of course be the right patterns, so that one is not forced to enter wrong information.
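As a minimal sketch of what such pattern-based definition might look like, consider the following; the pattern names, slots, and resulting axiom syntax here are our own illustrative inventions, not any particular system's:

    # A minimal sketch of pattern-based word definition. The patterns,
    # slot names, and the axiom syntax they produce are hypothetical.
    PATTERNS = {
        # "X damages Y": an agent or event degrades a component.
        "cause-harm": "{agent}(x) & {verb}(x,y) --> degraded(y)",
        # "X is a kind of Y": simple taxonomic placement.
        "is-a": "{word}(x) --> {supertype}(x)",
    }

    def define_word(pattern_name: str, **slots: str) -> str:
        """Instantiate a stored pattern by filling in its parameters."""
        return PATTERNS[pattern_name].format(**slots)

    # A user defining "scratch" by analogy with the cause-harm pattern:
    print(define_word("cause-harm", agent="event", verb="scratch"))
    # event(x) & scratch(x,y) --> degraded(y)

The point of such a tool is that the user chooses among vetted patterns rather than inventing ad hoc representations.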
Since the existing knowledge structures may not be adequate for the new knowledge to be defined, it is desirable for the user to be able to add to the kinds of knowledge structures, or at least to escape from the given ontology.

The top-down phase will of course be only as sophisticated as the representation scheme that is used. There are several varieties of representation schemes and kinds of knowledge that have been used. Most systems have a sort hierarchy in which concepts are placed. Ideally, this hierarchy should emerge from a study of the data. One has the feeling, however, that these hierarchies are often pre-determined---pre-empirical, so to speak---and that they consequently force on the knowledge enterer choices that are inappropriate to the domain. For example, some systems force a top-level choice for verbs among physical, mental, and social, whereas many verbs can be used equally well in all these senses. In a way, these are then not characterizations of words, but of word uses. Most systems also specify the selectional constraints that verbs and other predicates impose on their arguments, generally expressed in terms of categories in the sort hierarchy. Several systems (Dahlgren, Lenat, Fillmore) have frame-like structures which allow them to specify more complex information. Finally, there is a miscellaneous class of kinds of knowledge, including common coercions, telic information (Calzolari), the Vendler classification of verbs (Dahlgren), and primary and secondary dimensions of objects for use by dimensional adjectives (Lilog). Hobbs does not distinguish among these various types of knowledge, but uses predicate calculus axioms to represent them all.

We have attempted to apply this method to several words in a small corpus, and it has led us to an elaboration of the steps of the method. We began by choosing a corpus. We rejected using a corpus of petrochemical patents and a corpus of medical discharge reports, because these texts were too dominated by arcane domain knowledge we did not understand. We rejected using a corpus of isolated pages extracted from mechanical patents, because we were afraid that the necessary context for understanding passages would sometimes be missing. We decided on a set of passages from automobile owners' manuals on towing, changing tires, and changing oil, since these seemed more accessible to us and more typical of texts in general. This corpus was not ideal, since many of the texts were translations or were written originally in English by non-native speakers.

Our next task was to choose a target text, a single paragraph whose vocabulary we would attempt to define. Many paragraphs were rejected because the vocabulary was almost entirely technical. Some other paragraphs were rejected because they referred primarily to aspects of the outside world that were very little discussed in the rest of the text. We finally picked a paragraph that was, in a sense, at the boundary between the car and the world, a paragraph about precautions to exercise when having the vehicle towed.

The method as we used it involved nine steps. We will elaborate on them here, illustrating each step with the example of the word "damage". We will also discuss at each step the tools we wished we had had to ease the task.

1. We first look for all occurrences of a word in the corpus. For us, since we used grep, the citations consisted of single lines. This generally gave us an inadequate amount of context, so the first thing we had to do was look at the corpus to recover the full sentence, and in some cases to look at previous sentences to resolve anaphora. Two examples of citations for the word "damage" are

    Proper equipment will help ensure that your vehicle is not damaged while being towed.

and

    Operating with insufficient amount of oil can damage the engine, and such damage is not covered by warranty.

The tool that would have helped at this phase is a fairly simple concordance program that would give us sentences instead of lines. In addition, it would have been good to be able to call up the previous sentence and the following sentence in some cases.
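A minimal sketch of such a sentence-level concordance, assuming only that the corpus is a directory of plain text files (the sentence splitting here is deliberately naive):

    import re
    from pathlib import Path

    def concordance(word, corpus_dir, context=0):
        """Yield whole sentences containing the word, optionally with
        neighboring sentences for context. Matching is by prefix, so
        "damage" also finds "damaged" and "damages"."""
        pattern = re.compile(r"\b" + re.escape(word), re.IGNORECASE)
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text()
            # Naive split on sentence-final punctuation plus whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", text)
            for i, sentence in enumerate(sentences):
                if pattern.search(sentence):
                    lo = max(0, i - context)
                    hi = min(len(sentences), i + context + 1)
                    yield " ".join(sentences[lo:hi])

    # All citations for "damage", with one sentence of context on each side:
    for citation in concordance("damage", "manuals/", context=1):
        print(citation, "\n")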
2. We next reduced the citations to their predicate-argument relations. For us this amounted to rewriting the citations on a separate piece of paper in a stripped-down form. For the word "damage" we went through an interesting transformation while doing this step. Initially, we wrote down entries like

    damage vehicle
    damage engine
    result in damage
    cause damage

Eventually, it occurred to us that for X to cause damage in Y is for X to damage Y, and that we should have been recording the causes of the damage, even in cases where we may need to read through several operators. For example, in

    Avoid overfilling, or the engine could be damaged.

we want to record the stripped-down citation

    overfilling damages engine.

The first tool that this experience suggests is something that allows one to click on the predicate and the arguments in a citation and have the stripped-down citation automatically placed in a separate buffer. But since we may have to go back and revise what we have done later in the investigation, it would also be nice to be able to click on a stripped-down citation and get back to the full citation, so that we can revise or expand the stripped-down citation. A more sophisticated tool would use a parser to find the predicate-argument relations the target word occurs in.

3. The next step is to divide the stripped-down citations, or predicate-argument patterns, into heaps, according to our rough intuitions about what things go together. Thus, we had three separate heaps for "damage". One contained, for example,

    antifreeze damages paintwork
    insufficient oil damages engine
    driving with deflated tire damages tire
    incorrect towing damages transmission

The second heap contained

    prevent damage
    ensure no damage
    risk damage

The third heap contained

    serious damage
    expensive damage
    damage beyond repair
    damage covered by warranty

In fact, these heaps were simply different areas of the page that we wrote our stripped-down citations on. The tool needed at this stage is therefore another simple text-editing tool, one that allows us to divide the buffer into different heaps, to move citations into and between heaps, and to join and split heaps. Where the knowledge base is going to be used in a natural language system that uses weighted abduction, it is possible that the size of these heaps can be used to determine initial values for the weights (a suggestion due to Mark Moens).
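A minimal sketch of how heap sizes might seed such weights; the proportional normalization chosen here is our own illustrative assumption, not a worked-out proposal:

    # Hypothetical heaps of stripped-down citations for "damage".
    heaps = {
        "cause-of-damage": ["antifreeze damages paintwork",
                            "insufficient oil damages engine",
                            "driving with deflated tire damages tire",
                            "incorrect towing damages transmission"],
        "damage-is-bad":   ["prevent damage", "ensure no damage",
                            "risk damage"],
        "cost-of-repair":  ["serious damage", "expensive damage",
                            "damage beyond repair",
                            "damage covered by warranty"],
    }

    total = sum(len(citations) for citations in heaps.values())

    # Initial abduction weight for the axiom behind each heap, proportional
    # to how often that use was attested in the corpus.
    weights = {name: len(cs) / total for name, cs in heaps.items()}
    print(weights)
    # roughly {'cause-of-damage': 0.364, 'damage-is-bad': 0.273,
    #          'cost-of-repair': 0.364}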
4. The next step is to come up with an abstract characterization of the facts about the word that justify each of the heaps. This is a matter of making explicit the intuitions that underlay the classification into categories in step 3. In fact, steps 3 and 4 will often be done in tandem. Recognizing a more abstract characterization may lead us to join two heaps, and failure to find a single abstract characterization may lead us to split a heap.

For the word "damage" we came up with the following abstract characterizations. For the first heap,

    incorrect procedure damages component

For the second,

    damage is bad.

For the third,

    damaged components need to be repaired, and repairs cost money.

In general, the user must use his or her intelligence in this step, but there are tools that could make suggestions. The typical knowledge-based natural language system has a sort hierarchy and selectional constraints and perhaps common coercions for predicates. This suggests two aids. First, if the target word is the predicate, determine from the sort hierarchy what supercategories dominate most of the arguments in a heap, and suggest that as the abstract characterization of the heap. Thus, if "transmission", "engine", and so on, were categorized as auto parts, then the pattern "... damage auto part" would have been suggested. Second, if the target word is an argument in the pattern, suggest the selectional constraints and coercions associated with the predicate. Thus, in "prevent E" and "risk E", there is the presupposition that E is something bad or is coercible into something bad.

Steps 1 through 4 constitute the bottom-up, or data-directed, phase of the acquisition process. They result in a rough characterization of the facts that we want to enter into the knowledge base. The remaining steps constitute the top-down phase and involve entering this knowledge into the system in a way consistent with the way the knowledge already there is encoded.

5. The next step is to gain access to the core theory where the words in the informal statements are defined, or where equivalent terms are defined. For the word "damage" our first guess is that the appropriate theory is something more general than "Vehicles". It should rather be something like "Artifacts", since damage can occur to components of artifacts in general.

One can imagine two tools for doing this. The first would give a kind of map of the core theories, labeling each to indicate the kinds of predicates it makes available, and indicating relations of instantiation with other core theories. The nodes in this network of theories could be expanded on request to give fuller descriptions of the topic of each theory, the predicates central to the theory, and examples of words defined in terms of the theory. The second tool would work from the other direction. It would present the user with the names of the core theories that contain definitions of the words used in the informal statements of facts. The user could then request the more elaborate description of these theories.

6. The next step is to generalize the concept as much as possible. One should find the highest-level theory to which the concept applies, and the word should be defined within that theory. For the word "damage" we generalized to the core theory of "Goal-Directed Systems". In order for the system to suggest higher-level theories, it must have the instantiation relationships among theories explicitly represented. It must know, for example, that Goal-Directed Systems are an instance of Systems, that Artifacts are an instance of Goal-Directed Systems, and that Vehicles are an instance of Artifacts.
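A minimal sketch of such a representation, assuming nothing more than a table of instance-of links between theory names:

    # Hypothetical instance-of links among core theories; each theory
    # names the more abstract theory it instantiates.
    INSTANCE_OF = {
        "Vehicles": "Artifacts",
        "Artifacts": "Goal-Directed Systems",
        "Human Bodies": "Goal-Directed Systems",
        "Social Systems": "Goal-Directed Systems",
        "Goal-Directed Systems": "Systems",
    }

    def generalizations(theory):
        """Walk upward through instance-of links, most specific first."""
        chain = []
        while theory in INSTANCE_OF:
            theory = INSTANCE_OF[theory]
            chain.append(theory)
        return chain

    # Candidate levels at which to define "damage", starting from Vehicles:
    print(generalizations("Vehicles"))
    # ['Artifacts', 'Goal-Directed Systems', 'Systems']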
7. In order to determine if a higher-level theory is the appropriate one at which a word should be defined, other instance core theories should be examined to see if the concept is appropriate at that level. For the word "damage", we generalized immediately from Vehicles to Artifacts because it is obvious that artifacts of any kind can be damaged. To see if we should generalize from Artifacts to Goal-Directed Systems, we examined other core theories that instantiated Goal-Directed Systems, namely, organizations and other social systems, cognitive systems, human bodies, trees, and so on. Among the things that can be damaged are a reputation and a blood vessel (especially if looked at from the physician's point of view). Natural objects can be damaged, but only if they are considered from a functional point of view. We would not say a stalk of wheat is damaged when we cut it down to harvest it, since the normal functional point of view for wheat is its role as food. We can speak of damage to a tree, but we are either viewing the tree as a goal-directed system whose goal is to grow and propagate, or we are viewing it from a human-oriented economic or aesthetic perspective. It sounds strange to say that a mountain is damaged, but we can say it if we view the mountain from an economic or aesthetic perspective. The notion of damage is no longer appropriate if we generalize from Goal-Directed Systems to Systems, where there is no notion of functionality. If two asteroids pass near each other and alter each other's orbits, we would not say the solar system has thereby been damaged.

It may be that while the word is appropriate for most instance core theories of an abstract core theory, there are one or two for which it does not sound natural. For example, it does not sound right to say that a person hit by a car was damaged. The person was rather injured. There are three possible explanations of (and responses to) this situation. The first is that the word "injure" has driven out the word "damage", since it applies more specifically to people. That is, both "damage" and "injure" are interpretable in this situation, but since "injure" is more specific, it has become the conventional way to refer to damage to humans. In this case we need not explicitly exclude the possibility of damage to humans, but the generation component of the natural language system must have a way of favoring the more specific, more conventional modes of expression. The second way out is to simply stipulate that the word "damage" is not appropriate when applied to humans, by means of an axiom like

    damage(x,y) --> nonhuman(y)

The third way out is to deny that a human being, when viewed as a whole rather than, say, in terms of his or her role in a social organization, is a component in a goal-directed system; the person then has no function and hence cannot be damaged.

The tool associated with this step in the knowledge acquisition process is one which would allow one to navigate among core theories via "instance-of" relations, and again to click on theory labels to get to a full description of the predicates of the theories and the words defined in terms of them. Another level of detail would be the axioms themselves. Something else that can happen at this stage is a certain amount of consistency checking. The system could examine instance core theories and point out to the user that if a word is defined at this level, these are some of the consequences.

8. We have now decided on the level of core theory at which the concept should be defined. We next need to gain access to the predicates in this core theory in order to define our words. For the word "damage", we have decided on the core theory of Goal-Directed Systems. This relies on a core theory of Plans, which gives us access to the structure of a plan by providing functions for the top goal of a plan and for the subgoals of the plan. A goal-directed system is then a system with a plan. The components of the system are subsystems or individual components, and the functions of a component are the goal eventualities which involve that component as an argument.

We can then define the relevant words on the lexical periphery. "Damage" is a change in the structure of a component of a goal-directed system that causes the component not to perform its function in the system. A component is intact if it performs its function. A procedure acting on the system is some event or sequence of events that changes the state of the system or some of its components. A correct procedure is one which does not cause the system not to function. Something is bad with respect to a goal if it causes that goal not to come about. To repair something is to change it from being not intact to being intact.

Once we have axiomatized the core theory properly and defined the words in these ways, it should be possible to prove the following statements as theorems:

    It is possible for an incorrect procedure to damage a component.
    A correct procedure causes no damage.
    Damage is bad.
    Damage to a component may cause the system as a whole to fail.

The tool that will help in this step is simply the documentation on the core theories: their available predicates, the meanings of the predicates, illustrative definitions of words in terms of the predicates, and so on.
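To make this concrete, the informal definitions above might be rendered roughly as follows. This is only a sketch: the predicates component and function are assumed to come from the core theories of Goal-Directed Systems and Plans, the variable s ranges over goal-directed systems, change(e,y) says that e is a change in the structure of y, change(e,p,q) that e is a change from state p to state q, and we allow ourselves the loose convention of writing formulas in argument positions to stand for the corresponding reified eventualities.

    damage(e,y)  <-->  component(y,s) & change(e,y)
                       & cause(e, not(function(y,s)))

    intact(y,s)  <-->  component(y,s) & function(y,s)

    repair(e,y)  <-->  change(e, not(intact(y,s)), intact(y,s))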
9. The definitions are written down in the predicate calculus formalism, using the right predicates in the correct fashion. A tool that would help in this step is one that would translate informal specifications into the exact form of the axiom required by the system. The informal specification may be in a restricted subset of English that unambiguously signals its predicate-argument and implicational relationships. Or it could be a kind of VSO, parenthesized expression, such as

    cause(correct procedure, no damage)

We have described the top-down phase of knowledge acquisition as though the relevant core theories were all there. But of course, especially in the early stages, the construction of the core theories will be very much driven by the bottom-up processing. We will have a number of facts, and we will need to construct the core theories in a way that makes it possible to state these facts.
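As a closing illustration of the kind of output step 9 envisions, such a translator might expand the parenthesized specification cause(correct procedure, no damage) into an axiom roughly like

    correct-procedure(p,s) & occur(p) & component(y,s) --> not(damage(p,y))

that is, a correct procedure applied to a goal-directed system damages none of the system's components. This is a sketch only; the predicate names follow the informal definitions of step 8 rather than any actual system.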