Methodology for Knowledge Acquisition

Jerry R. Hobbs
SRI International
Menlo Park, California

and

Costanza Navarretta
Center for Sprogteknologi
Københavns Universitet
Copenhagen, Denmark

Many proposed methodologies for knowledge acquisition can be usefully viewed as consisting of a bottom-up and a top-down phase. In the bottom-up phase, various sorts of data are examined to determine the knowledge that must be encoded. In the top-down phase, the knowledge structures that already exist are used to constrain how that knowledge is to be expressed.

Three kinds of data have been used for the bottom-up phase. The first is citations in a corpus, that is, the uses of words in particular contexts. In order to use this data, one must choose the corpus and choose the words. This is done differently by different researchers. Fillmore and Atkins chose the single word "risk", although this is seen as illustrative of a large class of words. They examined all the uses of this word in a large, mixed, general corpus. Hobbs advocates choosing a corpus associated with some particular domain or application, or perhaps a set of corpora associated with a set of domains. A single text or a small number of texts are selected from the corpora to serve as target texts. The words chosen are then just the words occurring in the target texts. The citations that are examined are all the uses of these words in all of the corpora. Lenat also chooses a target text from a corpus, in his case, one article from a small desk encyclopedia. He examines the uses of the words in this target text only in the target text itself, not in the corpus as a whole.

The citations are then used to generate, or at least suggest, the knowledge to be encoded. Fillmore does this by classifying the citations into those exemplifying similar patterns. Hobbs asks what knowledge underlies or justifies the linguistic contexts the words occur in, in particular, the predicate-argument relations and the coreference relations. Lenat asks more intuitively what knowledge is presupposed by the text. For Fillmore and Hobbs, the task is to uncover precisely the right knowledge. For Lenat, it is to specify as much knowledge as possible.

The second kind of data used in the bottom-up phase is dictionary definitions. One is better off if one uses a good dictionary, such as the Cobuild Dictionary. A dictionary definition is generally derived from a set of citations. It can be viewed as a lexicographer's abstract characterization of these citations. In a sense, then, it reduces to the first type of data.

A third type of data has been used only in Dahlgren's methodology. People are asked to list all of the facts that they know or believe about some concept. This knowledge is then encoded into the knowledge base. This data can also be viewed as data of the first type: people are generating citations for the word for the occasion.

In the top-down phase of knowledge acquisition, the existing knowledge structures are used to constrain how the knowledge is expressed. The user is presented with a set of common patterns with which one can define words by specifying parameters, or in analogy with which one can define the words. This helps to enforce consistency in the knowledge base. These should of course be the right patterns, so that one is not forced to enter wrong information.
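As a minimal sketch of what such pattern-based definition might look like, consider the following; the pattern names, slots, and resulting axiom syntax here are our own illustrative inventions, not any particular system's:

    # A minimal sketch of pattern-based word definition. The patterns,
    # slot names, and the axiom syntax they produce are hypothetical.
    PATTERNS = {
        # "X damages Y": an agent or event degrades a component.
        "cause-harm": "{agent}(x) & {verb}(x,y) --> degraded(y)",
        # "X is a kind of Y": simple taxonomic placement.
        "is-a": "{word}(x) --> {supertype}(x)",
    }

    def define_word(pattern_name: str, **slots: str) -> str:
        """Instantiate a stored pattern by filling in its parameters."""
        return PATTERNS[pattern_name].format(**slots)

    # A user defining "scratch" by analogy with the cause-harm pattern:
    print(define_word("cause-harm", agent="event", verb="scratch"))
    # event(x) & scratch(x,y) --> degraded(y)

The point of such a tool is that the user chooses among vetted patterns rather than inventing ad hoc representations.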
Since the existing knowledge structures may not be adequate for the new knowledge to be defined, it is desirable for the user to be able to add to the kinds of knowledge structures, or at least to escape from the given ontology.

The top-down phase will of course be only as sophisticated as the representation scheme that is used. There are several varieties of representation schemes and kinds of knowledge that have been used. Most systems have a sort hierarchy in which concepts are placed. Ideally, this hierarchy should emerge from a study of the data. One has the feeling, however, that these hierarchies are often pre-determined---pre-empirical, so to speak---and that they consequently force on the knowledge enterer choices that are inappropriate to the domain. For example, some systems force a top-level choice for verbs among physical, mental, and social, whereas many verbs can be used equally well in all these senses. In a way, these are then not characterizations of words, but of word uses. Most systems also specify the selectional constraints that verbs and other predicates impose on their arguments, generally expressed in terms of categories in the sort hierarchy. Several systems (Dahlgren, Lenat, Fillmore) have frame-like structures which allow them to specify more complex information. Finally, there is a miscellaneous class of kinds of knowledge, including common coercions, telic information (Calzolari), the Vendler classification of verbs (Dahlgren), and primary and secondary dimensions of objects for use by dimensional adjectives (Lilog). Hobbs does not distinguish among these various types of knowledge, but uses predicate calculus axioms to represent them all.

We have attempted to apply this method to several words in a small corpus, and it has led us to an elaboration of the steps of the method. We began by choosing a corpus. We rejected using a corpus of petrochemical patents and a corpus of medical discharge reports, because these texts were too dominated by arcane domain knowledge we did not understand. We rejected using a corpus of isolated pages extracted from mechanical patents, because we were afraid that the necessary context for understanding passages would sometimes be missing. We decided on a set of passages from automobile owners' manuals on towing, changing tires, and changing oil, since these seemed more accessible to us and more typical of texts in general. This corpus was not ideal, since many of the texts were translations or were written originally in English by non-native speakers.

Our next task was to choose a target text, a single paragraph whose vocabulary we would attempt to define. Many paragraphs were rejected because the vocabulary was almost entirely technical. Some other paragraphs were rejected because they referred primarily to aspects of the outside world that were very little discussed in the rest of the text. We finally picked a paragraph that was, in a sense, at the boundary between the car and the world, a paragraph about precautions to exercise when having the vehicle towed.

The method as we used it involved nine steps. We will elaborate on them here, illustrating each step with the example of the word "damage". We will also discuss at each step the tools we wished we had had to ease the task.

1. We first look for all occurrences of a word in the corpus. For us, since we used grep, the citations consisted of single lines. This generally gave us an inadequate amount of context, so the first thing we had to do was look at the corpus to recover the full sentence, and in some cases to look at previous sentences to resolve anaphora. Two examples of citations for the word "damage" are

    Proper equipment will help ensure that your vehicle is not damaged while being towed.

and

    Operating with insufficient amount of oil can damage the engine, and such damage is not covered by warranty.

The tool that would have helped at this phase is a fairly simple concordance program that would give us sentences instead of lines. In addition, it would have been good to be able to call up the previous sentence and the following sentence in some cases.
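A minimal sketch of such a sentence-level concordance, assuming only that the corpus is a directory of plain text files (the sentence splitting here is deliberately naive):

    import re
    from pathlib import Path

    def concordance(word, corpus_dir, context=0):
        """Yield whole sentences containing the word, optionally with
        neighboring sentences for context. Matching is by prefix, so
        "damage" also finds "damaged" and "damages"."""
        pattern = re.compile(r"\b" + re.escape(word), re.IGNORECASE)
        for path in Path(corpus_dir).glob("*.txt"):
            text = path.read_text()
            # Naive split on sentence-final punctuation plus whitespace.
            sentences = re.split(r"(?<=[.!?])\s+", text)
            for i, sentence in enumerate(sentences):
                if pattern.search(sentence):
                    lo = max(0, i - context)
                    hi = min(len(sentences), i + context + 1)
                    yield " ".join(sentences[lo:hi])

    # All citations for "damage", with one sentence of context on each side:
    for citation in concordance("damage", "manuals/", context=1):
        print(citation, "\n")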
2. We next reduced the citations to their predicate-argument relations. For us this amounted to rewriting the citations on a separate piece of paper in a stripped-down form. For the word "damage" we went through an interesting transformation while doing this step. Initially, we wrote down entries like

    damage vehicle
    damage engine
    result in damage
    cause damage

Eventually, it occurred to us that for X to cause damage in Y is for X to damage Y, and that we should have been recording the causes of the damage, even in cases where we may need to read through several operators. For example, in

    Avoid overfilling, or the engine could be damaged.

we want to record the stripped-down citation

    overfilling damages engine.

The first tool that this experience suggests is something that allows one to click on the predicate and the arguments in a citation and have the stripped-down citation automatically placed in a separate buffer. But since we may have to go back and revise what we have done later in the investigation, it would also be nice to be able to click on a stripped-down citation and get back to the full citation, so that we can revise or expand the stripped-down citation. A more sophisticated tool would use a parser to find the predicate-argument relations the target word occurs in.

3. The next step is to divide the stripped-down citations, or predicate-argument patterns, into heaps, according to our rough intuitions about what things go together. Thus, we had three separate heaps for "damage". One contained, for example,

    antifreeze damages paintwork
    insufficient oil damages engine
    driving with deflated tire damages tire
    incorrect towing damages transmission

The second heap contained

    prevent damage
    ensure no damage
    risk damage

The third heap contained

    serious damage
    expensive damage
    damage beyond repair
    damage covered by warranty

In fact, these heaps were simply different areas of the page that we wrote our stripped-down citations on. The tool needed at this stage is therefore another simple text-editing tool, one that allows us to divide the buffer into different heaps, to move citations into and between heaps, and to join and split heaps. Where the knowledge base is going to be used in a natural language system that uses weighted abduction, it is possible that the size of these heaps can be used to determine initial values for the weights (a suggestion due to Mark Moens).
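A minimal sketch of how heap sizes might seed such weights; the proportional normalization chosen here is our own illustrative assumption, not a worked-out proposal:

    # Hypothetical heaps of stripped-down citations for "damage".
    heaps = {
        "cause-of-damage": ["antifreeze damages paintwork",
                            "insufficient oil damages engine",
                            "driving with deflated tire damages tire",
                            "incorrect towing damages transmission"],
        "damage-is-bad":   ["prevent damage", "ensure no damage",
                            "risk damage"],
        "cost-of-repair":  ["serious damage", "expensive damage",
                            "damage beyond repair",
                            "damage covered by warranty"],
    }

    total = sum(len(citations) for citations in heaps.values())

    # Initial abduction weight for the axiom behind each heap, proportional
    # to how often that use was attested in the corpus.
    weights = {name: len(cs) / total for name, cs in heaps.items()}
    print(weights)
    # roughly {'cause-of-damage': 0.364, 'damage-is-bad': 0.273,
    #          'cost-of-repair': 0.364}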
4. The next step is to come up with an abstract characterization of the facts about the word that justify each of the heaps. This is a matter of making explicit the intuitions that underlay the classification into categories in step 3. In fact, steps 3 and 4 will often be done in tandem. Recognizing a more abstract characterization may lead us to join two heaps, and failure to find a single abstract characterization may lead us to split a heap.

For the word "damage" we came up with the following abstract characterizations. For the first heap,

    incorrect procedure damages component

For the second,

    damage is bad.

For the third,

    damaged components need to be repaired, and repairs cost money.

In general, the user must use his or her intelligence in this step, but there are tools that could make suggestions. The typical knowledge-based natural language system has a sort hierarchy and selectional constraints and perhaps common coercions for predicates. This suggests two aids. First, if the target word is the predicate, determine from the sort hierarchy what supercategories dominate most of the arguments in a heap, and suggest that as the abstract characterization of the heap. Thus, if "transmission", "engine", and so on, were categorized as auto parts, then the pattern "... damage auto part" would have been suggested. Second, if the target word is an argument in the pattern, suggest the selectional constraints and coercions associated with the predicate. Thus, in "prevent E" and "risk E", there is the presupposition that E is something bad or is coercible into something bad.

Steps 1 through 4 constitute the bottom-up, or data-directed, phase of the acquisition process. They result in a rough characterization of the facts that we want to enter into the knowledge base. The remaining steps constitute the top-down phase and involve entering this knowledge into the system in a way consistent with the way the knowledge already there is encoded.

5. The next step is to gain access to the core theory where the words in the informal statements are defined, or where equivalent terms are defined. For the word "damage" our first guess is that the appropriate theory is something more general than "Vehicles". It should rather be something like "Artifacts", since damage can occur to components of artifacts in general.

One can imagine two tools for doing this. The first would give a kind of map of the core theories, labeling each to indicate the kinds of predicates it makes available, and indicating relations of instantiation with other core theories. The nodes in this network of theories could be expanded on request to give fuller descriptions of the topic of each theory, the predicates central to the theory, and examples of words defined in terms of the theory. The second tool would work from the other direction. It would present the user with the names of the core theories that contain definitions of the words used in the informal statements of facts. The user could then request the more elaborate description of these theories.

6. The next step is to generalize the concept as much as possible. One should find the highest-level theory to which the concept applies, and the word should be defined within that theory. For the word "damage" we generalized to the core theory of "Goal-Directed Systems". In order for the system to suggest higher-level theories, it must have the instantiation relationships among theories explicitly represented. It must know, for example, that Goal-Directed Systems are an instance of Systems, that Artifacts are an instance of Goal-Directed Systems, and that Vehicles are an instance of Artifacts.
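A minimal sketch of such a representation, assuming nothing more than a table of instance-of links between theory names:

    # Hypothetical instance-of links among core theories; each theory
    # names the more abstract theory it instantiates.
    INSTANCE_OF = {
        "Vehicles": "Artifacts",
        "Artifacts": "Goal-Directed Systems",
        "Human Bodies": "Goal-Directed Systems",
        "Social Systems": "Goal-Directed Systems",
        "Goal-Directed Systems": "Systems",
    }

    def generalizations(theory):
        """Walk upward through instance-of links, most specific first."""
        chain = []
        while theory in INSTANCE_OF:
            theory = INSTANCE_OF[theory]
            chain.append(theory)
        return chain

    # Candidate levels at which to define "damage", starting from Vehicles:
    print(generalizations("Vehicles"))
    # ['Artifacts', 'Goal-Directed Systems', 'Systems']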
7. In order to determine if a higher-level theory is the appropriate one at which a word should be defined, other instance core theories should be examined to see if the concept is appropriate at that level. For the word "damage", we generalized immediately from Vehicles to Artifacts because it is obvious that artifacts of any kind can be damaged. To see if we should generalize from Artifacts to Goal-Directed Systems, we examined other core theories that instantiated Goal-Directed Systems, namely, organizations and other social systems, cognitive systems, human bodies, trees, and so on. Among the things that can be damaged are a reputation and a blood vessel (especially if looked at from the physician's point of view). Natural objects can be damaged, but only if they are considered from a functional point of view. We would not say a stalk of wheat is damaged when we cut it down to harvest it, since the normal functional point of view for wheat is its role as food. We can speak of damage to a tree, but we are either viewing the tree as a goal-directed system whose goal is to grow and propagate, or we are viewing it from a human-oriented economic or aesthetic perspective. It sounds strange to say that a mountain is damaged, but we can say it if we view the mountain from an economic or aesthetic perspective. The notion of damage is no longer appropriate if we generalize from Goal-Directed Systems to Systems, where there is no notion of functionality. If two asteroids pass near each other and alter each other's orbits, we would not say the solar system has thereby been damaged.

It may be that while the word is appropriate for most instance core theories of an abstract core theory, there are one or two for which it does not sound natural. For example, it does not sound right to say that a person hit by a car was damaged. The person was rather injured. There are three possible explanations of (and responses to) this situation. The first is that the word "injure" has driven out the word "damage", since it applies more specifically to people. That is, both "damage" and "injure" are interpretable in this situation, but since "injure" is more specific, it has become the conventional way to refer to damage to humans. In this case we need not explicitly exclude the possibility of damage to humans, but the generation component of the natural language system must have a way of favoring the more specific, more conventional modes of expression. The second way out is to simply stipulate that the word "damage" is not appropriate when applied to humans, by means of an axiom like

    damage(x,y) --> nonhuman(y)

The third way out is to deny that a human being, when viewed as a whole rather than, say, in terms of his or her role in a social organization, is a component in a goal-directed system; the person then has no function and hence cannot be damaged.

The tool associated with this step in the knowledge acquisition process is one which would allow one to navigate among core theories via "instance-of" relations, and again to click on theory labels to get to a full description of the predicates of the theories and the words defined in terms of them. Another level of detail would be the axioms themselves. Something else that can happen at this stage is a certain amount of consistency checking. The system could examine instance core theories and point out to the user that if a word is defined at this level, these are some of the consequences.

8. We have now decided on the level of core theory at which the concept should be defined. We next need to gain access to the predicates in this core theory in order to define our words. For the word "damage", we have decided on the core theory of Goal-Directed Systems. This relies on a core theory of Plans, which gives us access to the structure of a plan by providing functions for the top goal of a plan and for the subgoals of the plan. A goal-directed system is then a system with a plan. The components of the system are subsystems or individual components, and the functions of a component are the goal eventualities which involve that component as an argument.

We can then define the relevant words on the lexical periphery. "Damage" is a change in the structure of a component of a goal-directed system that causes the component not to perform its function in the system. A component is intact if it performs its function. A procedure acting on the system is some event or sequence of events that changes the state of the system or some of its components. A correct procedure is one which does not cause the system not to function. Something is bad with respect to a goal if it causes that goal not to come about. To repair something is to change it from being not intact to being intact.

Once we have axiomatized the core theory properly and defined the words in these ways, it should be possible to prove the following statements as theorems:

    It is possible for an incorrect procedure to damage a component.
    A correct procedure causes no damage.
    Damage is bad.
    Damage to a component may cause the system as a whole to fail.

The tool that will help in this step is simply the documentation on the core theories: their available predicates, the meanings of the predicates, illustrative definitions of words in terms of the predicates, and so on.
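To make this concrete, the informal definitions above might be rendered roughly as follows. This is only a sketch: the predicates component and function are assumed to come from the core theories of Goal-Directed Systems and Plans, the variable s ranges over goal-directed systems, change(e,y) says that e is a change in the structure of y, change(e,p,q) that e is a change from state p to state q, and we allow ourselves the loose convention of writing formulas in argument positions to stand for the corresponding reified eventualities.

    damage(e,y)  <-->  component(y,s) & change(e,y)
                       & cause(e, not(function(y,s)))

    intact(y,s)  <-->  component(y,s) & function(y,s)

    repair(e,y)  <-->  change(e, not(intact(y,s)), intact(y,s))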
9. The definitions are written down in the predicate calculus formalism, using the right predicates in the correct fashion. A tool that would help in this step is one that would translate informal specifications into the exact form of the axiom required by the system. The informal specification may be in a restricted subset of English that unambiguously signals its predicate-argument and implicational relationships. Or it could be a kind of VSO, parenthesized expression, such as

    cause(correct procedure, no damage)

We have described the top-down phase of knowledge acquisition as though the relevant core theories were all there. But of course, especially in the early stages, the construction of the core theories will be very much driven by the bottom-up processing. We will have a number of facts, and we will need to construct the core theories in a way that makes it possible to state these facts.
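As a closing illustration of the kind of output step 9 envisions, such a translator might expand the parenthesized specification cause(correct procedure, no damage) into an axiom roughly like

    correct-procedure(p,s) & occur(p) & component(y,s) --> not(damage(p,y))

that is, a correct procedure applied to a goal-directed system damages none of the system's components. This is a sketch only; the predicate names follow the informal definitions of step 8 rather than any actual system.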