LFToolkit


Description:

LFToolkit is a program for deriving the Logical Form of English sentences. It takes a parse tree as input and translates it into a logical representation according to transformation rules specified by the user. More generally, it can operate on any kind of tree structure, so it can also be applied to other tree-transformation tasks. LFToolkit was developed by Nishit Rathod.

Download LFToolkit:

The program is written in Java and is available for download in the form of a jar file:

It has been compiled using J2SE (version: 1.4.2_05) on Windows XP. You will also need the Xerces2 Java Parser, which you can download from http://xml.apache.org/xerces2-j/. The Xerces jar file needs to be on the system classpath.

Using LFToolkit:

LFToolkit can be used in one of two ways: from the command line, or from within a Java program.

Before looking at ways to invoke the program, we will discuss the input that the program takes.

LFToolkit takes two files as input: an XML file and a transformation rules file.

XML file:

The XML file represents the parse tree of an English sentence (or phrase). Consider the following Noun Phrase:

a black box
i.e. NP -> DT JJ NN
In Penn Treebank-style format: (NP (DT a) (JJ black) (NN box))

This parse tree can be represented as an XML document like so:

<treebank>
<tree>
<NP>
<DT>a</DT>
<JJ>black</JJ>
<NN>box</NN>
</NP>
</tree>
</treebank>

XML-1
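
If your parser emits Penn Treebank-style bracketings, they have to be converted to an XML form such as XML-1 first. The following is a minimal sketch of such a conversion and is not part of LFToolkit: the class name PennToXml is hypothetical, the input is assumed to be well-formed with every leaf written as (TAG word), and neither XML escaping nor tags that are invalid as XML element names (for example punctuation tags) are handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PennToXml {

    /** Converts one bracketed tree, e.g. "(NP (DT a) (JJ black) (NN box))", to XML-1 style. */
    public static String convert(String bracketed) {
        StringBuilder xml = new StringBuilder("<treebank>\n<tree>\n");
        Deque<String> open = new ArrayDeque<>(); // labels of the nodes that are still open
        // Pad parentheses with spaces so that a whitespace split yields clean tokens.
        String[] tokens = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String tok = tokens[i];
            if (tok.equals("(")) {
                String label = tokens[++i]; // the token after "(" is the node label
                open.push(label);
                xml.append('<').append(label).append('>');
                // Another "(" right after the label means child nodes follow: start a new line.
                if (tokens[i + 1].equals("(")) {
                    xml.append('\n');
                }
            } else if (tok.equals(")")) {
                xml.append("</").append(open.pop()).append(">\n");
            } else {
                xml.append(tok); // the word under a preterminal
            }
        }
        return xml.append("</tree>\n</treebank>").toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("(NP (DT a) (JJ black) (NN box))"));
    }
}
```

For the noun phrase above, the output is the document shown in XML-1.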

There is no restriction on the XML schema. The same parse tree can also be represented in the following manner:

<treebank>
<tree>
<pe synt="NP">
<pe synt="DT" lex="a">a</pe>
<pe synt="JJ" lex="black">black</pe>
<pe synt="NN" lex="box">box</pe>
</pe>
</tree>
</treebank>

XML-2

Transformation Rules file:

This is the most important part. The transformation rules file is divided into three sections in the following order:

Variables:
Here we define the list of variables used in the logical representation. This section starts with the following line:
#Variables


Entry points:
This section gives the mapping between the root of the XML node and the corresponding transformation rule to be used to process that node. This section starts with the following line:
#EntryPoints


Transformation Rules:
This section lists the rules used to get the output logical representation given a parse tree. This section starts with the following line:
#TransformationRules


To understand the file format we will look at a few examples. The program offers a lot of power and flexibility, so we will start with a simple example and make it progressively more complex. Let's start with the tree shown in XML-1. The desired logical form would be:

box'(e0,x0)&black'(e1,x0)

The transformation rules file needed to achieve this transformation is:

#Variables
x = "x"
e = "e"

#EntryPoints
NP -> <NP,e-1,x-1>

//this is a comment
//blank lines are ignored

#TransformationRules

//--------------------- Noun Phrases ----------------------

<NP,e-1,x-1> -> DT JJ:<OUT,e-2,x-1> NN:<OUT,e-1,x-1>

//--------------------- Output -------------------------

<OUT,any-1,any-2> => .lex+"'("+any-1+","+any-2+")"

Let's go through each section of this file. In the variables section we define two variables named x and e, with the corresponding output strings being "x" and "e", respectively. We will see how these strings are used when we discuss the format of transformation rules.

In the second section we establish the mapping between the root of the parse tree and the transformation rule to be used to process the parse tree. In the above example we state that if the root of the parse tree is NP, the transformation rule with the signature <NP,e-1,x-1> is to be used. Now, let us understand the meaning of the rule signature <NP,e-1,x-1>. A transformation rule can be thought of as a function call and the rule signature as the function declaration. The first symbol within the angle brackets (NP in this case) is the function name (id) and the rest are the function arguments. Please note the manner in which variables are used. In the above example e-1, x-1, any-1, etc. are all variables. The symbol preceding the hyphen indicates the "type" of the variable (as defined in the Variables section), and the number following the hyphen uniquely identifies a particular instance of the variable locally, within the scope of a transformation rule. any is a special kind of variable that can be used as a wildcard.

In the third section the transformation rules are listed. A transformation rule may either produce output (output rules) or produce no output at all (non-output rules). Each transformation rule has a left-hand side (LHS), a right-hand side (RHS) and a connecting symbol. For output rules "=>" is used as the connecting symbol, and for non-output rules "->" is used. In either case, the LHS of a transformation rule is the rule signature (discussed above). The RHS of a non-output rule is the expansion pattern of the current node, along with the rule signatures identifying the rules to be used to further process the children of the current node. Let's try to understand what this rule means:

<NP,e-1,x-1> -> DT JJ:<OUT,e-2,x-1> NN:<OUT,e-1,x-1>
This rule states that if the current node expands into a DT, a JJ and an NN, then do not process the DT-node, process the JJ-node with the transformation rule having the signature <OUT,e-2,x-1>, and process the NN-node with the transformation rule having the signature <OUT,e-1,x-1>.

Now consider the following output rule:

<OUT,any-1,any-2> => .lex+"'("+any-1+","+any-2+")"
The RHS of an output rule consists of a sequence of strings concatenated with the + operator. The program provides a mechanism to refer to attributes of the XML nodes: in the above example, .lex refers to an attribute named "lex" of the current node.

Now we look at how variables are initialized. Associated with each variable (defined in the Variables section) is a counter. Every time a new instance of a variable is encountered, the corresponding counter is incremented. A variable's output string and the counter value together determine the value of an instance of the variable. Initially all counters are set to 0. Suppose the variable e-1 is encountered at runtime. Its value will be the concatenation of the output string of the variable e and the current value of the counter for e, i.e. the string "e0". Similarly, the value of the variable e-2 will be the string "e1".

The following diagram should make it clear:

diag-1

Please note that the outputs are concatenated with "&" (ampersand).
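
The counter mechanism and the output rule can be mimicked in a few lines of Java. This is only a sketch of the behaviour as documented here, not LFToolkit's actual code. The class VariableScope and its methods are hypothetical names, and a single object models one rule application only. The order in which the two fragments are joined is chosen to match the desired logical form shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

public class VariableScope {
    private final Map<String, String> outputStrings;                 // type -> output string (#Variables)
    private final Map<String, Integer> counters = new HashMap<>();   // type -> number of instances so far
    private final Map<String, String> instances = new HashMap<>();   // e.g. "e-1" -> "e0"

    public VariableScope(Map<String, String> outputStrings) {
        this.outputStrings = outputStrings;
    }

    /** Returns the value of an instance such as "e-1", assigning a new one on first use. */
    public String resolve(String instance) {
        return instances.computeIfAbsent(instance, key -> {
            String type = key.substring(0, key.indexOf('-'));
            int n = counters.merge(type, 1, Integer::sum) - 1; // the first instance gets 0
            return outputStrings.get(type) + n;
        });
    }

    /** Mimics the output rule  .lex+"'("+any-1+","+any-2+")"  for one node. */
    public static String out(String lex, String first, String second) {
        return lex + "'(" + first + "," + second + ")";
    }

    public static void main(String[] args) {
        Map<String, String> vars = new HashMap<>();
        vars.put("x", "x");
        vars.put("e", "e");
        VariableScope scope = new VariableScope(vars);

        // <NP,e-1,x-1> -> DT JJ:<OUT,e-2,x-1> NN:<OUT,e-1,x-1>
        String e1 = scope.resolve("e-1"); // "e0"
        String x1 = scope.resolve("x-1"); // "x0"
        String e2 = scope.resolve("e-2"); // "e1"

        // Fragments are joined with "&"; prints box'(e0,x0)&black'(e1,x0)
        System.out.println(out("box", e1, x1) + "&" + out("black", e2, x1));
    }
}
```

Resolving the same instance twice (for example x-1, which is passed to both children) returns the same value, which is what links the adjective and the noun to the same x0.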

Now let's consider the tree shown in XML-2. Here all the nodes have the same name ("pe"). They can be distinguished by the value of the attribute "synt". This can be achieved by using the following syntax:

element_name@attribute_name="attribute_value"
For example:
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
The entry points will change accordingly. Here's the new transformation rules file:

#Variables
x = "x"
e = "e"

#EntryPoints
pe@synt="NP" -> <NP,e-1,x-1>

//this is a comment
//blank lines are ignored

#TransformationRules

//--------------------- Noun Phrases ----------------------

<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>

//--------------------- Output -------------------------

<OUT,any-1,any-2> => .lex+"'("+any-1+","+any-2+")"
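
To make the meaning of a selector such as pe@synt="DT" concrete, here is a small Java sketch that parses a selector into an element name plus attribute conditions and tests it against a node's name and attributes. It is an illustration only: the class NodeSelector is a hypothetical name, the node is represented by a plain map rather than a DOM element, and the way LFToolkit actually parses selectors may differ.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NodeSelector {
    // One condition of the form  @attribute_name="attribute_value"
    private static final Pattern CONDITION = Pattern.compile("@([^=@]+)=\"([^\"]*)\"");

    private final String elementName;
    private final Map<String, String> required = new LinkedHashMap<>();

    /** Parses a selector such as  pe@synt="DT"  or a bare element name such as  NP. */
    public NodeSelector(String selector) {
        int at = selector.indexOf('@');
        elementName = at < 0 ? selector : selector.substring(0, at);
        Matcher m = CONDITION.matcher(selector);
        while (m.find()) {
            required.put(m.group(1), m.group(2));
        }
    }

    /** True if the node has the expected name and every required attribute value. */
    public boolean matches(String name, Map<String, String> attributes) {
        if (!elementName.equals(name)) {
            return false;
        }
        for (Map.Entry<String, String> cond : required.entrySet()) {
            if (!cond.getValue().equals(attributes.get(cond.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        NodeSelector dt = new NodeSelector("pe@synt=\"DT\"");
        System.out.println(dt.matches("pe", Collections.singletonMap("synt", "DT"))); // true
        System.out.println(dt.matches("pe", Collections.singletonMap("synt", "JJ"))); // false
    }
}
```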

We can also condition on multiple attributes:

<NP,e-1,x-1> -> pe@synt="DT"@lex="a" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
Here, the first child must have an attribute called "synt" with the value "DT" and another attribute called "lex" with the value "a". Suppose that we wanted the attribute "lex" of the first child to have the value "a" or "the". We can achieve this using the following rule:
<NP,e-1,x-1> -> pe@synt="DT"@lex="a,the" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
In general, we can use a comma-separated list for the attribute value. Now, suppose you wanted to make the determiner optional. You could do this by having the following 2 rules:
<NP,e-1,x-1> -> pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
Suppose you also want the adjective to be optional. Then you expand the rule base further:
<NP,e-1,x-1> -> pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="NN":<OUT,e-1,x-1>
Imagine having a rule for each expansion pattern of the English language. Even for decent coverage we would need hundreds of such transformation rules, which would quickly become unmanageable. To simplify this, the program lets you use Regular Expressions to specify the transformation rules. Here's how you could compactly represent the above 4 rules:
<NP,e-1,x-1> -> pe@synt="DT"? pe@synt="JJ":<OUT,e-2,x-1>? pe@synt="NN":<OUT,e-1,x-1>
We have used the ? quantifier in the above example. You can use the following features of Regular Expressions (familiarity with RE is assumed):

Variables within Kleene Closure

In the rule

<NP,e-1,x-1,any-1> -> pe@synt="S-DET"? pe@synt="S-ADJ":<ADJ,e-n,x-1>* pe@synt="S-NOUN":<NOUN,e-1,x-1>
The variable 'e-n' is a special kind of variable. Each instance of 'S-ADJ' will be initialized with a different value for the variable 'e-n'.
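
The two ideas above can be illustrated with a short Java sketch (hypothetical helper code, not the toolkit's implementation). It encodes the synt values of a node's children as a string, checks that string against a java.util.regex pattern equivalent to S-DET? S-ADJ* S-NOUN, and gives every S-ADJ child its own e value, as e-n is described to do. The class name KleeneDemo and the assumption that the counter for e starts at 0 with e-1 allocated first are mine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class KleeneDemo {
    // Equivalent of  S-DET? S-ADJ* S-NOUN  over labels that are each followed by a space.
    private static final Pattern NP_PATTERN = Pattern.compile("(S-DET )?(S-ADJ )*S-NOUN ");

    /** True if the sequence of child labels is an expansion accepted by the rule. */
    public static boolean matches(List<String> childLabels) {
        StringBuilder sb = new StringBuilder();
        for (String label : childLabels) {
            sb.append(label).append(' ');
        }
        return NP_PATTERN.matcher(sb).matches();
    }

    /** Pairs each child label with the e value it would be processed with ("-" if none). */
    public static List<String> assign(List<String> childLabels) {
        List<String> result = new ArrayList<>();
        int eCounter = 0;
        String head = "e" + eCounter++; // e-1 from the rule signature
        for (String label : childLabels) {
            if (label.equals("S-ADJ")) {
                result.add(label + ":e" + eCounter++); // a fresh value for every e-n instance
            } else if (label.equals("S-NOUN")) {
                result.add(label + ":" + head);
            } else {
                result.add(label + ":-"); // S-DET has no rule attached
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("S-DET", "S-ADJ", "S-ADJ", "S-NOUN");
        System.out.println(matches(children)); // true
        System.out.println(assign(children));  // [S-DET:-, S-ADJ:e1, S-ADJ:e2, S-NOUN:e0]
    }
}
```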

Accumulating variables

Consider the following Noun Phrase:

NP: the green apples
Let's say we'd like to have the following logical form representation for it:
apple'(e0,x0) & green'(e1,x0) & dset(s0,x0,e0&e1)

The transformation rule to achieve this is:

NP<e-1,s-1> -> DT? JJ:<OUT,e-n1,x-1>* NN:<PNN,e-n2,x-n1,x-1>* NNS:<OUTPlu,e-1,x-1,s-1,av<e-n1&e-n2>>
Here we make use of the special operator av<list_of_variables>. The variables in the list are separated by ampersands (&). The operator makes sure that uninitialized variables are not printed in the output. In the above example the variable e-n2 is not initialized (since the NP does not contain a prenominal noun) and, hence, it is not used in the output.

LF Fragments

This feature lets you process an input tree even if some of its nodes are not covered by the transformation rules. The resulting logical form representation is only partially correct, in that the variables are not correctly linked.

Consider the following transformation rules file:

#Variables 
x = "x"
e = "e"

#EntryPoints
#ep1
S -> <S,e-1>
#ep2
JJ -> <OUT,e-1,x-1>
#ep3
NN -> <OUT,e-1,x-1>

#TransformationRules

<S,e-1> -> NP:<NP,e-2,x-1> VP:<VP,e-1,x-1>

//--------------------- Noun Phrases ----------------------
<NP,e-1,x-1> -> DT NN:<OUT,e-1,x-1>
<NP,e-1,x-1> -> DT JJ:<OUT,e-2,x-1> NN:<OUT,e-1,x-1>


//--------------------- Verb Phrases ----------------------
<VP,e-1,x-1> -> AUX ADJP:<ADJP,e-1,x-1>

//--------------------- Adjective Phrases ----------------------
<ADJP,e-1,x-1> -> JJ:<OUT,e-1,x-1>

//--------------------- Output -------------------------
<OUT,any-1,any-2> => .pred+"'("+any-1+","+any-2+")"

Let's say the following tree is processed using the above transformation rules file:


This is how it's processed:


There is no transformation rule that could be used to process the NP. The toolkit then checks whether any entry points defined in the #EntryPoints section can be used to process the child nodes of the NP. It locates entry points for NN and JJ and produces the output accordingly.

Using LFToolkit from command line

We have included a reference implementation in the distribution. You can invoke it from the command line as follows:

% java -jar LFToolkit.jar transformation_rules_file input_xml_file output_file [attr1 attr2...]
Here the attribute list is optional. It is used to print the expansion strings of the nodes that could not be processed.

Sample usage:

% java -jar LFToolkit.jar sample.trf sample.xml output.txt synt

The XML file may contain any number of parse trees. The program will process each parse tree independently. Note that while the program doesn't restrict the XML schema in any way, it expects the parse trees to be the children of the root node. The following example contains 2 parse trees:


<treebank system="contex" version="2.9.153" language="EN" date="Tue Nov 23 11:22:57 PST 2004">

<tree id="acerbity">
<pe surf="a sharp and bitter manner" synt="S-NP" lex="manner">
<pe surf="a" synt="S-INDEF-ART" lex="a" roles="DET">
<txt>a</txt>
</pe>
<pe surf="sharp and bitter" synt="S-ADJP" lex="sharp" roles="MOD">
<pe surf="sharp" synt="S-ADJ" lex="sharp" roles="PRED">
<txt>sharp</txt>
</pe>
<pe surf="and" synt="S-COORD-CONJ" lex="and" roles="CONJ">
<txt>and</txt>
</pe>
<pe surf="bitter" synt="S-ADJ" lex="bitter" roles="MOD">
<txt>bitter</txt>
</pe>
</pe>
<pe surf="manner" synt="S-NOUN" lex="manner" roles="PRED">
<txt>manner</txt>
</pe>
</pe>
</tree>

<tree id="acrimony">
<pe surf="a sharp and bitter manner" synt="S-NP" lex="manner">
<pe surf="a" synt="S-INDEF-ART" lex="a" roles="DET">
<txt>a</txt>
</pe>
<pe surf="sharp and bitter" synt="S-ADJP" lex="sharp" roles="MOD">
<pe surf="sharp" synt="S-ADJ" lex="sharp" roles="PRED">
<txt>sharp</txt>
</pe>
<pe surf="and" synt="S-COORD-CONJ" lex="and" roles="CONJ">
<txt>and</txt>
</pe>
<pe surf="bitter" synt="S-ADJ" lex="bitter" roles="MOD">
<txt>bitter</txt>
</pe>
</pe>
<pe surf="manner" synt="S-NOUN" lex="manner" roles="PRED">
<txt>manner</txt>
</pe>
</pe>
</tree>

</treebank>

XML-3
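
Since the parse trees are expected to be the children of the root node, a program that handles them one at a time has to enumerate those children first. The sketch below does this with the standard JAXP DOM API (which Xerces2 implements). The class name TreeLister is hypothetical, and the inline XML is a shortened stand-in for XML-3.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class TreeLister {

    /** Returns the element children of the document root, one per parse tree. */
    public static List<Element> parseTrees(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse the XML input", ex);
        }
        List<Element> trees = new ArrayList<>();
        // Whitespace between tags shows up as text nodes, so keep element nodes only.
        for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                trees.add((Element) n);
            }
        }
        return trees;
    }

    public static void main(String[] args) {
        String xml = "<treebank>\n"
                + "<tree id=\"acerbity\"><pe synt=\"S-NP\" lex=\"manner\"/></tree>\n"
                + "<tree id=\"acrimony\"><pe synt=\"S-NP\" lex=\"manner\"/></tree>\n"
                + "</treebank>";
        for (Element tree : parseTrees(xml)) {
            System.out.println(tree.getTagName() + " " + tree.getAttribute("id"));
        }
    }
}
```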

Samples:

Set 1 Set 2 Set 3 Set 4

Using LFToolkit from within a Java program

In order to use the toolkit from within your Java program you need to create an object of the LFToolkit class. You need to be familiar with the Xerces2 Java Parser. Here's a description of the functions of this class accessible to your program:

LFToolkit: Constructor
LFToolkit() Initializes a newly created LFToolkit object
LFToolkit(String ruleFilename) Initializes a newly created LFToolkit object and loads the transformation rules file pointed to by ruleFilename


LFToolkit: Methods
load(String ruleFilename) Loads the transformation rules file pointed to by ruleFilename
LOutput process(Node parseTree) Processes the Node (refer to the Xerces2 documentation at http://xml.apache.org/xerces2-j/javadocs/api/index.html) using the transformation rules (specified using the load() function) and returns an object of the LOutput class (described below). Here, parseTree is the XML representation of the parse tree to be processed and must point to the root of the parse tree. There must be an entry point for this node in the Entry points section of the transformation rules file.

Here's a list of the functions of the LOutput class:

LOutput: Methods
boolean getStatus() Returns true if the processing was successful, false otherwise.
String getLString() Returns the logical form.
HashMap getVars() Returns a HashMap of the head variables of the logical form. The keys of this HashMap are the variable names used in the transformation rules file and the values are the values these variables are assigned in the logical representation.
Node getTrace() This function can be used for debugging or for analyzing the working of the toolkit. The Node object returned by this function is a copy of the Node object passed to the process() function of the LFToolkit class, with some additional information added by the program. As the program processes the nodes of the XML tree, it adds an attribute named "ruleid" to each of them. The value of this attribute is a number that identifies the transformation rule used to process that particular node. This number is zero-based and indicates the position of the transformation rule as listed in the transformation rules file. If a node doesn't have this attribute, the node wasn't processed by the program, most probably because no matching rule could be found.
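
As a usage sketch for getTrace(), the helper below walks a DOM tree and collects the elements that carry no "ruleid" attribute, that is, the nodes that were not processed. To keep the example runnable without LFToolkit.jar, the trace is a hand-written XML string standing in for the Node that getTrace() would return. The class name TraceInspector, the ruleid values shown and the use of the synt attribute as a node label are assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TraceInspector {

    /** Collects, in document order, the synt labels of elements without a ruleid attribute. */
    public static List<String> unprocessed(Node node) {
        List<String> missing = new ArrayList<>();
        collect(node, missing);
        return missing;
    }

    private static void collect(Node node, List<String> missing) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element e = (Element) node;
            if (!e.hasAttribute("ruleid")) {
                missing.add(e.getAttribute("synt"));
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, missing);
        }
    }

    /** Parses a string into a DOM element; stands in for LOutput.getTrace() in this sketch. */
    public static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("could not parse the XML input", ex);
        }
    }

    public static void main(String[] args) {
        // With the toolkit on the classpath, the trace would instead come from something like:
        //   Node trace = new LFToolkit("sample.trf").process(parseTree).getTrace();
        Node trace = parse("<pe synt=\"NP\" ruleid=\"0\">"
                + "<pe synt=\"DT\"/>"
                + "<pe synt=\"JJ\" ruleid=\"1\"/>"
                + "<pe synt=\"NN\" ruleid=\"1\"/>"
                + "</pe>");
        System.out.println(unprocessed(trace)); // [DT]
    }
}
```

In a real program it makes sense to check getStatus() first and to inspect the trace only when processing was not successful.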

Contact Us:

If you have questions or comments, email hobbs at isi.edu