LFToolkit
LFToolkit is a program for deriving the Logical Form of English sentences. It takes a parse tree as input and translates it into a logical representation based on transformation rules specified by the user. In general, it can operate on any kind of tree structure, so it can also be applied to other tree-transformation tasks. LFToolkit was developed by Nishit Rathod.
The program is written in Java and is available for download in the form of a jar file:
It has been compiled using J2SE (version: 1.4.2_05) on Windows XP. You will also need the Xerces2 Java Parser, which you can download from http://xml.apache.org/xerces2-j/. The Xerces jar file needs to be on the system classpath.
Using LFToolkit
LFToolkit can be used in one of two ways: from the command line, or from within your own Java program.
Before looking at ways to invoke the program, we will discuss the input that the program takes.
LFToolkit takes 2 files as input: an XML file and a transformation rules file.
XML file: The XML file represents the parse tree of an English sentence (or phrase).
Consider the following Noun Phrase:
a black box
i.e. NP -> DT JJ NN
In Penn Treebank-style format: (NP (DT a) (JJ black) (NN box))
This parse tree can be represented as an XML document like so:
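[Listing XML-1: <treebank> document in which each node is named after its syntactic category (NP, DT, JJ, NN)]
[Listing XML-2: <treebank> document in which every node is named "pe" and the category is given by the "synt" attribute]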
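To illustrate the mapping from bracketed notation to XML, here is a minimal Java sketch that converts a Penn Treebank-style string into a treebank-style XML string. The class name PennToXml and the element layout (one element per category, with the word stored both as a "lex" attribute and as text) are assumptions made for illustration, based on how the transformation rules refer to node names and to the "lex" attribute. It is not necessarily the exact layout of XML-1, and it is not part of LFToolkit.

```java
/** Sketch: convert a Penn Treebank bracketing into a treebank-style XML string. */
class PennToXml {
    private final String[] tokens;
    private int pos = 0;

    private PennToXml(String bracketed) {
        // Pad parentheses with spaces so that a whitespace split yields clean tokens.
        tokens = bracketed.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    /** Converts one bracketed tree and wraps it in a <treebank> root element. */
    static String convert(String bracketed) {
        PennToXml parser = new PennToXml(bracketed);
        String tree = parser.node();
        if (parser.pos != parser.tokens.length) {
            throw new IllegalArgumentException("Trailing input after the tree");
        }
        return "<treebank>" + tree + "</treebank>";
    }

    // Grammar: node := "(" LABEL ( WORD | node+ ) ")"
    // Assumption: category labels are valid XML element names (NP, DT, JJ, ...).
    private String node() {
        expect("(");
        String label = next();
        StringBuilder xml = new StringBuilder();
        if (peek().equals("(")) {
            // Internal node: recurse into each child.
            xml.append("<").append(label).append(">");
            while (peek().equals("(")) {
                xml.append(node());
            }
            xml.append("</").append(label).append(">");
        } else {
            // Leaf node: keep the word as the (assumed) lex attribute and as text.
            String word = escape(next());
            xml.append("<").append(label).append(" lex=\"").append(word).append("\">")
               .append(word).append("</").append(label).append(">");
        }
        expect(")");
        return xml.toString();
    }

    private String peek() {
        return pos < tokens.length ? tokens[pos] : "";
    }

    private String next() {
        if (pos >= tokens.length || tokens[pos].equals(")")) {
            throw new IllegalArgumentException("Unexpected end of node at token " + pos);
        }
        return tokens[pos++];
    }

    private void expect(String token) {
        if (!peek().equals(token)) {
            throw new IllegalArgumentException("Expected '" + token + "' at token " + pos);
        }
        pos++;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(convert("(NP (DT a) (JJ black) (NN box))"));
    }
}
```

Labels that are not valid XML names (for example ones containing "$") would need to be renamed before such a conversion.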
Transformation Rules file:
This is the most important part. The transformation rules file is divided into three sections, in the following order: variables, entry points, and transformation rules.
To understand the file format we will look at a few examples. The program offers a lot of power and flexibility. We will start with a simple example and make it progressively more complex. Let's start with the tree shown in XML-1. The desired logical form would be:
box'(e0,x0)&black'(e1,x0)
The transformation rules file needed to achieve this transformation is:
[Listing: transformation rules file for XML-1, beginning with the #Variables section]
Let's go through each section of this file. In the variables section we define 2 variables named e and x, with the corresponding output strings "e" and "x", respectively. We will see how these strings are used when we discuss the format of transformation rules.
In the second section we establish the mapping between the root of the parse tree and the transformation rule to be used to process the parse tree. In the above example we state that if the root of the parse tree is NP, the transformation rule corresponding to the signature <NP,e-1,x-1> is to be used. Now, let us understand the meaning of the rule signature <NP,e-1,x-1>. A transformation rule can be thought of as a function call and the rule signature as the function declaration. The first symbol within the angle brackets (NP in this case) is the function name (id) and the rest are the function arguments. Please note the manner in which variables are used. In the above example e-1, x-1, any-1, etc. are all variables. The symbol preceding the hyphen indicates the "type" of the variable (as defined in the Variables section), and the number following the hyphen uniquely identifies a particular instance of the variable locally, within the scope of a transformation rule. any is a special kind of variable that can be used as a wildcard.
In the third section the transformation rules are listed. A transformation rule may either produce output (output rules) or produce no output at all (non-output rules). Each transformation rule has a left-hand side (LHS), a right-hand side (RHS) and a connecting symbol. For output rules "=>" is used as the connecting symbol, and for non-output rules "->" is used. In either case, the left-hand side of a transformation rule is the rule signature (discussed above). The RHS of a non-output rule is the expansion pattern of the current node, along with the rule signatures identifying the rules to be used to further process the children of the current node. Let's try to understand what this rule means:
<NP,e-1,x-1> -> DT JJ:<OUT,e-2,x-1> NN:<OUT,e-1,x-1>
This rule states that if the current node expands into a DT, a JJ and an NN, then do not process the DT-node, process the JJ-node with the transformation rule having the rule signature <OUT,e-2,x-1>, and process the NN-node with the transformation rule having the signature <OUT,e-1,x-1>.
Now consider the following output rule:
<OUT,any-1,any-2> => .lex+"'("+any-1+","+any-2+")"
The RHS of an output rule consists of a sequence of strings concatenated
with the + operator. The program provides a mechanism to refer to attributes
of the XML nodes. In the above example .lex refers to an attribute
named "lex" of the current node. Now we look at how variables are initialized.
Associated with each variable (defined in the Variables section) is a counter.
Every time a new instance of a variable is encountered the corresponding
counter is incremented. A variable's output string and the counter value
together determine the value of an instance of the variable. Initially all
counters will be initialized to 0. Suppose the variable e-1 is encountered
at runtime. The value of this variable will be the concatenation of the output
string of the variable e and the current value of the counter for the
variable e, i.e. the string "e0". Similarly, the value of the variable e-2
will be the string "e1".
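The counter mechanism can be sketched in a few lines of Java. This is only an illustration of the behaviour described above, modelling the scope of a single transformation rule; the class name VariableScope and its methods are hypothetical and are not part of the LFToolkit API.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of counter-based variable naming within the scope of one rule. */
class VariableScope {
    // Output string per variable type, as declared in the Variables section.
    private final Map<String, String> outputStrings = new HashMap<>();
    // One counter per variable type; all counters start at 0.
    private final Map<String, Integer> counters = new HashMap<>();
    // Values already assigned to instances such as "e-1" in this scope.
    private final Map<String, String> instances = new HashMap<>();

    void define(String type, String outputString) {
        outputStrings.put(type, outputString);
        counters.put(type, 0);
    }

    /** Returns the value of an instance such as "e-1", assigning one on first use. */
    String valueOf(String instance) {
        String known = instances.get(instance);
        if (known != null) {
            return known; // the same instance always keeps the same value
        }
        int hyphen = instance.indexOf('-');
        if (hyphen < 0) {
            throw new IllegalArgumentException("Not a variable instance: " + instance);
        }
        String type = instance.substring(0, hyphen);
        Integer counter = counters.get(type);
        if (counter == null) {
            throw new IllegalArgumentException("Undefined variable type: " + type);
        }
        // New instance: output string plus current counter, then advance the counter.
        String value = outputStrings.get(type) + counter;
        counters.put(type, counter + 1);
        instances.put(instance, value);
        return value;
    }

    public static void main(String[] args) {
        VariableScope scope = new VariableScope();
        scope.define("e", "e");
        scope.define("x", "x");
        String nn = "box'(" + scope.valueOf("e-1") + "," + scope.valueOf("x-1") + ")";
        String jj = "black'(" + scope.valueOf("e-2") + "," + scope.valueOf("x-1") + ")";
        System.out.println(nn + "&" + jj); // prints box'(e0,x0)&black'(e1,x0)
    }
}
```

Because x-1 is the same instance in both calls, both predicates share the argument x0, which is what links the adjective to the noun in the logical form.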
The following diagram should make it clear:
Now let's consider the tree shown in XML-2. Here all the nodes have the same name ("pe"). They can be distinguished by the value of the attribute "synt". This can be achieved by using the following syntax:
element_name@attribute_name="attribute_value"
For example:
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
The entry points will change accordingly. Here's the new transformation rules file:
[Listing: transformation rules file for XML-2, beginning with the #Variables section]
We can also condition on multiple attributes:
<NP,e-1,x-1> -> pe@synt="DT"@lex="a" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
Here, the first child must have an attribute called "synt" with value "DT" and another attribute called "lex" with value "a". Suppose that we wanted the attribute "lex" of the first child to have the value "a" or "the". We can achieve this using the following rule:
<NP,e-1,x-1> -> pe@synt="DT"@lex="a,the" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
In general, we can use a comma-separated list for the attribute value. Now, suppose you wanted to make the determiner optional. You could do this by having the following 2 rules:
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
Suppose you also want the adjective to be optional. You would then expand the rule base further:
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="JJ":<OUT,e-2,x-1> pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="DT" pe@synt="NN":<OUT,e-1,x-1>
<NP,e-1,x-1> -> pe@synt="NN":<OUT,e-1,x-1>
Imagine having a rule for each expansion pattern of the English language. Even for decent coverage we would need hundreds of such transformation rules, which would quickly become unmanageable. To simplify this, you can use Regular Expressions to specify the transformation rules. Here's how you could compactly represent the above 4 rules:
<NP,e-1,x-1> -> pe@synt="DT"? pe@synt="JJ":<OUT,e-2,x-1>? pe@synt="NN":<OUT,e-1,x-1>
We have used the ? quantifier in the above example. Other Regular Expression features, such as the * quantifier used in the rules further below, can be used as well (familiarity with RE is assumed).
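To see why a single quantified pattern covers all 4 expansions, the following Java sketch matches sequences of child categories against a simplified pattern such as "DT? JJ? NN" using java.util.regex. The class ExpansionPattern is hypothetical; it ignores rule signatures and attribute conditions, and it is not how LFToolkit itself is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: match a node's child categories against a quantified expansion pattern. */
class ExpansionPattern {
    private final Pattern regex;

    /** pattern is a space-separated list of symbols, each optionally followed by ?, * or +. */
    ExpansionPattern(String pattern) {
        StringBuilder sb = new StringBuilder();
        for (String symbol : pattern.trim().split("\\s+")) {
            String name = symbol;
            String quantifier = "";
            char last = symbol.charAt(symbol.length() - 1);
            if (last == '?' || last == '*' || last == '+') {
                quantifier = String.valueOf(last);
                name = symbol.substring(0, symbol.length() - 1);
            }
            // Each symbol becomes a group that consumes "NAME " from the child string.
            sb.append("(?:").append(Pattern.quote(name)).append(" )").append(quantifier);
        }
        regex = Pattern.compile(sb.toString());
    }

    /** Returns true if the whole child sequence is covered by the pattern. */
    boolean matches(List<String> childLabels) {
        StringBuilder children = new StringBuilder();
        for (String label : childLabels) {
            children.append(label).append(' ');
        }
        return regex.matcher(children).matches();
    }

    public static void main(String[] args) {
        ExpansionPattern np = new ExpansionPattern("DT? JJ? NN");
        System.out.println(np.matches(Arrays.asList("DT", "JJ", "NN"))); // true
        System.out.println(np.matches(Arrays.asList("JJ", "NN")));       // true
        System.out.println(np.matches(Arrays.asList("DT", "NN")));       // true
        System.out.println(np.matches(Arrays.asList("NN")));             // true
        System.out.println(np.matches(Arrays.asList("DT", "JJ")));       // false
    }
}
```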
Variables within Kleene Closure
In the rule
<NP,e-1,x-1,any-1> -> pe@synt="S-DET"? pe@synt="S-ADJ":<ADJ,e-n,x-1>* pe@synt="S-NOUN":<NOUN,e-1,x-1>
the variable 'e-n' is a special kind of variable: each instance of 'S-ADJ' matched by the * quantifier is initialized with a different value for 'e-n'.
Accumulating variables
Consider the following Noun Phrase:
NP: the green apples
Let's say we'd like to have the following logical form representation for it:
apple'(e0,x0) & green'(e1,x0) & dset(s0,x0,e0&e1)
The transformation rule to achieve this is:
NP<e-1,s-1> -> DT? JJ:<OUT,e-n1,x-1>* NN:<PNN,e-n2,x-n1,x-1>* NNS:<OUTPlu,e-1,x-1,s-1,av<e-n1&e-n2>>
Here we make use of the special operator av<list_of_variables>. The variables in the list are separated by ampersands (&). The operator makes sure that uninitialized variables are not printed in the output. In the above example the variable e-n2 is not initialized (since the NP does not contain a prenominal noun) and, hence, it is not used in the output.
LF Fragments
This feature lets you process an input tree even if some of its nodes are not covered by the transformation rules. The resulting logical form representation is only partially correct, in that the variables are not correctly linked.
Consider the following transformation rules file:
[Listing: transformation rules file with entry points for NN and JJ, beginning with the #Variables section]
Let's say the following tree is processed using the above transformation rules file:
This is how it's processed:
There is no transformation rule that could be used to process the NP. The toolkit then checks whether any entry points defined in the #EntryPoints section can be used to process the children of the NP. It locates entry points for NN and JJ and produces the output accordingly.
Using LFToolkit from command line
We have included a reference implementation in the distribution. You can invoke it from the command line as follows:
% java -jar LFToolkit.jar transformation_rules_file input_xml_file output_file [attr1 attr2...]
Here the attribute list is optional. The listed attributes are included in the expansion strings printed for the nodes that could not be processed.
Sample usage:
% java -jar LFToolkit.jar sample.trf sample.xml output.txt synt
The XML file may contain any number of parse trees. The program processes each parse tree independently. Note that while the program doesn't restrict the XML schema in any way, it expects the parse trees to be the children of the root node. The following example contains 2 parse trees:
[Listing XML-3: <treebank system="contex" version="2.9.153" language="EN" date="Tue Nov 23 11:22:57 PST 2004"> document containing 2 parse trees]
Samples:
Set 1
% java -jar LFToolkit.jar sample.trf sample.xml sample.txt
#1
black'(e1,x0) & box'(e0,x0)
----
#2
bank'(e0,x0)
----
% java -jar LFToolkit.jar sample2.trf sample2.xml sample2.txt
#1
Unable to process the input.
The following node(s) could not be processed:
NP -> DT JJ NN
DT -> #text
JJ -> #text
NN -> #text
DT -> #text
#text ->
#text ->
Partial output:
----
#2
bank'(e0,x0)
----
% java -jar LFToolkit.jar sample3.trf sample3.xml sample3.txt lex
#1
Unable to process the input.
The following node(s) could not be processed:
NP -> DT@lex="a" JJ@lex="black" NN@lex="box"
DT@lex="a" -> #text
JJ@lex="black" -> #text
NN@lex="box" -> #text
DT@lex="a" -> #text
#text ->
#text ->
Partial output:
----
#2
bank'(e0,x0)
----
% java -jar LFToolkit.jar sample4.trf sample4.xml sample4.txt lex
#1
Unable to process the input.
The following node(s) could not be processed:
NP -> DT@lex="a" JJ@lex="black" NN@lex="box"
DT@lex="a" -> #text
DT@lex="a" -> #text
#text ->
#text ->
Partial output:
black'(e1,x1) & box'(e2,x2)
----
#2
bank'(e0,x0)
----
In order to use the toolkit from within your Java program you need to
create an object of the LFToolkit class. You need to be familiar
with the Xerces2 Java Parser. Here's a description of the functions of this
class accessible to your program:
| LFToolkit: Constructor | Description |
|---|---|
| LFToolkit() | Initializes a newly created LFToolkit object. |
| LFToolkit(String ruleFilename) | Initializes a newly created LFToolkit object and loads the transformation rules file named by ruleFilename. |
| LFToolkit: Methods | Description |
|---|---|
| load(String ruleFilename) | Loads the transformation rules file named by ruleFilename. |
| LOutput process(Node parseTree) | Processes the Node (refer to the Xerces2 documentation at http://xml.apache.org/xerces2-j/javadocs/api/index.html) using the transformation rules (specified using the load() function) and returns an object of the LOutput class (described below). Here, parseTree is the XML representation of the parse tree to be processed and must point to the root of the parse tree. There must be an entry point for this node in the entry points section of the transformation rules file. |
Here's a list of the functions of the LOutput class:
| LOutput: Methods | Description |
|---|---|
| boolean getStatus() | Returns true if the processing was successful, false otherwise. |
| String getLString() | Returns the logical form. |
| HashMap getVars() | Returns a HashMap of the head variables of the logical form. The keys of this HashMap are the variable names used in the transformation rules file, and the values are the values these variables are assigned in the logical representation. |
| Node getTrace() | This function can be used for debugging or for analyzing the working of the toolkit. The Node object returned by this function is a copy of the Node object passed to the process() function of the LFToolkit class, with some additional information added by the program. As the program processes the nodes of the XML tree, it adds an attribute named "ruleid" to each one. The value of this attribute is a number that identifies the transformation rule used to process that particular node. This number is zero-based and indicates the position of the transformation rule as listed in the transformation rules file. If a node doesn't possess this attribute, the node wasn't processed by the program, most probably because no matching rule could be located. |
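As a rough sketch of the XML side of such a program, the following Java code parses a treebank document with the standard JAXP DOM API (which Xerces implements) and collects the element children of the root, since each of them is expected to be one parse tree. The class ParseTreeLoader and the sample document are assumptions made for illustration. The LFToolkit calls appear only in comments, because the package name of the classes is not given here; they use only the constructor and methods listed in the tables above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch: extract the parse trees (children of the root) from a treebank document. */
class ParseTreeLoader {

    /** Returns the element children of the root node, one per parse tree. */
    static List<Node> parseTrees(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<Node> trees = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes and comments between the trees.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    trees.add(child);
                }
            }
            return trees;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input with 2 parse trees; the attribute layout is an assumption.
        String xml = "<treebank>"
                + "<NP><DT lex=\"a\">a</DT><JJ lex=\"black\">black</JJ><NN lex=\"box\">box</NN></NP>"
                + "<NP><NN lex=\"bank\">bank</NN></NP>"
                + "</treebank>";
        for (Node tree : parseTrees(xml)) {
            System.out.println("Parse tree rooted at: " + tree.getNodeName());
            // With LFToolkit.jar on the classpath, each tree could then be processed:
            //   LFToolkit toolkit = new LFToolkit("sample.trf");
            //   LOutput output = toolkit.process(tree);
            //   if (output.getStatus()) System.out.println(output.getLString());
        }
    }
}
```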
If you have questions or comments, email hobbs at isi.edu