Resources

Initial Data

web1913.a
web1913.b
web1913.c
web1913.d
web1913.e
web1913.f
web1913.g
web1913.h
web1913.i
web1913.j
web1913.k
web1913.l
web1913.m
web1913.n
web1913.new
web1913.o
web1913.pq
web1913.r
web1913.s
web1913.t
web1913.u
web1913.v
web1913.w
web1913.xyz

These files are faithful renditions of the entries in Webster's 2nd International Dictionary (no longer in copyright). In constructing the electronic version of the dictionary, the various components of entries have been tagged in an XML-like structure: each field opens with a tag name between angle brackets and closes with a matching tag containing a forward slash. The basic structure of an entry is as it would be in a normal dictionary: headword, pronunciation, part-of-speech, etymology, field of knowledge, sense number, definition, illustrative quotations, notes, and synonyms. Approximately 120 tags are used to identify fields; these are augmented with various formatting and typescript codes, which capture the layout of the paper dictionary and also facilitate direct viewing in HTML files.

Preprocessed Definitions

a.defs
b.defs
c.defs
d.defs
e.defs
f.defs
g.defs
h.defs
i.defs
j.defs
k.defs
l.defs
m.defs
n.defs
new.defs
o.defs
pq.defs
r.defs
s.defs
t.defs
u.defs
v.defs
w.defs
xyz.defs

The initial data (web1913.* files) have been preprocessed to extract a subset of the information and to put it into a format suitable as input to further processes (Brill tagging, chunking, and parsing, as described below). The format of entries in these files is as follows:

<entry>
{<hw>words</hw>
  <stress>words</stress>?}+
  <plural>words</plural>?
  <colref>word</colref>?
{<asp>word</asp>*
  <sense>
    <pos>word</pos>
    <uniqhw>word</uniqhw>?
    <uniqsn>word</uniqsn>?
    <sn>word</sn>
    <subsense>?
      <subsn>word</subsn>?
      <uniqhw>word</uniqhw>?
      <fld>words</fld>?
      <uniqdef>words</uniqdef>?
      <def>words</def>+
      <as>words</as>*
      <note>words</note>*
      <quote>words</quote>*
      <comment>words</comment>*
    </subsense>?
  </sense>}+
</entry>
Legend
? zero or one
* zero or more
+ one or more
{} grouping

as example usage
asp alternate spelling
comment remark
def definition
fld sense field
hw headword
note note
plural pluralization
pos part of speech
quote example quotation
stress headword with stress markings
sn sense number
subsn subsense number
uniqhw unique headword ID
uniqdef unique definition number
uniqsn unique sense ID
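
As an illustration of how the entry format above might be read, here is a minimal Python sketch. It assumes that fields never nest except for the <subsense> wrapper, which it simply flattens, and that headwords precede the first <sense>. The sample entry and the function names are invented for the example and are not taken from the .defs files.

```python
import re

# Matches one tagged field such as <def>words</def>; the closing tag must
# repeat the opening tag name.
FIELD = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_sense(sense_text):
    """Collect the fields of one <sense> block into a dict of lists."""
    # Simplification: drop <subsense> wrappers so their fields are read
    # as if they belonged directly to the sense.
    flat = re.sub(r"</?subsense>", "", sense_text)
    fields = {}
    for tag, value in FIELD.findall(flat):
        fields.setdefault(tag, []).append(value.strip())
    return fields


def parse_entry(entry_text):
    """Return the headwords and the parsed senses of one <entry> block."""
    senses = re.findall(r"<sense>(.*?)</sense>", entry_text, re.DOTALL)
    head = entry_text.split("<sense>", 1)[0]  # text before the first sense
    return {
        "hw": re.findall(r"<hw>(.*?)</hw>", head),
        "senses": [parse_sense(s) for s in senses],
    }


# Hypothetical entry, written by hand to follow the format description.
SAMPLE = """<entry>
<hw>abandon</hw>
<sense>
<pos>v-t</pos>
<sn>1</sn>
<subsense>
<def>to give up absolutely</def>
<def>to forsake entirely</def>
</subsense>
</sense>
</entry>"""

entry = parse_entry(SAMPLE)
```

Storing every field as a list keeps repeatable fields (def, as, note, quote, comment) and single-valued ones (pos, sn) in one uniform shape.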

Tagged by Brill's Tagger

a.tag
b.tag
c.tag
d.tag
e.tag
f.tag
g.tag
h.tag
i.tag
j.tag
k.tag
l.tag
m.tag
n.tag
new.tag
o.tag
pq.tag
r.tag
s.tag
t.tag
u.tag
v.tag
w.tag
xyz.tag

Brill's tagger takes sentences as input, one sentence per line, and assigns one part-of-speech tag to each word (see the tagset below; Brill's tagset does not include MOD).
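
A tagged line can be split back into word and tag pairs with a few lines of Python. This sketch assumes the .tag files hold whitespace-separated WORD/TAG tokens (the BRILLWORD form used in the pc grammar). Because words may themselves contain solidi, it splits on the last slash only. The function name and the sample line are illustrative.

```python
def parse_tagged_line(line):
    """Split one tagged sentence into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words such as "and/or" stay whole.
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            # No slash at all: keep the token and mark the tag as unknown.
            pairs.append((token, None))
    return pairs


pairs = parse_tagged_line("to/TO give/VB up/RP and/or/CC")
```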

Syntactic Phrase Identification

a.pc
b.pc
c.pc
d.pc
e.pc
f.pc
g.pc
h.pc
i.pc
j.pc
k.pc
l.pc
m.pc
n.pc
new.pc
o.pc
pq.pc
r.pc
s.pc
t.pc
u.pc
v.pc
w.pc
xyz.pc

The PC format contains the definitions in chunked form. We use the term "chunking" to mean breaking text up into phrases, somewhat more crudely than a parser would. The "pc" ("post-chunker") format is shown below. A boldface symbol indicates an actual symbol used in this work. Words may contain punctuation, numbers, and solidi (division slashes).

{HEADWORDLINE DEFINITION+}+
HEADWORDLINE := <hw>({(WORD{ WORD}*)}+)</hw> <pos>POS</pos> <sn>NUMBER</sn>
POS := adj | adj-compar | adj-fem | adj-masc | adj-superl | adv | adv-interrog | adv-prep | affix-combining | affix-prefix | affix-suffix | art-def | collocate | conj | interj || n-adj | n-collect | n-fem | n-masc | n-pl | n-pl-pres | n-sing | n-v | part | part-adj | part-past | part-past-fem | part-past-obs | part-pres | prep | pron | pron-adj | pron-indef | pron-pl | pron-poss | undefined || v-2-sing-pres | v-3-sing | v-3-sing-pres | v-3-sing-subj | v-aux | v-i | v-imper | v-imper-obs | v-imper-sing | v-imper-strong | v-imper-strong-obs | v-impers-pres | v-inf | v-t
DEFINITION := [ SENT {BRILLWORD|PHRASE}+ ]
PHRASE := [ TYPE {BRILLWORD|PHRASE}+ ]
TYPE := NP | PP | VC | VP | REL | ?
BRILLWORD := WORD/TAG
TAG := CC | CD | DT | EX | FW | IN | JJ | JJR | JJS | LS | MD | MOD | NN | NNP | NNPS | NNS | PDT | POS | PRP | PRP$ | RB | RBR | RBS | RP | SYM | TO | UH | VB | VBD | VBG | VBN | VBP | VBZ | WDT | WP | WP$ | WRB |||||||||| `` || ''
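
The bracketed DEFINITION and PHRASE rules lend themselves to a small recursive reader. The Python sketch below assumes that every bracket and every WORD/TAG token is separated by whitespace, which may not hold for the actual .pc files. It returns phrases as (type, [children]) tuples and words as (word, tag) tuples. The function name and sample string are made up for the example.

```python
def parse_chunks(text):
    """Parse '[ SENT [ NP a/DT dog/NN ] ]' into nested tuples."""
    tokens = text.split()
    pos = 0

    def parse_phrase():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        label = tokens[pos]  # SENT, NP, PP, VC, VP, REL or ?
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_phrase())
            else:
                # BRILLWORD: split on the last slash, as words may contain solidi.
                word, _, tag = tokens[pos].rpartition("/")
                children.append((word, tag))
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume the closing "]"
        return (label, children)

    tree = parse_phrase()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree


tree = parse_chunks("[ SENT [ VP to/TO give/VB ] [ NP a/DT dog/NN ] ]")
```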

SPL-like format

a.spl1
b.spl1
c.spl1
d.spl1
e.spl1
f.spl1
g.spl1
h.spl1
i.spl1
j.spl1
k.spl1
l.spl1
m.spl1
n.spl1
new.spl1
o.spl1
pq.spl1
r.spl1
s.spl1
t.spl1
u.spl1
v.spl1
w.spl1
xyz.spl1

SPL (Sentence Planning Language) is a format from which English sentences can be generated by computer. The SPL1 format is an initial attempt to convert the chunked data above into SPL.

W2 DIMAP Dictionaries

W2-DIMAP.A.ZIP
W2-DIMAP.B.ZIP
W2-DIMAP.C.ZIP
W2-DIMAP.D.ZIP
W2-DIMAP.E.ZIP
W2-DIMAP.F.ZIP
W2-DIMAP.G.ZIP
W2-DIMAP.H.ZIP
W2-DIMAP.I.ZIP
W2-DIMAP.J.ZIP
W2-DIMAP.K.ZIP
W2-DIMAP.L.ZIP
W2-DIMAP.M.ZIP
W2-DIMAP.N.ZIP
W2-DIMAP.NEW.ZIP
W2-DIMAP.O.ZIP
W2-DIMAP.PQ.ZIP
W2-DIMAP.R.ZIP
W2-DIMAP.S.ZIP
W2-DIMAP.T.ZIP
W2-DIMAP.U.ZIP
W2-DIMAP.V.ZIP
W2-DIMAP.W.ZIP
W2-DIMAP.XYZ.ZIP

The preprocessed definitions have been converted into DIMAP dictionaries suitable for parsing with DIMAP. A demo version of DIMAP, available from CL Research, will parse the definitions, identify hypernyms (where possible), and compare each identified hypernym with the WordNet hierarchy. As the demo version is improved to identify more semantic relations, these files will be used.

Sample SENSEVAL parses

k.par
n1.par
o.par
u.par
v.par
xyz.par

The preprocessed definitions have been converted into DIMAP dictionaries and each definition has been parsed using the integrated Proximity Technology Inc. parser. A preprocessed definition consists of:

  1. a definition phrase (usually not a complete sentence) terminated by a period, frequently containing semicolon-delimited phrases, and
  2. (optionally) one or more sentences providing a gloss on the definition (frequently encyclopedic information).
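
The two-part structure just described can be approximated with simple string splitting. This Python sketch is a rough illustration only: it treats the first period followed by a space as the end of the definition phrase, so abbreviations inside a definition would mislead it, and it is not how DIMAP itself segments definitions. The function name and sample text are invented.

```python
def split_definition(text):
    """Return (semicolon-delimited phrases, gloss) for one definition."""
    text = text.strip()
    # Naive assumption: the first ". " ends the definition phrase.
    head, sep, gloss = text.partition(". ")
    if not sep:
        head = head.rstrip(".")  # no gloss; drop the final period
    phrases = [p.strip() for p in head.split(";") if p.strip()]
    return phrases, gloss.strip()


phrases, gloss = split_definition(
    "A domesticated quadruped; a hound. Dogs are kept as pets."
)
```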

Each semicolon-delimited phrase has its own parse. The "parsed definitions" files are ASCII text, with each parse result as a list containing the following sublists:

hw headword
hw-pos part of speech
sn sense number
fld subject field (if any)
def definition
parse parse
semrels semantic relations

The parse is also a list, in the form of a tree, with nonterminals of the grammar indicated by capital letters and terminals indicated by lowercase letters. Nonterminals may be annotated. Terminal nodes identify the part of speech and the root form of the word, along with its parse-dictionary definition, which gives further properties used in parsing.

Semantic relations are presented in the form of triples, with the format (word relation word). Relations are determined by analyzing the pattern of the parse tree. At present, these include (where determinable from the parse tree):

Hypernym (hyp)
For noun definitions, the head noun(s) of the first NP. For verb definitions, the first verb in the definition (or several, where the first verb is in a conjunctive phrase).
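
The hypernym rule can be sketched in Python as follows. This is an approximation under stated assumptions, not the DIMAP procedure: it works on a chunk tree of (type, [children]) and (word, tag) tuples rather than on the Proximity parse, takes the last noun directly inside the first NP as the head noun, and ignores conjoined heads. All names and sample trees are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def is_phrase(node):
    """Phrases are (type, [children]); words are (word, tag)."""
    return isinstance(node[1], list)


def first_phrase(node, wanted):
    """Depth-first search for the first phrase of the wanted type."""
    if node[0] == wanted:
        return node
    for child in node[1]:
        if is_phrase(child):
            found = first_phrase(child, wanted)
            if found is not None:
                return found
    return None


def words(node):
    """Yield every (word, tag) leaf below a phrase, left to right."""
    for child in node[1]:
        if is_phrase(child):
            yield from words(child)
        else:
            yield child


def hypernym_triples(headword, pos, tree):
    """Return (headword, 'hyp', word) triples for one definition."""
    heads = []
    if pos.startswith("n"):
        np = first_phrase(tree, "NP")
        if np is not None:
            # Assumption: the head is the last noun directly inside the NP.
            nouns = [c[0] for c in np[1] if not is_phrase(c) and c[1] in NOUN_TAGS]
            heads = nouns[-1:]
    elif pos.startswith("v"):
        verbs = [w for w, t in words(tree) if t in VERB_TAGS]
        heads = verbs[:1]  # first verb only; conjunctions are not handled
    return [(headword, "hyp", h) for h in heads]


noun_tree = ("SENT", [
    ("NP", [("a", "DT"), ("domesticated", "JJ"), ("quadruped", "NN")]),
    ("PP", [("of", "IN"), ("NP", [("the", "DT"), ("genus", "NN")])]),
])
verb_tree = ("SENT", [("VP", [("to", "TO"), ("give", "VB"), ("up", "RP")])])

noun_triples = hypernym_triples("dog", "n", noun_tree)
verb_triples = hypernym_triples("abandon", "v-t", verb_tree)
```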