web1913.a web1913.b web1913.c web1913.d web1913.e web1913.f web1913.g web1913.h web1913.i web1913.j web1913.k web1913.l web1913.m web1913.n web1913.new web1913.o web1913.pq web1913.r web1913.s web1913.t web1913.u web1913.v web1913.w web1913.xyz
These files are faithful renditions of the entries in Webster's 2nd International Dictionary (no longer in copyright). In constructing the electronic version of the dictionary, the various components of entries have been tagged in a structure similar to XML tags; that is, each field opens with a tagname between angle brackets, and closes with a similar tag containing the forward slash. The basic structure of an entry is as it would be in a normal dictionary: headword, pronunciation, part-of-speech, etymology, field of knowledge, sense number, definition, illustrative quotations, notes, and synonyms. Approximately 120 tags are used to identify fields; these are augmented with various formatting and typescript codes, which capture the layout of the paper dictionary and also facilitate direct viewing in HTML files.
a.defs b.defs c.defs d.defs e.defs f.defs g.defs h.defs i.defs j.defs k.defs l.defs m.defs n.defs new.defs o.defs pq.defs r.defs s.defs t.defs u.defs v.defs w.defs xyz.defs
The initial data (web1913.* files) have been preprocessed to extract a subset of the information and to put it into a format suitable as input to further processes (Brill tagging, chunking, and parsing, as described below). The format of entries in these files is as follows:
<entry>
{<hw>words</hw>
<stress>words</stress>?}+
<plural>words</plural>?
<colref>word</colref>?
{<asp>word</asp>*
<sense>
<pos>word</pos>
<uniqhw>word</uniqhw>?
<uniqsn>word</uniqsn>?
<sn>word</sn>
<subsense>?
<subsn>word</subsn>?
<uniqhw>word</uniqhw>?
<fld>words</fld>?
<uniqdef>words</uniqdef>?
<def>words</def>+
<as>words</as>*
<note>words</note>*
<quote>words</quote>*
<comment>words</comment>*
</subsense>?
</sense>}+
</entry>
Legend
- ?: zero or one
- *: zero or more
- +: one or more
- {}: grouping
- as: example usage
- asp: alternate spelling
- comment: remark
- def: definition
- fld: sense field
- hw: headword
- note: note
- plural: pluralization
- pos: part of speech
- quote: example quotation
- stress: headword with stress markings
- sn: sense number
- subsn: subsense number
- uniqhw: unique headword ID
- uniqdef: unique definition number
- uniqsn: unique sense ID
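The entry layout above can be read with a few lines of Python. This is only a minimal sketch: it assumes that leaf fields such as <hw> and <def> hold plain text with no nested tags, and that each <sense> block can be split off on its opening tag. The function name parse_entry and the sample entry are hypothetical and are not taken from the actual .defs files.

```python
import re

# Matches leaf fields such as <hw>word</hw>. Container tags (<entry>, <sense>,
# <subsense>) hold other tags, so they never match this pattern.
FIELD_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def parse_entry(entry_text):
    """Return (headwords, senses) for one <entry>...</entry> block.

    Each sense is a dict mapping a tag name to the list of its values,
    because fields such as <def> may repeat within a sense.
    """
    head, *sense_blocks = entry_text.split("<sense>")
    headwords = [value.strip()
                 for tag, value in FIELD_RE.findall(head) if tag == "hw"]
    senses = []
    for block in sense_blocks:
        fields = {}
        for tag, value in FIELD_RE.findall(block):
            fields.setdefault(tag, []).append(value.strip())
        senses.append(fields)
    return headwords, senses


# Hypothetical entry, invented for illustration only.
SAMPLE = """<entry>
<hw>exampleword</hw>
<sense>
<pos>n</pos>
<sn>1</sn>
<def>A made-up definition; used for illustration.</def>
</sense>
<sense>
<pos>n</pos>
<sn>2</sn>
<def>A second made-up definition.</def>
</sense>
</entry>"""

headwords, senses = parse_entry(SAMPLE)
```

Subsense fields are folded into their enclosing sense here; a fuller reader would split on <subsense> in the same way.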
a.tag b.tag c.tag d.tag e.tag f.tag g.tag h.tag i.tag j.tag k.tag l.tag m.tag n.tag new.tag o.tag pq.tag r.tag s.tag t.tag u.tag v.tag w.tag xyz.tag
Brill's tagger takes input in the form of sentences, one sentence per line, and assigns one part-of-speech tag to each word (see the tagset below; Brill's tagset does not include MOD).
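A definition has to be flattened to a single line before it is handed to the tagger. The sketch below shows one way this might be done. It assumes the tagger expects punctuation marks as separate, space-delimited tokens; the helper name to_tagger_line and the small punctuation set are hypothetical and are not the preprocessing actually used for these files.

```python
import re


def to_tagger_line(definition):
    """Return the definition as one whitespace-tokenized line."""
    # Put spaces around common punctuation (and a sentence-final period)
    # so that each mark becomes its own token.
    spaced = re.sub(r"([;,:()\"]|\.$)", r" \1 ", definition.strip())
    # Collapse runs of whitespace, including embedded newlines.
    return " ".join(spaced.split())


line = to_tagger_line("  A made-up definition;  used for\nillustration. ")
```

Only the final period is split off, so abbreviations inside the definition keep their periods.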
a.pc b.pc c.pc d.pc e.pc f.pc g.pc h.pc i.pc j.pc k.pc l.pc m.pc n.pc new.pc o.pc pq.pc r.pc s.pc t.pc u.pc v.pc w.pc xyz.pc
The PC format contains chunked information about the definitions. We use the term chunking to mean breaking up into phrases somewhat more crudely than a parser. The "pc" ("post-chunker") format is shown below. A boldface symbol indicates an actual symbol used in this work. Words may contain punctuation, numbers, and solidi (division slashes).
{HEADWORDLINE DEFINITION+}+
HEADWORDLINE := <hw>({(WORD{ WORD}*)}+)</hw> <pos>POS</pos> <sn>NUMBER</sn>
POS := adj | adj-compar | adj-fem | adj-masc | adj-superl | adv | adv-interrog | adv-prep | affix-combining | affix-prefix | affix-suffix | art-def | collocate | conj | interj | n | n-adj | n-collect | n-fem | n-masc | n-pl | n-pl-pres | n-sing | n-v | part | part-adj | part-past | part-past-fem | part-past-obs | part-pres | prep | pron | pron-adj | pron-indef | pron-pl | pron-poss | undefined | v | v-2-sing-pres | v-3-sing | v-3-sing-pres | v-3-sing-subj | v-aux | v-i | v-imper | v-imper-obs | v-imper-sing | v-imper-strong | v-imper-strong-obs | v-impers-pres | v-inf | v-t
DEFINITION := [ SENT {BRILLWORD|PHRASE}+ ]
PHRASE := [ TYPE {BRILLWORD|PHRASE}+ ]
TYPE := NP | PP | VC | VP | REL | ?
BRILLWORD := WORD/TAG
TAG := CC | CD | DT | EX | FW | IN | JJ | JJR | JJS | LS | MD | MOD | NN | NNP | NNPS | NNS | PDT | POS | PRP | PRP$ | RB | RBR | RBS | RP | SYM | TO | UH | VB | VBD | VBG | VBN | VBP | VBZ | WDT | WP | WP$ | WRB | # | $ | . | , | : | ( | ) | " | ` | `` | ' | ''
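The DEFINITION and PHRASE rules above describe a nested bracket structure that a small recursive reader can turn into a tree. The following is a minimal sketch, assuming that brackets, phrase types and WORD/TAG items are separated by whitespace. Because words may themselves contain solidi, each item is split at its last slash. The name parse_pc and the sample string are hypothetical.

```python
def parse_pc(text):
    """Parse one bracketed DEFINITION into [TYPE, children].

    Each child is either a (word, tag) tuple or a nested [TYPE, children].
    """
    tokens = text.split()
    pos = 0

    def parse_phrase():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '[' at token %d" % pos)
        pos += 1
        if pos >= len(tokens):
            raise ValueError("missing phrase type")
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_phrase())
            else:
                # Split at the LAST slash: the word itself may contain "/".
                word, _, tag = tokens[pos].rpartition("/")
                children.append((word, tag))
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing closing ']'")
        pos += 1  # consume "]"
        return [label, children]

    tree = parse_phrase()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing ']'")
    return tree


# Hypothetical chunked definition, invented for illustration only.
SAMPLE = ("[ SENT [ NP a/DT made-up/JJ thing/NN ] "
          "[ PP for/IN [ NP illustration/NN ] ] ./. ]")
tree = parse_pc(SAMPLE)
```

A punctuation token such as ./. comes back as the pair (".", "."), which matches the punctuation tags listed under TAG.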
a.spl1 b.spl1 c.spl1 d.spl1 e.spl1 f.spl1 g.spl1 h.spl1 i.spl1 j.spl1 k.spl1 l.spl1 m.spl1 n.spl1 new.spl1 o.spl1 pq.spl1 r.spl1 s.spl1 t.spl1 u.spl1 v.spl1 w.spl1 xyz.spl1
SPL (Sentence Planning Language) is a format from which English sentences can be generated by computer. The SPL1 format is an initial attempt to convert the chunked data above into SPL.
W2-DIMAP.A.ZIP W2-DIMAP.B.ZIP W2-DIMAP.C.ZIP W2-DIMAP.D.ZIP W2-DIMAP.E.ZIP W2-DIMAP.F.ZIP W2-DIMAP.G.ZIP W2-DIMAP.H.ZIP W2-DIMAP.I.ZIP W2-DIMAP.J.ZIP W2-DIMAP.K.ZIP W2-DIMAP.L.ZIP W2-DIMAP.M.ZIP W2-DIMAP.N.ZIP W2-DIMAP.NEW.ZIP W2-DIMAP.O.ZIP W2-DIMAP.PQ.ZIP W2-DIMAP.R.ZIP W2-DIMAP.S.ZIP W2-DIMAP.T.ZIP W2-DIMAP.U.ZIP W2-DIMAP.V.ZIP W2-DIMAP.W.ZIP W2-DIMAP.XYZ.ZIP
The preprocessed definitions have been converted into DIMAP dictionaries suitable for parsing with DIMAP. A demo version of DIMAP, available from CL Research, will parse the definitions, identify hypernyms (where possible), and compare each identified hypernym with the WordNet hierarchy. These files will be used as the demo version is improved to identify more semantic relations.
k.par n1.par o.par u.par v.par xyz.par
The preprocessed definitions have been converted into DIMAP dictionaries and each definition has been parsed using the integrated Proximity Technology Inc. parser. A preprocessed definition consists of:
- a definition phrase (usually not a complete sentence) terminated by a period, frequently containing semicolon-delimited phrases, and
- (optionally) one or more sentences providing a gloss on the definition (frequently encyclopedic information).
Each semicolon-delimited phrase has its own parse. The "parsed definitions" files are ASCII text, with each parse result as a list containing the following sublists:
- hw: headword
- hw-pos: part of speech
- sn: sense number
- fld: subject field (if any)
- def: definition
- parse: parse
- semrels: semantic relations
The parse is also a list, in the form of a tree, with nonterminals of the grammar indicated by capital letters and terminals indicated by lowercase letters. Nonterminals may be annotated. Terminal nodes identify the part of speech and the root form of the word, along with its parse-dictionary definition, which gives further properties used in parsing.
Semantic relations are presented in the form of triples, with the format (word relation word). Relations are determined by analyzing the pattern of the parse tree. At present, these include (where determinable from the parse tree):
- Hypernym (hyp): for noun definitions, the head noun(s) of the first NP; for verb definitions, the first verb in the definition (or several, where the first verb is in a conjunctive phrase).
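A triple in the (word relation word) form can be read as follows. This is a rough sketch only: it assumes each of the three parts is a single whitespace-free token, which may not hold for multi-word headwords. The names SemRel and parse_triple and the sample triple are hypothetical, and the fields are called left and right because the text does not say which side holds the headword.

```python
from typing import NamedTuple


class SemRel(NamedTuple):
    left: str
    relation: str
    right: str


def parse_triple(text):
    """Parse a string such as '(word relation word)' into a SemRel."""
    stripped = text.strip()
    if not (stripped.startswith("(") and stripped.endswith(")")):
        raise ValueError("triple must be enclosed in parentheses")
    parts = stripped[1:-1].split()
    if len(parts) != 3:
        raise ValueError("expected exactly three parts, got %d" % len(parts))
    return SemRel(*parts)


# Hypothetical triple, invented for illustration only.
rel = parse_triple("(exampleword hyp thing)")
```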