Page Model
----------
ONE_HTML_PAGE ::= List( DESCRIPTION )
DESCRIPTION ::= < Score, Name, Email, FirstEntered >
NOTE:
- each HTML page is seen as a list of 4-tuples, and all four
items are always present and always in the same order.
In database terms, an HTML page represents a table with four
columns (i.e., Score, Name, Email, and FirstEntered).
Items to be extracted: Score, Name, Email, FirstEntered.
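The page model above can be sketched in Python as a list of
fixed-order 4-tuples. This is only an illustration of the grammar,
not part of the original task definition: the names Description,
OneHtmlPage and as_table are hypothetical, and the field values in
the example row are made-up placeholders.

```python
from typing import List, NamedTuple


class Description(NamedTuple):
    """One DESCRIPTION tuple: all four items present, always in this order."""
    score: str
    name: str
    email: str
    first_entered: str


# ONE_HTML_PAGE ::= List( DESCRIPTION )
OneHtmlPage = List[Description]


def as_table(page: OneHtmlPage) -> List[List[str]]:
    """View a page as a four-column table with a header row."""
    header = ["Score", "Name", "Email", "FirstEntered"]
    return [header] + [list(row) for row in page]


# Placeholder values, invented purely for illustration.
example_page: OneHtmlPage = [
    Description("100", "Jane Doe", "jane@example.org", "1 Jan 1996"),
]
```

Calling as_table(example_page) gives a header row followed by one
four-column row per tuple, which matches the "table with four
columns" reading of a page.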
IMPORTANT: the original wrapping by Kushmerick, which was
replicated by Muslea for the sake of comparison, requires two
important clarifications:
- the email address appears twice in each tuple (i.e., as
an HTML anchor for both Name and Email). The
instance of the email address that must be extracted is the one
within the anchor of the Name field.
- some pages include an Organization field, which is
ignored in the original wrapper induction task (see
Sample page 2).
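The two clarifications can be illustrated with a small Python
sketch. It is not the wrapper of Kushmerick or Muslea; it only shows
the selection rule. It ASSUMES that both anchors use "mailto:" hrefs
and that the Name anchor comes before the Email anchor within a
tuple (as in the tuple order above). The class MailtoAnchors, the
function name_and_email and the HTML fragment are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class MailtoAnchors(HTMLParser):
    """Collect (address, anchor text) for every <a href="mailto:..."> anchor."""

    def __init__(self) -> None:
        super().__init__()
        self.anchors: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                self._href = href[len("mailto:"):]
                self._text = []

    def handle_data(self, data):
        # Only text inside a mailto anchor is kept; anything else,
        # such as an Organization field, is ignored.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def name_and_email(fragment: str) -> Tuple[str, str]:
    """Return (Name, Email), taking the address from the Name anchor."""
    parser = MailtoAnchors()
    parser.feed(fragment)
    parser.close()
    # Assumption: the first mailto anchor in a tuple is the Name anchor.
    email, name = parser.anchors[0]
    return name, email


# Hypothetical fragment for one tuple, with an Organization field.
fragment = (
    '<A HREF="mailto:jane@example.org">Jane Doe</A> (Example Org) '
    '<A HREF="mailto:jane@example.org">jane@example.org</A>'
)
```

Here name_and_email(fragment) reads the address from the href of
the first anchor (the Name anchor) and skips the Organization text,
which lies outside both anchors.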
SAMPLE EXTRACTION OUTPUT