Nick Kushmerick
last updated 11 May 1998
First of all, here is the "home page" for my wrapper induction work.
This zip file contains a directory for each of the sites: the 30 surveyed sites (see Kushmerick PhD, Figure 7.1), plus OKRA, BIGBOOK, COREL, and ALTAVISTA. Within each subdirectory are several additional subdirectories.
;; correct label for #p"/homes/gws/nicku/wi/sense/a/exp/domains/bigbook/examples/bigbook-1.html"
(((3856 3871) (3919 3932) (3976 3984) (3986 3988) (4042 4045) (4047 4055))
 ((4624 4642) (4690 4707) (4751 4758) (4760 4762) (4816 4819) (4821 4829))
 ((5398 5416) (5464 5477) (5521 5532) (5534 5536) (5590 5593) (5595 5603))
 ((5981 5999) (6047 6059) (6103 6114) (6116 6118) (6172 6175) (6177 6185))
 ((6563 6591) (6639 6656) (6700 6707) (6709 6711) (6765 6768) (6770 6778))
 ((7156 7171) (7219 7232) (7276 7284) (7286 7288) (7342 7345) (7347 7355))
 ((7924 7949) (7997 8007) (8051 8059) (8061 8063) (8117 8120) (8122 8130))
 ((8699 8720) (8768 8780) (8824 8837) (8839 8841) (8895 8898) (8900 8908))
 ((9477 9486) (9534 9547) (9591 9599) (9601 9603) (9657 9660) (9662 9670))
 ((10239 10266) (10314 10327) (10371 10379) (10381 10383) (10437 10440) (10442 10450))
 ((11019 11043) (11091 11103) (11147 11155) (11157 11159) (11213 11216) (11218 11226))
 ((11795 11819) (11867 11889) (11933 11941) (11943 11945) (11999 12002) (12004 12012))
 ((12581 12598) (12646 12659) (12703 12711) (12713 12715) (12769 12772) (12774 12782))
 ((13351 13366) (13414 13425) (13469 13475) (13477 13479) (13533 13536) (13538 13546))
 ((13924 13951) (13999 14014) (14058 14066) (14068 14070) (14124 14127) (14129 14137))
 ((14706 14725) (14773 14787) (14831 14844) (14846 14848) (14902 14905) (14907 14915))
 ((15484 15509) (15557 15570) (15614 15622) (15624 15626) (15680 15683) (15685 15693))
 ((16262 16278) (16326 16341) (16385 16397) (16399 16401) (16455 16458) (16460 16468))
 ((17037 17067) (17115 17127) (17171 17179) (17181 17183) (17237 17240) (17242 17250))
 ((17819 17833) (17881 17893) (17937 17945) (17947 17949) (18003 18006) (18008 18016)))

These files are easiest to read in Lisp. The first line is just a comment. Subsequent lines form a list of the form
    ( tuple* )

where each tuple specification is a list of the form

    ( attribute-1 attribute-2 ... attribute-K )

where each attribute-k is a list of the form
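    ( begin end )

where begin is the index into the page of the first character of the attribute value, and end marks the end of the attribute value. Note that the Perl example below extracts end - begin characters starting at offset begin, so when positions are counted from zero, end is in effect the position just past the last character of the value.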
For example, the sixth (right-most) attribute of the twentieth (last) tuple of bigbook-1.html begins at position 18008 and ends at position 18016. Thus:
perl -e 'print substr(join("",`cat bigbook-1.html`),18008,18016-18008)'
prints out 686-8938, which is indeed the last telephone number.
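The same extraction can be sketched in Python. This mirrors the Perl one-liner, taking end - begin characters starting at offset begin, which in Python is simply the slice `page[begin:end]`. The names `load_page` and `attribute_text` are hypothetical, and the sketch assumes the positions are zero-based offsets into the raw file, so the page is read as bytes and decoded as Latin-1 to keep one character per byte. Because the real page is not reproduced here, the demonstration uses a synthetic stand-in string padded so that the phone number sits at the right offset.

```python
from pathlib import Path


def load_page(path):
    """Read a page so that character offsets equal byte offsets (assumption)."""
    return Path(path).read_bytes().decode("latin-1")


def attribute_text(page, span):
    """Return the attribute value for a (begin, end) pair.

    Equivalent to Perl's substr(page, begin, end - begin).
    """
    begin, end = span
    return page[begin:end]


# Synthetic stand-in for bigbook-1.html: filler up to offset 18008,
# then the value the original example reports, then trailing filler.
page = "." * 18008 + "686-8938" + "." * 10

print(attribute_text(page, (18008, 18016)))
```

With the real file, `attribute_text(load_page("bigbook-1.html"), (18008, 18016))` should correspond to the Perl command above, provided the byte-offset assumption holds.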