WHAT'S IN HERE

In a word, dozens of instances of extraction problems (111 all told).

Each xxx-snap.tar.gz unpacks to contain the following information.
You'll want to untar these into different directories to avoid name
clashes.

Spitprogdb.pl is a Perl database of extraction problems, each
consisting of
  - tag: an identifier for this problem
  - url: the original location of the html being converted
  - site: host for the url
  - file: the cached version of the url (a file of the form
    yyy.CACHE/nnnn.html)
  - prog: the spit program used to convert the file
  - edit: a perl edit command used to preprocess the file (only
    occasionally used)

A perl script used this file to create extracted versions of the
files as follows: all benchmarks for the same 'file' are grouped
together and run together; the resulting .stir is placed in the file
yyy.XML/tag.stir, where 'tag' is a randomly chosen tag for this set.
This is less obscure than it sounds -- usually there is only one
program for a file.

yyy.HTML/nnn.html contains copies of the original HTML files.

yyy.PP/nnn.html contains a "pretty" version of each HTML file; this
has been (sometimes) cleaned up by applying the perl edit command, and
(always) cleaned up by making the implicit close brackets explicit.

yyy.XML/zzz.{stir,MAP.gz} contains extracted versions of these files:
the .stir files are the extracted versions, and the .MAP.gz files map
text entities in the .stir files to nodes in the HTML parse trees.
The .stir files are NOT in precise 1-1 correspondence with the files
in the HTML directory; see the Spitprogdb.pl file for details.

STATISTICS ON THE DATA

I made a few simplifications to the original extraction scripts: Perl
edits are applied in advance, site & base information is ignored, and
where conditions that implement very simple filters or scheduling have
been deleted.

Here's a breakdown of the constructs used in the simplified extraction
programs.

  - basic programs have no context, pseudovariables, or splits;
  - wrap programs have context codes (encoded in the 'wrap' relation);
    in other words, the output is nested lists, as opposed to flat
    tuples;
  - everything else is 'other'.

domain    #basic   #wrap  #other   total
----------------------------------------
birds         29      13      12      54
cdroms        21       2      13      36
movies        10       0       1      11
news           8       0       2      10
----------------------------------------
              68      15      28     111
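
As a quick consistency check on the breakdown above, here is a minimal
Python sketch that recomputes the row and column totals from the
per-domain counts. The counts are copied from the table; the names
COUNTS, row_totals and column_totals are made up for this illustration
and are not part of the distribution.

```python
# Counts copied from the table above: domain -> (#basic, #wrap, #other).
# COUNTS, row_totals and column_totals are illustrative names only.
COUNTS = {
    "birds": (29, 13, 12),
    "cdroms": (21, 2, 13),
    "movies": (10, 0, 1),
    "news": (8, 0, 2),
}


def row_totals(counts):
    """Total number of programs per domain."""
    return {domain: sum(row) for domain, row in counts.items()}


def column_totals(counts):
    """Total number of programs per kind, in (basic, wrap, other) order."""
    return tuple(sum(column) for column in zip(*counts.values()))


if __name__ == "__main__":
    rows = row_totals(COUNTS)
    cols = column_totals(COUNTS)
    # Print the table again with the recomputed totals.
    for domain, (basic, wrap, other) in COUNTS.items():
        print(f"{domain:<8}{basic:>8}{wrap:>8}{other:>8}{rows[domain]:>8}")
    print(f"{'':<8}{cols[0]:>8}{cols[1]:>8}{cols[2]:>8}{sum(cols):>8}")
```

The grand total should agree with the 111 problems mentioned at the top
of this file.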