USC/ISI Automatic Data Extraction Demo
Step 4: Extracting ISBNs from test pages using the data prototype
Using the patterns in the data prototype learned from the training
examples (Step 1), we identify candidate extracts on the new (test) pages.
Some patterns are specific and match only correct examples of the data
field; others are general and also match many extraneous examples. The test
pages in this example came from the same source (amazon.com). However,
because the layout of the source had changed, the wrapper was no longer
extracting the data from the new pages.
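The difference between specific and general patterns can be illustrated with a minimal sketch. It assumes regular expressions as stand-ins for the learned patterns; the demo's actual pattern language is not shown on this page, so the names SPECIFIC, GENERAL, and candidates, as well as the sample page text, are hypothetical. Only the ISBN value is taken from the results for Page 1.

```python
import re

# Hypothetical stand-ins for learned patterns (assumption: the real data
# prototype uses its own pattern language, which is not shown here).
SPECIFIC = re.compile(r"\b0\d{9}\b")  # a 10-digit token that starts with 0
GENERAL = re.compile(r"\b\d+\b")      # any run of digits

def candidates(pattern, page_text):
    """Return every substring of the page text that matches the pattern."""
    return pattern.findall(page_text)

# Invented page text for illustration; only the ISBN comes from the demo.
page = "Paperback, 309 pages. ISBN: 0590353403. Catalog item 1999."

specific_hits = candidates(SPECIFIC, page)  # only the ISBN
general_hits = candidates(GENERAL, page)    # the ISBN plus extraneous numbers

print(specific_hits)
print(general_hits)
```

In this sketch the specific pattern returns only the ISBN, while the general pattern also returns the page count and the item number, which is the kind of extraneous match described above.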
ISBNs identified by the data prototype on test pages

Page 1: 0590353403 | Page 2: 0399144498 | Page 3: 0127016252
Page 4: 0195111303 | Page 5: 0070428077 | Page 6: 0374522928
Page 7: 0374106452 | Page 8: 0262581116 | Page 9: 0375705694
etc...
Copyright: USC Information Sciences Institute 2000