USC/ISI Automatic Data Extraction Demo

Step 2: Learning data prototype

We use a machine learning algorithm to find the common token patterns that describe the start and end of the training examples from Step 1. The tokens include general token types, such as CAPITALized or ALPHAbetic words, as well as specific words like Smith. The legend appears below.
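
As a rough illustration of how such a pattern could be checked against an example, the following Python sketch tokenizes a string and tests whether it starts or ends with a given pattern. It is not the demo's actual implementation: the function names, the tokenizer regex, the wildcard definitions, and the sample string "John A. Smith" are all assumptions made for this example.

```python
import re

# Assumed wildcard tests (hypothetical definitions, covering only the
# three symbols this example needs).
WILDCARDS = {
    "_AL": lambda t: t.isalpha(),                    # alphabetic word
    "_CP": lambda t: t.isalpha() and t[0].isupper(),  # capitalized word
    "_AC": lambda t: t.isalpha() and t.isupper(),     # all-caps word
}

def tokenize(text):
    """Split text into words, digit runs, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", text)

def token_matches(symbol, token):
    """A wildcard symbol uses its test; anything else must match literally."""
    test = WILDCARDS.get(symbol)
    return test(token) if test else symbol == token

def starts_with(tokens, pattern):
    """True if the first len(pattern) tokens match the pattern in order."""
    return len(tokens) >= len(pattern) and all(
        token_matches(s, t) for s, t in zip(pattern, tokens)
    )

def ends_with(tokens, pattern):
    """True if the last len(pattern) tokens match the pattern in order."""
    return starts_with(tokens[::-1], pattern[::-1])

tokens = tokenize("John A. Smith")  # made-up input, not from the demo data
print(tokens)                                           # ['John', 'A', '.', 'Smith']
print(starts_with(tokens, ["_CP", "_AC", ".", "_CP"]))  # True
print(ends_with(tokens, ["_AC", ".", "_CP"]))           # True
```

Under these assumed definitions a single-letter initial such as "A" counts as both capitalized and all caps, so one token can satisfy more than one wildcard.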
 

Source: amazon.com

AUTHOR
STARTING PATTERNS  ENDING PATTERNS
  _AL _CP 
  _AL _AC . _CP 
  _CP _CP 
  _CP _AC . _CP 
  _CP _CP 
  _AC . _CP 

TITLE

STARTING PATTERNS ENDING PATTERNS
 _CP _CP 
  _CP _CP _LC the 
  _CP _CP ( 
  _CP in 
  The _CP _CP 
 _CP _AL _CP 
  ' s _CP 
  _CP _AL ) 

ISBN

STARTING PATTERNS ENDING PATTERNS
  _NGG    _NGG 

PRICE

STARTING PATTERNS ENDING PATTERNS
  _N2D . 00 
  _N2D . 36 
  _N2D . 80 
  _N2D . 96 
  _N2D . 95 
  10 . 36 
 _N2D . _N2D 
  21 . _N2D 
  14 . _N2D 
  11 . _N2D 
  10 . _N2D 
  21 . 00
  10 . 36 

AVAILABILITY

STARTING PATTERNS ENDING PATTERNS
  _CP ships within 24 hours 
  _CP title usually ships within 2 - 3 days 
  Usually ships within 24 hours 
  This title usually ships within 2 - 3 days 
  Usually ships within 24 _LC 
  This title usually ships within 2 - 3 _LC 
  Usually ships within 24 hours 
  This title usually ships within 2 - 3 days

Legend:
_AC..............All caps words
_AL..............Alphabetic words
_AN.............Alphanumeric tokens
_CP..............Capitalized words
_LC..............Lowercase words
_NM.............Number
_N1D............1 digit number
_N2D............2 digit number
_N3D............3 digit number
_NLG............Large number
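
The legend can be read as a set of tests applied to each token. The sketch below lists every legend type a token would satisfy. It is an illustration under stated assumptions, not the demo's code: the function name `token_types` is hypothetical, "large number" is assumed to mean more than three digits, and a token is assumed to be allowed several types at once.

```python
def token_types(token):
    """Return the legend types that a single token satisfies (assumed rules)."""
    types = []
    if token.isalnum():
        types.append("_AN")          # alphanumeric token
    if token.isalpha():
        types.append("_AL")          # alphabetic word
        if token.isupper():
            types.append("_AC")      # all caps
        if token[0].isupper():
            types.append("_CP")      # capitalized
        if token.islower():
            types.append("_LC")      # lowercase
    if token.isdigit():
        types.append("_NM")          # number
        # Assumption: anything longer than three digits is a "large number".
        by_length = {1: "_N1D", 2: "_N2D", 3: "_N3D"}
        types.append(by_length.get(len(token), "_NLG"))
    return types

print(token_types("Smith"))  # ['_AN', '_AL', '_CP']
print(token_types("hours"))  # ['_AN', '_AL', '_LC']
print(token_types("24"))     # ['_AN', '_NM', '_N2D']
print(token_types("."))      # []
```

Punctuation such as "." receives no type here, which is consistent with the patterns above listing punctuation as literal tokens.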






Copyright: USC Information Sciences Institute 2000