USC/ISI Automatic Data Extraction Demo

Step 2: Learning data prototype

We use a machine learning algorithm to find the common token patterns that describe the start and end of the training examples. The patterns mix general token types, such as CAPITALized or ALPHAbetic words, with specific words like Smith. The legend appears below.
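The idea of finding common start patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used in the demo: the helper names (`generalize`, `start_patterns`, `common_start_patterns`), the `min_count` threshold, the restriction to the `_CP` and `_AL` types, and the three training strings are all hypothetical.

```python
from collections import Counter
from itertools import product

def generalize(token):
    """Return the token itself plus the general types it belongs to.

    Assumption: only _CP (capitalized) and _AL (alphabetic) are modeled here.
    """
    options = [token]
    if token.isalpha():
        if token[0].isupper():
            options.append("_CP")
        options.append("_AL")
    return options

def start_patterns(tokens, length):
    """All generalized versions of the first `length` tokens of one example."""
    choices = [generalize(t) for t in tokens[:length]]
    return {" ".join(combo) for combo in product(*choices)}

def common_start_patterns(examples, length, min_count):
    """Keep the patterns that start at least `min_count` of the examples."""
    counts = Counter()
    for example in examples:
        # A set per example, so each example votes once per pattern.
        counts.update(start_patterns(example.split(), length))
    return {p: c for p, c in counts.items() if c >= min_count}

# Made-up training examples, already split into space-separated tokens.
examples = ["Smith , A", "White , Anthony", "Roth , David and Ann"]
patterns = common_start_patterns(examples, length=3, min_count=2)
```

With these inputs, general patterns such as `_CP , _CP` are shared by every example and survive the threshold, while fully specific ones such as `Smith , A` occur once and are dropped. Ending patterns could be collected the same way from the last tokens of each example.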
 

Source: whitepages.com

NAME

STARTING PATTERNS
  _CP , _CP
  _CP , _CP and _CP
  _CP , A
  _CP , Anthony
  _CP , David and
  Smith ,
  Smith , A
  White ,
  White , A
  White , Anthony
  Rodham , _CP
  Roth , _CP
  Roth , _CP and
  Sutherland , David and
  Pfeiffer , _CP

ENDING PATTERNS
  Smith , _AL
  White , _AL
  Rodham , _AL
  Roth , _AL
  Pfeiffer , _AL
  Gates , _AL
  , _CP and _AL
  Roth , _CP and _AL
  Sutherland , David and _AL
  Smith , _AC
  White , _AC
  Smith , A
  White , A
  White , _CP
  Rodham , _CP
  Roth , _CP
  Pfeiffer , _CP
  , _CP and _CP
  Roth , _CP and _CP
  Sutherland , David and _CP
  White , Anthony

ADDRESS

STARTING PATTERNS
  _AN _CP _CP
  _AN _CP Ave
  _AN _CP Ave S
  _AN _CP St
  _AN _CP Dr
  _AN _N3D _CP Rd
  _AN Park
  _AC _N3D _CP Rd
  _AC _NLG _CP _CP
  W _NMM
  _NM _CP _CP
  _NM _CP Ave
  _NM _CP Ave S
  _NM _CP St
  _NM _CP Dr
  _NM Park
  _N3D _CP St
  _N3D _CP Dr
  _N3D Park
  _NLG _CP _CP
  _NLG _CP Ave
  _NLG _CP Ave S

ENDING PATTERNS
  _N2D _CP _AL
  _N3D _CP _AL
  _NLG _CP _AL
  _NLG _CP Ave _AL
  _N3D Park _AL
  _N2D _CP _CP
  _N3D _CP _CP
  _NLG _CP _CP
  _N3D Park _CP
  _NM _CP Cir
  _N3D _CP Ave
  _NLG _CP Ave
  _N3D _CP St
  _AC _N3D _CP Rd
  Ave _AC
  _NLG _CP Ave _AC
  Ave S
  _NLG _CP Ave S

CITY

STARTING PATTERNS
  _CP
  _CP Angeles
  Orlando
  Syracuse
  Goleta
  Minneapolis
  Los Angeles

ENDING PATTERNS
  _CP
  Orlando
  Syracuse
  Goleta
  Minneapolis
  Los Angeles

STATE

STARTING PATTERNS
  _AC
  FL
  NY
  CA
  MN

ENDING PATTERNS
  _AC
  FL
  NY
  CA
  MN

PHONE

STARTING PATTERNS
  _N3D - _N3D - _NLG
  _N3D - 425 - 1788
  _N3D - 851 - _NLG
  _N3D - 898 - _NLG
  _N3D - 476 - _NLG
  _N3D - 967 - _NLG
  407 - _N3D - _NLG
  407 - 851 - _NLG
  407 - 898 - _NLG
  315 - _N3D - _NLG
  315 - 425 - 1788
  315 - 476 - _NLG
  805 - _N3D - _NLG
  805 - 967 - _NLG
  630 - _N3D - _NLG
  612 - _N3D - _NLG
  323 - _N3D - _NLG
  213 - _N3D - _NLG
  206 - _N3D - _NLG

ENDING PATTERNS
  407 - _N3D - _NLG
  315 - _N3D - _NLG
  805 - _N3D - _NLG
  630 - _N3D - _NLG
  612 - _N3D - _NLG
  323 - _N3D - _NLG
  213 - _N3D - _NLG
  206 - _N3D - _NLG
  315 - 425 - _NLG
  407 - 851 - _NLG
  407 - 898 - _NLG
  315 - 476 - _NLG
  805 - 967 - _NLG
  315 - 425 - 1788

Legend:
_AC.................All-caps words
_AL.................Alphabetic words
_AN.................Alphanumeric tokens
_CP.................Capitalized words
_LC.................Lowercase words
_NM.................Number
_N1D................1-digit number
_N2D................2-digit number
_N3D................3-digit number
_NLG................Large number
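
To make the legend concrete, here is a minimal Python sketch of how a token might be assigned its general types and how a pattern could then be checked against the start or end of a token sequence. The classification rules are assumptions inferred from the legend (in particular, that a "large number" means four or more digits), and the function names `token_classes`, `matches_start`, and `matches_end` are hypothetical, not part of the demo.

```python
def token_classes(token):
    """General types from the legend that a token belongs to (assumed rules)."""
    classes = set()
    if not token.isascii():
        return classes  # assumption: only ASCII tokens are typed
    if token.isalnum():
        classes.add("_AN")
    if token.isalpha():
        classes.add("_AL")
        if token.isupper():
            classes.add("_AC")
        if token[0].isupper():
            classes.add("_CP")
        if token.islower():
            classes.add("_LC")
    if token.isdigit():
        classes.add("_NM")
        n = len(token)
        # Assumption: 1 to 3 digits get their own type, anything longer is "large".
        classes.add(f"_N{n}D" if n <= 3 else "_NLG")
    return classes

def matches_start(pattern, tokens):
    """True if the first tokens match the pattern, literally or by type."""
    parts = pattern.split()
    if len(tokens) < len(parts):
        return False
    return all(p == t or p in token_classes(t) for p, t in zip(parts, tokens))

def matches_end(pattern, tokens):
    """True if the last tokens match the pattern, literally or by type."""
    parts = pattern.split()
    if len(tokens) < len(parts):
        return False
    return matches_start(pattern, tokens[len(tokens) - len(parts):])

# Illustrative inputs, already split into tokens.
phone = "407 - 851 - 1234".split()
address = "1234 Main Ave S".split()
```

For these inputs, `matches_start("_N3D - _N3D - _NLG", phone)` and `matches_end("Ave S", address)` both hold, while `matches_start("315 - _N3D - _NLG", phone)` does not, because the literal `315` differs from `407`.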






Copyright: USC Information Sciences Institute 2000