What is Information Extraction

Abstract

(A lot of this comes from [2]. A lot also comes from Heng Ji's (UIUC) IE class slides.) By 1994 the US government was already familiar with information retrieval: search a corpus of documents and retrieve those that match your search terms, so you don't have to read so many documents! But documents can be long; how about we search for just the facts we want instead? Originally, there were templates for all the information the government wanted to find (e.g., locations and actions of ships, scraped from navy telegraph cables). This was tested in a series of evaluations (this is how evaluations to drive NLP research got going!). At first, you had to participate in the entire pipeline of various kinds of information extraction, but then (in 1995) the evaluations were split into independent and more generic tasks to encourage participation by smaller teams. One of the tasks that year was 'Named Entity Recognition.' This eventually led to the 'Automatic Content Extraction' program (ACE), which focused on even more fine-grained, independent tasks. The corpora produced in 2005 by ACE are still used, nearly 20 years later.

Date: February 11, 2010
Authors: Jonathan May
