Publications

An automatic approach to semantic annotation of unstructured, ungrammatical sources: A first look

Abstract

Numerous data sources on the World Wide Web contain useful information but are not structured or grammatical enough to support traditional information extraction. Furthermore, even if the information could be extracted, the extracted values would need to be standardized to ensure that queries over the source are accurate. This paper presents an automatic, scalable approach to semantically annotating such unstructured, ungrammatical sources with standardized values, allowing for accurate, structured queries of the source. Our technique recasts the information extraction problem as an information retrieval problem, treating each entry of unstructured, ungrammatical text as a query and comparing it to a set of known records called a “reference set.” Furthermore, given a library of reference sets, the system automatically chooses the correct one, making the technique fully unsupervised. We compare our automatic technique to a previous approach that exploits supervised learning and show that we get comparable results. In that previous approach, beyond providing labeled training data, a user must also supply the “reference set” that is used for the extraction and standardization of the values.
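
The retrieval framing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token-overlap (Jaccard) score, the rule of picking the reference set with the highest average best-match score, and all function names and sample data below are assumptions chosen only to make the idea concrete.

```python
import re

def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed stand-in metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(post, reference_set):
    """Treat the post as a query; return (score, record) for the closest record."""
    post_tokens = tokens(post)
    scored = [(jaccard(post_tokens, tokens(" ".join(rec.values()))), rec)
              for rec in reference_set]
    return max(scored, key=lambda pair: pair[0])

def choose_reference_set(posts, library):
    """Pick the reference set whose records best match the posts on average."""
    def avg_score(name):
        return sum(best_match(p, library[name])[0] for p in posts) / len(posts)
    return max(library, key=avg_score)

def annotate(posts, library):
    """Annotate each post with the standardized record it matches best."""
    name = choose_reference_set(posts, library)
    return name, [best_match(p, library[name])[1] for p in posts]

# Hypothetical library of reference sets and hypothetical noisy posts.
library = {
    "cars": [{"make": "Honda", "model": "Civic"},
             {"make": "Toyota", "model": "Corolla"}],
    "hotels": [{"name": "Hyatt Regency", "area": "Downtown"},
               {"name": "Holiday Inn", "area": "Airport"}],
}
posts = ["93 civic 5speed runs great obo", "corolla 2001 low miles $3000"]

chosen, records = annotate(posts, library)
```

Here each post inherits the standardized attribute values of its matched record, which is what would make structured queries over the source possible. A real system would likely need a more robust similarity measure and a threshold for posts that match no record.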

Date
September 22, 2025
Authors
Matthew Michelson, Craig A. Knoblock
Journal
IJCAI’07 Workshop on Analytics for Noisy Unstructured Text Data
Pages
123-130