Special issue on noisy text analytics

Abstract

Noise is an unavoidable fact of life. It can manifest itself at the earliest stages of processing in the form of degraded inputs that our systems must be prepared to handle. People are adept at pattern recognition tasks involving typeset or handwritten documents or recorded speech; machines are less so. From the perspective of downstream processes that take as their inputs the outputs of recognition systems, including document analysis and OCR, noise can be viewed as the errors made by earlier stages of processing, which are rarely perfect and sometimes quite brittle. Noisy unstructured text data is also found in informal settings such as online chat, SMS, email, message board and newsgroup postings, blogs, wikis and web pages. In addition to the aforementioned recognition errors, such text may contain spelling errors, abbreviations, non-standard terminology, missing punctuation, misleading case …

Date: January 1, 1970
Authors: Craig Knoblock, Daniel Lopresti, Shourya Roy, L Venkata Subramaniam
Source: International Journal of Document Analysis and Recognition (IJDAR)
Volume: 10
Pages: 127-128
Publisher: Springer-Verlag