The Portable Document Format (PDF) is the file format used almost universally for online scientific publications. It is also notoriously difficult to read and process computationally, which presents challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems, we introduce Layout-Aware PDF Text Extraction (LA-PDFText).

How does it work?

The current version of LA-PDFText is a baseline system that extracts text using a three-stage process:

1. Identification of blocks of contiguous text.
2. Classification of these blocks into rhetorical categories.
3. Extraction of the text from blocks grouped section-wise.
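
The three stages above can be illustrated with a minimal Python sketch. This is not LA-PDFText's code or API: the `Word` and `Block` types, the function names, the vertical-gap threshold, and the font-size rule for detecting headings are all hypothetical assumptions chosen only to show how block identification, classification, and section-wise extraction could fit together.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word with its page position and font size (hypothetical input type)."""
    text: str
    x: float
    y: float
    font_size: float


@dataclass
class Block:
    """A run of contiguous words plus the category assigned in stage 2."""
    words: list
    label: str = "unclassified"

    @property
    def text(self):
        return " ".join(w.text for w in self.words)

    @property
    def font_size(self):
        return max(w.font_size for w in self.words)


def identify_blocks(words, max_gap=15.0):
    """Stage 1: group words into blocks of contiguous text.

    Assumption: a vertical gap larger than max_gap starts a new block.
    """
    blocks = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks and word.y - blocks[-1].words[-1].y <= max_gap:
            blocks[-1].words.append(word)
        else:
            blocks.append(Block(words=[word]))
    return blocks


def classify_blocks(blocks, heading_size=14.0):
    """Stage 2: label blocks with a toy rule (large font means heading)."""
    current = "front-matter"
    for block in blocks:
        if block.font_size >= heading_size:
            block.label = "heading"
            current = block.text.lower()
        else:
            # Body text inherits the name of the most recent heading.
            block.label = current
    return blocks


def extract_sections(blocks):
    """Stage 3: join the body text of the blocks, grouped by section."""
    sections = {}
    for block in blocks:
        if block.label == "heading":
            sections.setdefault(block.text.lower(), [])
        else:
            sections.setdefault(block.label, []).append(block.text)
    return {name: " ".join(parts) for name, parts in sections.items()}


def extract(words):
    """Run the three stages in order."""
    return extract_sections(classify_blocks(identify_blocks(words)))
```

Calling `extract` on a list of `Word` objects returns a dictionary that maps each section name to its text. A real system would need far richer layout cues (columns, fonts, page boundaries) than this single-column toy uses.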

More details on how the program works, together with a performance evaluation, can be found in this publication.

