The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and process computationally, which presents challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems, we introduce Layout-Aware PDF Text Extraction (LA-PDFText).
How does it work?
The current version of LA-PDFText is a baseline system that extracts text using a three-stage process:
1. identification of blocks of contiguous text
2. classification of these blocks into rhetorical categories
3. extraction of the text from blocks grouped section-wise
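
The three stages above can be illustrated with a minimal Python sketch. This is not LA-PDFText's actual code or API: the Word and Block classes, the function names, the vertical-gap threshold, and the font-size rule are all hypothetical, chosen only to show how block detection, classification, and section-wise extraction could fit together. It also assumes word positions and font sizes have already been read from the PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """A word with layout information (hypothetical input format)."""
    text: str
    x: float          # left edge of the word
    y: float          # vertical position of the word's line, growing down the page
    font_size: float


@dataclass
class Block:
    """A run of contiguous text plus the label assigned in stage 2."""
    words: list = field(default_factory=list)
    label: str = "unknown"

    @property
    def text(self):
        return " ".join(w.text for w in self.words)


def find_blocks(words, max_line_gap=15.0):
    """Stage 1: start a new block whenever the vertical gap is too large."""
    blocks = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if blocks and word.y - blocks[-1].words[-1].y <= max_line_gap:
            blocks[-1].words.append(word)
        else:
            blocks.append(Block([word]))
    return blocks


def classify_blocks(blocks, body_font_size=10.0):
    """Stage 2: toy rule, assumed for illustration: larger text is a heading."""
    for block in blocks:
        mean_size = sum(w.font_size for w in block.words) / len(block.words)
        block.label = "heading" if mean_size > body_font_size else "body"
    return blocks


def extract_sections(blocks):
    """Stage 3: collect body text under the most recent heading."""
    sections = {}
    current = "front matter"
    for block in blocks:
        if block.label == "heading":
            current = block.text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(block.text)
    return {heading: " ".join(parts) for heading, parts in sections.items()}


# Made-up page content: two headings, each followed by body text.
words = [
    Word("Methods", 0, 0, 14),
    Word("We", 0, 20, 10), Word("parsed", 15, 20, 10), Word("PDFs.", 0, 32, 10),
    Word("Results", 0, 60, 14),
    Word("It", 0, 80, 10), Word("worked.", 12, 80, 10),
]
sections = extract_sections(classify_blocks(find_blocks(words)))
print(sections)
```

A real classifier would need many more categories and layout cues than this two-label rule; the sketch only shows the shape of the data as it moves through the pipeline.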
More details of how the program works, together with a performance evaluation, can be found in the publication cited below.
If you use LA-PDFText in your project, please cite us as follows:
Ramakrishnan, C., A. Patnia, E. Hovy and G. Burns (2012). "Layout-Aware Text Extraction from Full-text PDF of Scientific Articles." Source Code for Biology and Medicine 7(1): 7. [http://www.scfbm.org/content/7/1/7/abstract]