We propose to create SciKnowMine (after 'Scientific Knowledge Mine'), a cyberinfrastructure framework that supports biocuration through the automated mining of text, images, and other amenable media at the scale of the entire literature. We will initially use a set of at least one million documents as the primary testing ground and grow from there. Our current work centers on supporting the processes of the Mouse Genome Informatics system at the Jackson Laboratory and is based on a community effort in collaboration with Larry Hunter, Karin Verspoor, Kevin Cohen and Ellen Riloff. This project is funded by NSF from 2009-2012 (#0849977).



A long-standing promise of Biomedical Natural Language Processing (BioNLP) is to accelerate the process of literature-based 'biocuration', in which published information must be carefully and appropriately translated into the knowledge architecture of a biomedical database. Typically, biocuration is a manual activity, performed by specialists with expertise in both biomedicine and the computational representation of the target database. The process of biocuration is considered by some to be a vital linchpin of biomedical informatics.

Although we are interested in understanding the biocuration process as a whole, a key (and possibly more tractable) step is the initial triage of documents as being of interest or not. This is similar to the medical triage process, where patients are classified as being of 'immediate interest', i.e. needing quick medical attention. The triage step sorts the documents so that specialists need to focus only on the documents appropriate for them.
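To make the triage step concrete, the following is a minimal sketch of sorting documents into 'of interest' and 'not of interest'. It is an illustration only and not the SciKnowMine method: it assumes that relevance can be approximated by the fraction of curator-supplied terms found in a document, and the names `Document`, `triage_score`, `triage`, and the `threshold` parameter are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def triage_score(doc, terms):
    """Return the fraction of the given terms that occur in the document.

    Assumption: `terms` holds lowercase, single-word terms chosen by a curator.
    """
    if not terms:
        return 0.0
    tokens = set(re.findall(r"[a-z0-9]+", doc.text.lower()))
    hits = sum(1 for term in terms if term in tokens)
    return hits / len(terms)


def triage(docs, terms, threshold=0.5):
    """Split documents into (of_interest, not_of_interest), best score first."""
    scored = sorted(
        ((triage_score(doc, terms), doc) for doc in docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    of_interest = [doc for score, doc in scored if score >= threshold]
    not_of_interest = [doc for score, doc in scored if score < threshold]
    return of_interest, not_of_interest


if __name__ == "__main__":
    docs = [
        Document("a", "Mouse gene knockout alters phenotype."),
        Document("b", "A survey of galaxy clusters."),
    ]
    keep, skip = triage(docs, {"mouse", "gene", "phenotype"})
    print([d.doc_id for d in keep], [d.doc_id for d in skip])
```

In practice, a trained text classifier would likely replace the term-matching score; the split into a ranked 'of interest' list and a remainder is the part that corresponds to the triage step described above.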

The challenge of delivering effective computational support for triage and subsequent curation of large-scale biomedical databases is still unsolved. We are developing tools to accelerate the process of biocuration in situ for existing biomedical databases (such as the Jackson Laboratory's Mouse Genome Informatics project). We also want to scale up our processing capability to the level of the whole scientific literature, which requires parallel processing over a document set at the scale of tens of millions of publications.

Figure 1: The high-level vision of SciKnowMine. The system provides infrastructure for two communities: (a) for BioNLP developers, to provide components and link them together into scalable pipelines; and (b) for biocurators, to provide a platform for biocuration tools and systems.