Biomedical Knowledge Engineering Group



BioScholar is a Knowledge Engineering and Management system that enables a single scientific worker (at the level of a graduate student or postdoctoral worker) to design, construct and manage a shared knowledge repository for a research group by curating and processing knowledge from the biomedical scientific literature. In particular, we use BioScholar as the platform for our work in the DARPA Big Mechanism program.

The first sentence to appear in the journal ‘Trends in Neurosciences’ was “Even the most active neuroscientist spends more working hours in reading, reviewing, and writing scientific reports than on direct experimental effort” (Bloom 1978). Seventeen years later, the same author reaffirmed this claim and anticipated the development of knowledge management systems to address the problem (Bloom 1995). Now, a further eighteen years later, only a handful of neuroinformatics systems support neuroscience knowledge drawn from the literature, and none of them provides widespread, practical support beyond simple citation management (as offered by EndNote, Mendeley, Zotero, Papers, etc.). The primary challenge for developers of these systems is the complexity and heterogeneity of experimental neuroscience mechanisms, which directly incorporate detailed anatomical and physiological knowledge across wide spatial (molecular to behavioral) and temporal (electrochemical to evolutionary) scales. Neuroinformatics systems derived from the literature are typically one-off systems developed by neuroscientists to address a single domain-specific problem. The systems thus built tend to be (a) non-portable, (b) manually curated, (c) lacking a detailed provenance trail and (d) small.

In this project, we build on extensive previous work in this field to develop knowledge engineering support for graduate students capturing the data of the papers that they read. We will focus on supporting the activity of Journal Clubs within the graduate student population. Uptake of our software by users in the field depends on high-quality user feedback that directly guides development, so our primary point of contact with students is crucial. Journal clubs provide a well-structured framework to (a) train students in good scholarly practice, (b) provide community-based support for the process of studying the literature and (c) teach critical evaluation of other people’s work. We see them as an ideal, pre-existing context for the careful introduction of a new system into students’ repertoire of knowledge management tools. At each meeting, a single student will present an experimental paper of interest to the relevant graduate student community, and we will work with that student to construct a populated knowledge engineering model of the paper.

  1. Bloom, F.E. (1978). New solutions for science communication problems needed now. Trends in Neurosciences 1.
  2. Bloom, F.E. (1995). Neuroscience-knowledge management: slow change so far. Trends in Neurosciences 18, 48–49.

Knowledge Engineering Enabled Neuroscience (KEEN)

We offer experimental neuroscientists a service: made-to-order, modular, principled, open-source knowledge engineering neuroinformatics software, built on an agile, rapid-deployment model.

Experimental neuroscientists generate data over a wide range of modalities, under complex designs and for a wide range of purposes. Developing knowledge engineering tools to manage, analyse and publish this data is challenging for a neuroscientist PI running a lab, since (a) interpreting the data requires specialized neuroscience knowledge and (b) developing these tools requires expert software engineering work. Thus, even though neuroscience PIs may want (and even need) neuroinformatics tools to showcase, demonstrate, and analyse their data, the process of developing them has, until now, proven prohibitive and impractical. External work-for-hire developers lack the required neuroscience expertise. Off-the-shelf solutions typically do not exist for the niche requirements of individual labs, and open-source implementations are likely to have been built by graduate students and are unlikely to work well outside of their originating laboratories.

This work leverages a code-generation methodology to construct 'scaffolding' for a complete web application based on models of the desired functionality that we develop in close collaboration with scientists. This permits us to build prototypes that can be deployed rapidly in practice and then tailored further to the needs of the sponsoring laboratory. Our goal is to make the process of developing these tools as rapid, straightforward and predictable as possible. We foster best code development practices (unit testing, user stories, documentation, advanced dependency management, one-click installable deployment, evaluation based on maturity models, etc.). All code will be published on GitHub for easy reuse and adoption by the broader neuroinformatics community.

In addition, beyond the initial work of building simple data management and analysis tools, we can provide further support that leverages cutting-edge AI-driven research in collaboration with colleagues at ISI. This includes: (A) text mining full-text articles of interest; (B) application of reproducible workflow analysis technology; (C) integration across existing data sources; (D) use of reasoning systems, ontologies and semantic web technology.

This effort will be led by Gully Burns. As a professional neuroinformatics specialist (D.Phil. from Oxford, 1997), Gully has over 15 years of experience developing software tools for neuroscientists and sees this as a novel strategy to increase the effectiveness and impact of neuroinformatics systems in supporting bench scientists in their research.

Here we showcase the NeuARt system, developed in collaboration with Larry Swanson's group at USC to manage and make available the complex neural connectivity maps that are painstakingly assembled in the Swanson lab. This is an example of the kind of system that we build for use by bench scientists.

If you are a neuroscientist with data that you need to convert into a well-defined neuroinformatics application for use in your lab or the broader community, we are available to work as subcontractors on grants. We also provide access to advanced functionality based on our interactions with colleagues in the AI community. Email Gully Burns for more details. 

KEfED + OoEVV Technology

The Knowledge Engineering from Experimental Design methodology provides a concrete general computational approach to modeling, capturing and publishing research findings based on observations. Here we describe preliminary implementations, the underlying theoretical rationale and validation studies in multiple domains.  

The selection, definition and use of experimental variables is possibly the closest that a bench scientist comes to working with ontological concepts directly. When working with their own data, they understand the importance of standardizing their vocabulary, of defining exactly what they are measuring and how they measured it. Here, we describe work that empowers experimental scientists to define the experimental variables that they are using in a simple, bridging ontological framework (expressed as an ‘ontology design pattern’, ODP) that can then make those definitions available as ontologically defined terms. We emphasize a minimal ontological commitment and tool building that uses widely-used data-entry software (Microsoft Excel) to promote understandability and ease of use. 

We also incorporate mechanisms for interoperability with other ontologies and terminologies such as EFO, OBI, the NINDS Common Data Elements (CDE), and efforts like dbGaP (scientist-driven repositories of variable definitions). As the Knowledge Engineering Working Group of the Biomedical Informatics Research Network (BIRN), we provide terminology support for the mediation technology development in several domains (neuroimaging, NHP HIV vaccine development, immunology, radiation oncology, etc.).

Figure 1: The underlying premise of the KEfED model

Knowledge Engineering from Experimental Design ('KEfED')

KEfED is a knowledge representation of ‘experimental observational assertions’, based on the statistical relations between variables. KEfED elements (see Fig. 1) allow a curator to build data structures based on the dependencies between parameters, constants and measurements that can be derived from a flow diagram of an experimental protocol. Each measurement is indexed by parameters and constants: we trace a path through the protocol from the measurement back to its starting point, and any parameter or constant falling on this path is used as an index for the measurement. This simple idea motivates the need to develop a lightweight, expressive standard terminology of elements to be used in these models.
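The indexing rule described above can be illustrated with a small sketch. This is not the BioScholar implementation: the protocol, the node names and the `index_variables` function are all hypothetical, and the protocol is assumed to be a directed acyclic graph whose edges point downstream.

```python
from collections import deque

# Hypothetical toy protocol. Each node has a kind; edges point downstream.
NODES = {
    "subject": "entity",
    "diagnostic group": "parameter",
    "oddball task": "process",
    "stimulus type": "parameter",
    "button press": "process",
    "reaction time": "measurement",
    "fMRI scan": "process",
    "scanner": "parameter",
    "BOLD signal": "measurement",
}
EDGES = [
    ("subject", "diagnostic group"),
    ("diagnostic group", "oddball task"),
    ("oddball task", "stimulus type"),
    ("stimulus type", "button press"),
    ("button press", "reaction time"),
    ("stimulus type", "fMRI scan"),
    ("fMRI scan", "scanner"),
    ("scanner", "BOLD signal"),
]


def index_variables(measurement, nodes, edges):
    """Return the parameters and constants found upstream of a measurement."""
    parents = {}
    for source, target in edges:
        parents.setdefault(target, []).append(source)

    # Walk backwards from the measurement towards the start of the protocol.
    seen = {measurement}
    queue = deque([measurement])
    index = []
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent in seen:
                continue
            seen.add(parent)
            queue.append(parent)
            if nodes[parent] in ("parameter", "constant"):
                index.append(parent)
    return sorted(index)


for name, kind in NODES.items():
    if kind == "measurement":
        print(name, "is indexed by", index_variables(name, NODES, EDGES))
```

In this toy protocol the scanner lies only on the path to the BOLD signal, so it indexes that measurement but not the reaction time.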

Figure 2: UML class diagram of basic structure of OoEVV showing detailed representation of the specification of measurement scales

Ontology of Experimental Variables and Values (OoEVV)

In an illustrative example, human subjects with or without schizophrenia participated in a functional Magnetic Resonance Imaging (fMRI) scan while performing a particular auditory oddball task (Ford et al. 2009, Schizophren. Bull. 35:58-66). The variables in this experiment include subject characteristics (diagnostic group, age, gender, performance on the task), as well as the experimental conditions of the oddball task (listening to the oddball or standard stimulus), and variations in the data collection methods (which fMRI scanner was used). Each variable is defined with its own mathematical characteristics for this study: ‘Age’ and the ‘BOLD signal’ are continuous numeric variables, whereas ‘diagnostic category’ and ‘gender’ have no units and cannot be added or subtracted meaningfully. OoEVV captures this usually implicit information.

The basic components of OoEVV are shown in Fig. 2 as a UML class diagram. An OoEVVElementSet instance denotes a collection containing all variables relevant to a given domain, such as fMRI. An ExperimentalVariable instance measures a ‘quality’ (a Term instance denoting a reference to the external characteristic within the world that the variable measures). In our example, ‘age in years’ and ‘experimental condition’ are two such variables. The ‘age in years’ variable measures the age of the subject at the time of the experiment in years, and could be linked to the relevant term from the Phenotype, Attribute, and Trait Ontology (PATO, PATO:0000011). The ‘experimental condition’ variable indicates whether the data were collected during the ‘oddball’ or ‘standard tone’ conditions of the auditory oddball task, and links to the Cognitive Paradigm Ontology (CogPO, CogPOver1:COGPO_00110).

Each variable links to a MeasurementScale instance that delimits the types of computation that may be performed on a given variable and the range of possible values for that variable. The ‘age in years’ variable uses an IntegerScale (a specialization of NumericScale), while the ‘experimental condition’ uses a NominalScale (denoting values that may only be compared to see if they are the same). Other scale types include OrdinalScale (denoting values that may only be ranked), BinaryScale (denoting variables that take only ‘true’ or ‘false’ values), RelativeScale (denoting values that can only be defined by their relation to other objects), and HierarchicalScale (with values organized in a hierarchical structure, such as organismal taxonomy). Since OoEVV is only a specification for experimental variable definitions, we use MeasurementValue instances to assist with the specification of each MeasurementScale rather than to represent data (at this stage).
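As a rough illustration of how a scale can delimit the computations permitted on a variable, consider the sketch below. It borrows the class names ExperimentalVariable and MeasurementScale from Fig. 2, but it is not the OoEVV code: the operation names, the `can_compute` helper and the ‘age group’ variable are assumptions made for the example, and only three of the scale types are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementScale:
    """A scale lists the computations that are meaningful for a variable."""
    name: str
    operations: frozenset

    def allows(self, operation):
        return operation in self.operations


# Assumed operation sets, chosen to follow the descriptions in the text.
NOMINAL_SCALE = MeasurementScale("NominalScale", frozenset({"equals"}))
ORDINAL_SCALE = MeasurementScale("OrdinalScale", frozenset({"equals", "rank"}))
INTEGER_SCALE = MeasurementScale(
    "IntegerScale", frozenset({"equals", "rank", "add", "subtract"})
)


@dataclass(frozen=True)
class ExperimentalVariable:
    name: str
    measures: str  # label of the quality that the variable measures
    scale: MeasurementScale


age_in_years = ExperimentalVariable("age in years", "age", INTEGER_SCALE)
# Hypothetical second variable that measures the same quality on another scale.
age_group = ExperimentalVariable("age group", "age", ORDINAL_SCALE)
condition = ExperimentalVariable(
    "experimental condition", "oddball task condition", NOMINAL_SCALE
)


def can_compute(variable, operation):
    """True if the operation is meaningful on the variable's scale."""
    return variable.scale.allows(operation)


for variable in (age_in_years, age_group, condition):
    print(variable.name, "supports subtraction:", can_compute(variable, "subtract"))
```

Keeping the scale separate from the variable means that two variables can measure the same quality while permitting different computations, as ‘age in years’ and ‘age group’ do here.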

It is crucial to note that this formulation allows us to define multiple variables that measure the same underlying quality with different mathematical scales. Our latest paper presents a worked example of OoEVV curation for a single fMRI study.

Figure 3: Screencap of the current BioScholar KEfED editor system 

KEfED Curation

Fig. 3 shows ‘BioScholar’, a KEfED-enabled curation tool. This allows a researcher to draw a protocol in a graphical interface showing the entities, processes and variables (constants, parameters and measurements) within an experiment. The system automatically builds data tables from the protocol design that could be used as the basis for a data repository. We have developed OoEVV to provide definitions of these elements as an ontology that can also support links to related terms in formal ontologies.

Figure 4: Organization of commands, components and data flow in the current KefedAdmin system. 

OoEVV Tools and Curation

A goal of OoEVV is to provide a framework that domain experts can easily use. Fig. 4 shows the functional organization of a command-line application that uses spreadsheets to curate terminology (using standard file-sharing tools such as Dropbox, Google Docs or Subversion to manage the files). Each Excel workbook corresponds to a separate OoEVVElementSet. This permits us to provide detailed examples and instructions for handling exception cases in a way that we may adjust as the project progresses. The user can create a formatted spreadsheet (generateOoevvSpreadsheet) that may be filled out according to our curation manual. The user may then add the contents of this file to an OWL file (ooevvSpreadsheetToOwl); this command can be run repeatedly over a set of spreadsheets to build an extended representation. A user may also aggregate multiple spreadsheets into a MySQL database (ooevvDirToDatabase / ooevvSpreadsheetToDatabase), which may then be examined in a web-viewer application (Fig. 5). The example in Fig. 5 shows an antibody (typically used as a parameter in an experiment) and links to the EFO definition of an antibody. Finally, to provide a centralized set of definitions, a curator may run the databaseToOwl function, which generates an OWL file that can be used to check that the model generated by the process is classifiable. This file may then be uploaded to the National Center for Biomedical Ontology’s BioPortal system to provide a centralized, versioned representation of OoEVV.

Figure 5: OoEVV Viewer Prototype.

BIRN Applications and Users

A primary capability of our work within BIRN is to provide a simple methodology for constructing ontologies for end-users that are appropriate for their needs. Given the large overhead incurred by building ontologies in various domains, we developed OoEVV to identify the sets of sub-elements that end-users need for their experimental work. Within BIRN, this was typically based on support of the BIRN mediator system (Ashish et al., 2010, Front. Neuroinform. 4:118). As an ODP, we anticipate that OoEVV tools may be used as a support system for other ontologies as our implementation improves. We are currently focussed on supporting numerous experimental domains including (a) neuroanatomical tract tracing experiments, (b) fMRI, (c) genetic childhood neurodevelopmental disorders, (d) radiation oncology studies, (e) stroke studies, (f) drug infusion studies, and (g) vaccine protection studies. Our development work within BIRN focusses on ‘capabilities’.

Related Work and Discussion

We have developed a lightweight system, appropriate to our needs but consistent and interoperable with other efforts such as OBI and EFO. Some features of our representation are original and potentially important. The definition of multiple variables that measure the same underlying quality in different ways allows us to model accurately how different experimentalists gather data. Our detailed representation of different measurement scales provides an extensible ‘type system’ for each variable. We use UML as the base for our representation, which restricts our use of OWL-based formal reasoning but makes it easier to construct tools and curation processes. This is consistent with our focus on ODPs and tool construction but should be viewed as complementary to, and supportive of, a more formal approach. Our OWL ontology is currently very simple, and future work will center on linking it to OBI and other efforts and on further developing its use in KEfED, in other contexts in BIRN and in information integration applications. The work that we describe here is supported by open-source tools.


This work was supported by NIH with FBIRN (RR021992); Biomedical Informatics Research Network (RR025736); CogPO (MH084812); and BioScholar (GM083871). We thank Tom Russ, Swati Raina, Karthik Narasandra Manjunatha, Jose Luis Ambite, Maria Muslea, Naveen Ashish, Alex Paciorski, Ona Wu, and Vitali Moiseenko.

This page was derived from the poster presented at the ISMB 2012 Bio-ontologies meeting in Long Beach, California.


The Portable Document Format (PDF) is the almost universally used file format for online scientific publications. It is also notoriously difficult to read and handle computationally, presenting challenges for developers of biomedical text mining or biocuration informatics systems that use the published literature as an information source. To facilitate the effective use of scientific literature in such systems we introduce Layout-Aware PDF Text Extraction (LA-PDFText).

How does it work?

The current version of LA-PDFText is a baseline system that extracts text using a three-stage process:

  1. identification of blocks of contiguous text;
  2. classification of these blocks into rhetorical categories;
  3. extraction of the text from blocks grouped section-wise.
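To make the three stages concrete, here is a deliberately simplified sketch. It is not the LA-PDFText code or its API: the `Line` record, the gap and font-size thresholds, and the single heading-versus-body rule are all assumptions made for illustration, and a real system would work from the layout information in the PDF itself.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float        # vertical position on the page (larger is lower)
    font_size: float


def find_blocks(lines, max_gap=14.0):
    """Stage 1: merge vertically adjacent lines of equal font size into blocks."""
    blocks = []
    for line in sorted(lines, key=lambda item: item.top):
        last = blocks[-1][-1] if blocks else None
        if (
            last is not None
            and line.top - last.top <= max_gap
            and line.font_size == last.font_size
        ):
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks


def classify(block, heading_size=12.0):
    """Stage 2: a toy rule where large type marks a section heading."""
    return "heading" if block[0].font_size >= heading_size else "body"


def extract_sections(lines):
    """Stage 3: collect body text under the most recent heading."""
    sections = {}
    current = "front matter"
    for block in find_blocks(lines):
        text = " ".join(line.text for line in block)
        if classify(block) == "heading":
            current = text
            sections.setdefault(current, "")
        else:
            sections[current] = (sections.get(current, "") + " " + text).strip()
    return sections


# Hypothetical page content, standing in for lines read from a PDF.
PAGE = [
    Line("Methods", 100, 14),
    Line("Subjects performed an", 120, 10),
    Line("auditory oddball task.", 132, 10),
    Line("Results", 170, 14),
    Line("Patients responded more slowly.", 190, 10),
]

for heading, body in extract_sections(PAGE).items():
    print(heading, "->", body)
```

The sketch shows only how the output of each stage feeds the next; the classification rule in particular is a stand-in for whatever rules or models a real extractor uses.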

More details of how the program works (together with a performance evaluation) can be found in the publication cited below.

Citing LA-PDFText

If you use LA-PDFText in your project, please cite us as follows:

Ramakrishnan, C., A. Patnia, E. Hovy and G. Burns (2012). "Layout-Aware Text Extraction from Full-text PDF of Scientific Articles." Source Code for Biology and Medicine 7(1): 7.


Here, we focus on using the KEfED approach to provide a data management system (either for an individual laboratory, for a research community or even for a funding agency). 

The phrase ‘dark data’ draws a parallel between how the majority of mass in the universe is unobservable (as dark matter or dark energy, Cho 2010) and how the majority of scientific data is inaccessible, locked away in the file cabinets or disk drives of individual researchers (Heidorn 2008). Accessing this data is both technically and sociologically challenging and has been a highly desired outcome for funding agencies for over a decade (Koslow 2000), as evidenced by the emergence of many data management systems (Fernández-Suárez and Galperin 2013) and by the fact that data-sharing plans are required as a condition for funding (NIH data sharing policy). One reason why data generated in laboratories continues to remain ‘dark’ is that tools do not exist that routinely model and capture data at the earliest stage of their lifecycle in the laboratory. An underlying challenge is that existing schemas for scientific informatics are designed by systems administrators or computer programmers; they do not necessarily reflect the statistical design of experiments in a way that scientists can themselves understand. The core contribution of our proposed project is to address this by providing an open-source, practical data repository that closely matches the underlying data structure generated by each experiment’s protocol (in a way that is also understandable to scientists).

As described, there are many efforts to build ‘electronic lab notebook’ (ELN) systems, both in the academic and commercial arenas. This fact, coupled with the observation that there is limited uptake of this technology in general, indicates that the underlying problem remains unsolved. Spreadsheets (e.g., Microsoft Excel) are still the most commonly used data management solution because they fit the immediate needs of bench scientists very well and have a shallow learning curve. Excel is, however, disastrous in terms of supporting data standards, computational integration, long-term storage, scalability and tracking data provenance. Some important criteria to address are: (A) Scientists need practical tools that work ‘out of the box’. (B) The technical design of the informatics system should leverage researchers’ statistical expertise. (C) The system should be ontologically enabled. (D) The system should enable the creation of supplemental data for publication. (E) The system should interoperate with other data repositories. This proposal addresses all of these criteria in the context of an influential and highly productive community.

  1. A. Cho. A recipe for the cosmos. Science, 330(6011):1615, 2010.
  2. P. B. Heidorn. Shedding light on the dark data in the long tail of science. Library Trends, 57(2):280-299, 2008.
  3. S. H. Koslow. Should the neuroscience community make a paradigm shift to sharing primary data? Nat Neurosci, 3(9):863-865, 2000.
  4. Xosé M. Fernández-Suárez and Michael Y. Galperin. The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 41(Database issue):D1-7, January 2013. PMID: 23203983.
  5. NIH data sharing policy.

We propose to create a framework to support biocuration called SciKnowMine (after 'Scientific Knowledge Mine'): a cyberinfrastructure that supports biocuration through the automated mining of text, images, and other amenable media at the scale of the entire literature. We will initially address at least one million documents as the primary testing ground, and grow from there. Our current work is centered on supporting the processes of the Mouse Genome Informatics system at the Jackson Laboratory and is based on a community effort in collaboration with Larry Hunter, Karin Verspoor, Kevin Cohen and Ellen Riloff. This project is funded by NSF from 2009-2012 (#0849977).



A long-standing promise of Biomedical Natural Language Processing is to accelerate the process of literature-based ‘biocuration’, where published information must be carefully and appropriately translated into the knowledge architecture of a biomedical database. Typically, biocuration is a manual activity, performed by specialists with expertise in both biomedicine and the computational representation of the target database. The process of biocuration is considered by some to be a vital linchpin of biomedical informatics.

Although we are interested in understanding the biocuration process as a whole, a key (and possibly more tractable) step is the initial triage of documents as being of interest or not (similar to the medical triage process, where patients are classified as being of 'immediate interest', i.e. needing quick medical attention). This step sorts the documents so that specialists need only focus on the documents appropriate for them.

The challenge of delivering effective computational support for triage and subsequent curation of large-scale biomedical databases is still unsolved. We are developing tools to accelerate the process of biocuration in situ for existing biomedical databases (such as the Jackson Laboratory's Mouse Genome Informatics project). We also want to scale up our processing capability to the level of the whole scientific literature (requiring parallel processing over a document set at the scale of tens of millions of publications).

Figure 1: The high-level vision of SciKnowMine. The system provides infrastructure for two communities: (a) for BioNLP developers, a way to provide components and link them together into scalable pipelines; and (b) for BioCurators, a platform for biocuration tools and systems.


VPDMf stands for 'View Primitive Data Model framework' and is a scaffolding framework for developing knowledge-engineering systems based on a simple frame-oriented conceptual design. The system uses an encapsulation mechanism to capture schema elements into views in the design phase of developing a system. It is implemented in Java and uses a simplified representation of UML class diagrams to provide the basic architecture for the system. From this base, it is possible to generate source code for a variety of elements that are useful for biomedical knowledge engineering systems.

Figure 1: The basic scaffolding system provided by the VPDMf

The organization of the system for constructing forward-engineered applications is shown in Figure 1. This illustrates how the VPDMf can rapidly generate a set of synchronized software elements that are archived together in a single file, and then used to generate a functional system. Executing VPDMf commands to build a system will (a) execute scripts to construct a MySQL database and populate it with preliminary data if specified, (b) generate and deploy Java Persistence Entity classes and (c) generate and deploy ActionScript classes for use within a Flex-based client. Design files are deployed under the Maven system, which identifies models using a ‘GAV’ code (standing for ‘Group/Artifact/Version’); for example, ‘edu.isi.bmkeg’ / ‘ooevv’ / ‘0.0.7-SNAPSHOT’ denotes the latest version of the OoEVV project’s model. Because this process generates source code that must be compiled into target systems as dependencies, any changes in the underlying design that break downstream components cause compile errors that are easy to detect and fix. This embodies the underlying pragmatic philosophy of our approach: we base our work on existing low-level coding tools and practices. For example, using UML means that we can easily export representations to object-oriented languages, and using Maven permits us to deploy our systems easily as component libraries available from public repositories.