The 'Building a Breakthrough Machine' Research Initiative
Thomas Kuhn introduced the concept of a paradigm as ‘coherent traditions of scientific work’ comprising laws, theories, applications and instrumentation that remain largely static and stable but occasionally undergo massive upheaval as part of the natural process of scientific discovery [Kuhn 1962]. These paradigms provide well-defined frameworks for scientists to work within, until the explanations provided by the existing paradigm no longer match the data being gathered. At this point the accepted approach must change (a ‘paradigm shift’), and it is at these abrupt transitions of the scientific process that great discoveries can be made. Within this research initiative, our goal is to build AI systems that are capable of facilitating and accelerating this process of discovery in biomedicine.
Unlike other efforts to encapsulate scientific discovery as a challenge problem for computer scientists to solve algorithmically [Langley 1987], we will address this issue as a series of biomedical informatics challenge projects under a broadly defined framework that is itself an open-ended research program. The large-scale design of the initiative is guided by the following high-level schematic (Figure 1).
Figure 1: A unifying model of the cycles of scientific investigation, available knowledge sources and computational approaches needed to populate and drive the process of passing knowledge around the cycle to enable discovery.
We focus on two domains of discourse: one theoretical, domain-specific and interpretive (the ‘Interpretations’ on the right-hand side of Figure 1), the other experimental, based on statistical elements (and therefore domain-independent) and observational (the ‘Observations’ on the left-hand side of Figure 1). Our methodology views the scientific process as cycling over these two domains in five steps: (a) reasoning within a specific subject to generate a testable hypothesis or frame a research question; (b) designing an experiment to test the hypothesis (or address the question); (c) executing the experiment so that a researcher can make structured ‘observational assertions’ from experimental data; (d) processing and aggregating these individual experimental outcomes to generate domain-specific ‘interpretive assertions’ (often referred to as ‘facts’, ‘statements’ or ‘claims’); and (e) aggregating these assertions in turn to contribute back to the collection of domain knowledge.
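Steps (c) to (e) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the initiative's implementation: the class names, the `interpret` function and the threshold rule (a difference of group means standing in for a proper statistical test) are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class ObservationalAssertion:
    """A structured, domain-independent statement about measured data (step c)."""
    variable: str
    condition: str
    value: float


@dataclass(frozen=True)
class InterpretiveAssertion:
    """A domain-specific claim derived from aggregated observations (step d)."""
    claim: str
    support: int  # number of observations behind the claim


@dataclass
class KnowledgeBase:
    """Accumulated domain knowledge, the 'Interpretations' side (step e)."""
    claims: list = field(default_factory=list)

    def add(self, assertion: InterpretiveAssertion) -> None:
        self.claims.append(assertion)


def interpret(observations, variable, treated, control, threshold):
    """Aggregate observations of one variable into a single interpretive claim.

    Assumption: comparing group means against a fixed threshold is only a
    placeholder for whatever statistical test a real study would use.
    Raises statistics.StatisticsError if either group has no observations.
    """
    def values(condition):
        return [o.value for o in observations
                if o.variable == variable and o.condition == condition]

    treated_values, control_values = values(treated), values(control)
    difference = mean(treated_values) - mean(control_values)
    if difference > threshold:
        direction = "increases"
    elif difference < -threshold:
        direction = "decreases"
    else:
        direction = "does not change"
    return InterpretiveAssertion(
        claim=f"{treated} {direction} {variable} relative to {control}",
        support=len(treated_values) + len(control_values),
    )
```

A usage example would build a list of `ObservationalAssertion` records from one experiment, pass them to `interpret`, and store the result with `KnowledgeBase.add`. Keeping the observational records free of domain vocabulary is what allows the same aggregation code to serve different subject areas.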
An almost identical high-level view has recently been echoed by other leading researchers attempting to construct systems that can reason scientifically (see Figure 1 from [Soldatova 2011] or Figure 2 from [Clark 2007]). Our formulation of this framework has been published as a ‘highly accessed’ paper in BMC Bioinformatics, together with preliminary tools to deliver it to end users [Russ 2011].
Figure 1 also illustrates our strategy for acquiring knowledge for use within this framework. Broadly speaking, there are three ways to access biomedical information: (a) from the scientific literature, (b) from online databases, and (c) from ‘raw data’ (generated directly from the bench or shared between groups through computational sharing mechanisms). This directly frames the immediate computer science challenges that we must contend with.
The process of converting information from published papers into database entries is referred to as ‘biocuration’. This activity is the lynchpin of much of biomedical informatics [Bourne 2006], and is typically challenging, expensive and time-consuming. This provides opportunities for research and systems development for the Natural Language Group, particularly within the field of ‘Machine Reading’ [Etzioni 2006, Mulkar 2007], in building systems designed to assist biocuration.
Online scientific information is widely available. The ‘Database Issue’ of the journal ‘Nucleic Acids Research’ is a catalogue of actively maintained molecular biology databases. The 2010 issue lists 1230 database systems [Cochrane 2010]. This is probably an underestimate of the total number of biomedical informatics databases. This presents opportunities for the Information Integration Group, since each separate database provides a source schema that may be integrated and synthesized into customized representations.
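The idea of mapping several source schemas into one customized representation can be illustrated with a small Python sketch. The function names, field names and record values below are invented for illustration and do not describe any particular database or the group's actual integration tools.

```python
def to_target(record: dict, mapping: dict) -> dict:
    """Rename one source record's fields into the shared target schema.

    `mapping` goes from source field name to target field name. Source
    fields that are not listed in the mapping are dropped.
    """
    return {target: record[source]
            for source, target in mapping.items()
            if source in record}


def integrate(sources):
    """Merge records from several sources into one list in the target schema.

    `sources` is an iterable of (records, mapping) pairs, one per database.
    """
    merged = []
    for records, mapping in sources:
        merged.extend(to_target(record, mapping) for record in records)
    return merged


# Two hypothetical databases describing the same kind of entity
# under different field names (placeholder values only).
database_a = [{"gene_symbol": "GENE1", "organism": "human"}]
mapping_a = {"gene_symbol": "gene", "organism": "species"}

database_b = [{"symbol": "GENE2", "taxon": "mouse", "internal_id": 17}]
mapping_b = {"symbol": "gene", "taxon": "species"}

unified = integrate([(database_a, mapping_a), (database_b, mapping_b)])
```

Here `unified` holds both records under the shared field names `gene` and `species`. A realistic mediator would also have to reconcile identifiers, units and vocabularies, which this sketch leaves out.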
‘Raw data’ can occur in a great many forms, and laboratory scientists probably use them all. They typically use whatever systems are immediately available and convenient for them and store data in files (MS Office documents, analysis programs, etc.) or lightweight databases. There are opportunities for knowledge representation research tailored to biomedical data [Burns 2009] and for grid computing to establish and support data sharing in virtual communities (including file sharing, metadata catalogues and web portals).
The conversions between the various stages of the cycle may be carried out with workflow systems or with knowledge representation and reasoning methods. Semantic web technology (ontologies, linked open data) can serve as a mechanism to share ‘standard’ representations and to link to collaborators and to the general research community.
Two additional activities contribute to this initiative:
- The Interactive Knowledge Capture group is specifically concerned with scientific workflows, which may be applied to automate data processing and enable large-scale data analyses to be reproduced (reproducibility being a serious problem currently facing computational researchers).
- Work by Pedro Szekely on human-user interaction is particularly important to biomedical scientists, since these end users are typically very busy, driven by deadlines and have little tolerance for poorly constructed interfaces and tools. It is crucial that the systems we build are well constructed and enable scientists to focus on the process of discovery, rather than the mundane issues of how to interact with their computational tools.
Biomedical informatics has always been a deeply collaborative discipline, requiring participation from biomedical and computational experts. We must recognize that our bioinformatics work will always be driven by requirements determined by either biomedical scientists or bioinformatics experts. Therefore, the development of intuitive user interfaces and data visualization systems is a very important requirement for all the systems that we build.
[Ashish 2010] Ashish, N., Ambite, J. L., Muslea, M., and Turner, J. A. (2010). Neuroscience Data Integration through Mediation: An (F)BIRN Case Study. Frontiers in neuroinformatics, 4(December):118.
[Bourne 2006] Bourne, P. E. and McEntyre, J. (2006). Biocurators: contributors to the world of science. PLoS Comput Biol, 2(10):e142.
[Burghes 2009] Burghes, A. H. M. and Beattie, C. E. (2009). Spinal muscular atrophy: why do low levels of survival motor neuron protein make motor neurons sick? Nature reviews. Neuroscience, 10(8):597–609.
[Burns 2009] Burns, G. and Russ, T. (2009). Biomedical knowledge engineering tools based on experimental design: a case study based on neuroanatomical tract-tracing experiments. In KCAP 2009, Long Beach CA.
[Chalupsky 2010] Chalupsky, H., MacGregor, R. M., and Russ, T. (2010). PowerLoom Manual. USC Information Sciences Institute, ver 1.48 edition. http://www.isi.edu/isd/LOOM/PowerLoom/documentation/manual/manual.pdf.
[Clark 2007] Clark, T. and Kinoshita, J. (2007). Alzforum and swan: the present and future of scientific web communities. Brief Bioinform, 8(3):163–71.
[Cochrane 2010] Cochrane, G. R. and Galperin, M. Y. (2010). The 2010 nucleic acids research database issue and online database collection: a community of data resources. Nucleic Acids Res, 38(Database issue):D1–4.
[Etzioni 2006] Etzioni, O., Banko, M., and Cafarella, M. J. (2006). Machine reading.
[Galperin 2011] Galperin, M. Y. and Fernandez-Suarez, X. M. (2011). The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic acids research, 40(December 2011):1–8.
[Gaudet 2011] Gaudet, P., Bairoch, A., Field, D., Sansone, S.-A., Taylor, C., Attwood, T. K., Bateman, A., Blake, J. A., Bult, C. J., Cherry, J. M., Chisholm, R. L., Cochrane, G., Cook, C. E., Eppig, J. T., Galperin, M. Y., Gentleman, R., Goble, C. A., Gojobori, T., Hancock, J. M., Howe, D. G., Imanishi, T., Kelso, J., Landsman, D., Lewis, S. E., Mizrachi, I. K., Orchard, S., Ouellette, B. F. F., Ranganathan, S., Richardson, L., Rocca-Serra, P., Schofield, P. N., Smedley, D., Southan, C., Tan, T. W., Tatusova, T., Whetzel, P. L., White, O., and Yamasaki, C. (2011). Towards BioDBcore: a community-defined information specification for biological databases. Nucleic acids research, 39(Database issue):D7–10.
[Ives 2008] Ives, Z. G., Knoblock, C. A., Minton, S., Jacob, M., Pratim, P., Rattapoom, T., Luis, J., Maria, A., and Cenk, M. (2008). Interactive Data Integration through Smart Copy & Paste. In International Conference on Applications of Declarative Programming and Knowledge Management, Evora, Portugal.
[Kuhn 1962] Kuhn, T. (1962). The Structure of Scientific Revolutions. University of Chicago Press, Chicago, IL.
[Langley 1987] Langley, P., Simon, H. A., Bradshaw, G. L., and Zytkow, J. M. (1987). Scientific Discovery: Computational Explorations of the Creative Processes. MIT Press, Cambridge, MA.
[Mulkar 2007] Mulkar, R., Hobbs, J. R., Hovy, E., Chalupsky, H., and Lin, C.-Y. (2007). Learning by reading: Two experiments. In 3rd International Workshop on Knowledge and Reasoning for Question Answering, Hyderabad, India.
[Russ 2011] Russ, T., Ramakrishnan, C., Hovy, E., Bota, M., and Burns, G. (2011). Knowledge Engineering Tools for Reasoning with Scientific Observations and Interpretations: a Neural Connectivity Use Case. BMC Bioinformatics, 12(1):351.
[Saper 2010] Saper, C. B., Fuller, P. M., Pedersen, N. P., Lu, J., and Scammell, T. E. (2010). Sleep state switching. Neuron, 68(6):1023–42.
[Soldatova 2011] Soldatova, L. N. and Rzhetsky, A. (2011). Representation of research hypotheses. Journal of biomedical semantics, 2 Suppl 2(Suppl 2):S9.