Here, we focus on using the KEfED approach to provide a data management system for an individual laboratory, a research community, or even a funding agency.
The phrase ‘dark data’ draws a parallel between how the majority of mass in the universe is unobservable (as dark matter or dark energy, Cho 2010) and how the majority of scientific data is inaccessible, locked away in the file cabinets or disk drives of individual researchers (Heidorn 2008). Making this data accessible is both technically and sociologically challenging, and it has been a goal of funding agencies for over a decade (Koslow 2000), as evidenced by the emergence of many data management systems (Fernández-Suárez and Galperin 2013) and by the fact that data-sharing plans are required as a condition for funding (NIH data sharing policy). One reason why data generated in laboratories remains ‘dark’ is that no tools exist that routinely model and capture data at the earliest stage of its lifecycle in the laboratory. An underlying challenge is that existing schemas for scientific informatics are designed by systems administrators or computer programmers; they do not necessarily reflect the statistical design of experiments in a way that scientists themselves can understand. The core contribution of our proposed project addresses this by providing an open-source, practical data repository that closely matches the underlying data structure generated by each experiment’s protocol, in a way that is also understandable to scientists.
As described, there are many efforts to build ‘electronic lab notebook’ (ELN) systems, both in the academic and commercial arenas. This fact, coupled with the observation that uptake of this technology remains limited, indicates that the underlying problem is still unsolved. Spreadsheets (e.g., Microsoft Excel) are still the most commonly used data management solution because they fit the immediate needs of bench scientists very well and have a shallow learning curve. Excel is, however, disastrous in terms of supporting data standards, computational integration, long-term storage, scalability and tracking data provenance. Some important criteria to address are: (A) Scientists need practical tools that work ‘out of the box’. (B) The technical design of the informatics system should leverage researchers’ statistical expertise. (C) The system should be ontologically enabled. (D) The system should enable the creation of supplemental data for publication. (E) The system should interoperate with other data repositories. This proposal addresses all of these criteria in the context of an influential and highly productive community.
- A. Cho. A recipe for the cosmos. Science, 330(6011):1615, 2010.
- P. B. Heidorn. Shedding light on the dark data in the long tail of science. Library Trends, 57(2):280-299, 2008.
- S. H. Koslow. Should the neuroscience community make a paradigm shift to sharing primary data? Nature Neuroscience, 3(9):863-865, 2000.
- X. M. Fernández-Suárez and M. Y. Galperin. The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Research, 41(Database issue):D1-D7, 2013. PMID: 23203983. http://dx.doi.org/10.1093/nar/gks1297.
- NIH data sharing policy. http://grants.nih.gov/grants/policy/data_sharing/.