Publications
BDQC: a general-purpose analytics tool for domain-blind validation of Big Data
Abstract
Translational biomedical research is generating exponentially more data: thousands of whole-genome sequences (WGS) are now available; brain data are doubling every two years. Analyses of Big Data, including imaging, genomic, phenotypic, and clinical data, present qualitatively new challenges as well as opportunities. Among the challenges is a proliferation in the ways analyses can fail, due largely to the increasing length and complexity of processing pipelines. Anomalies in input data, runtime resource exhaustion, or node failure in a distributed computation can all cause pipeline hiccups that are not necessarily obvious in the output. Flaws that can taint results may persist undetected in complex pipelines, a danger amplified by the fact that research is often concurrent with the development of the software on which it depends. On the positive side, the huge sample sizes increase statistical power, which in turn can yield new insight and motivate innovative analytic approaches. We have developed a framework for Big Data Quality Control (BDQC) including an extensible set of heuristic and statistical analyses that identify deviations in data without regard to its meaning (domain-blind analyses). BDQC takes advantage of large sample sizes to classify the samples, estimate distributions, and identify outliers. Such outliers may be symptoms of technology failure (e.g., truncated output of one step of a pipeline for a single genome) or may reveal unsuspected “signal” in the data (e.g., evidence of aneuploidy in a genome). We have applied the framework to validate real-world WGS analysis pipelines. BDQC successfully identified data outliers …
- Date: 2018
- Authors: Eric W Deutsch, Roger Kramer, Joseph Ames, Andrew Bauman, David S Campbell, Kyle Chard, Kristi Clark, Mike D’Arcy, Ivo D Dinov, Rory Donovan, Ian Foster, Benjamin D Heavner, Leroy E Hood, Carl Kesselman, Ravi Madduri, Huaiyu Mi, Anushya Muruganujan, Judy Pa, Nathan D Price, Max Robinson, Farshid Sepehrband, Arthur W Toga, John Van Horn, Lu Zhao, Gustavo Glusman
- Journal: bioRxiv
- Pages: 258822
- Publisher: Cold Spring Harbor Laboratory
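
The abstract describes estimating distributions across many samples and flagging outliers without regard to what the data mean. A minimal sketch of that general idea follows. It is not BDQC's code or its actual heuristics: the function name `mad_outliers`, the 3.5 threshold, and the file sizes are all hypothetical, and the median/MAD modified z-score is a standard robust statistic swapped in for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Illustrative only: uses the median and the median absolute deviation
    (MAD), which are robust to the very outliers being searched for.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, v in enumerate(values) if v != med]
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # under a normal distribution.
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical per-genome output file sizes in bytes; index 4 mimics a
# truncated output from one pipeline step.
sizes = [1_000_200, 1_000_950, 999_800, 1_001_100, 412_000, 1_000_400, 999_500]
flagged = mad_outliers(sizes)
print(flagged)
```

The check is domain-blind in the sense used above: it looks only at a generic per-file summary (here, size) and needs no knowledge of the file's contents.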