Publications
Constructing Flexible, Configurable, ETL Pipelines for the Analysis of "Big Data" with Apache OODT
Abstract
A plethora of open source technologies for manipulating, transforming, querying, and visualizing 'big data' have blossomed and matured in the last few years, driven in large part by recognition of the tremendous value that can be derived by leveraging data mining and visualization techniques on large data sets. One facet of many of these tools is that input data must often be prepared into a particular format (e.g., JSON, CSV), or loaded into a particular storage technology (e.g., HDFS) before analysis can take place. This process, commonly known as Extract-Transform-Load, or ETL, often involves multiple well-defined steps that must be executed in a particular order, and the approach taken for a particular data set is generally sensitive to the quantity and quality of the input data, as well as the structure and complexity of the desired output. When working with very large, heterogeneous, unstructured or semi-structured …
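The extract-transform-load pattern the abstract describes can be sketched as a minimal, configurable pipeline of ordered steps. This is an illustrative sketch only, not Apache OODT's actual API; the `extract`, `transform`, `load`, and `run_pipeline` names and the sample CSV-to-JSON flow are assumptions chosen for the example.

```python
# Minimal sketch of a configurable ETL pipeline: well-defined steps
# executed in a fixed order, as described in the abstract.
# Names here are hypothetical and not part of Apache OODT's API.
import csv
import io
import json

def extract(raw_csv):
    """Extract: parse raw CSV text into row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: normalize field names and cast value types."""
    return [{"name": r["Name"].strip(), "value": int(r["Value"])} for r in rows]

def load(records):
    """Load: serialize to JSON (a stand-in for a storage backend such as HDFS)."""
    return json.dumps(records)

def run_pipeline(data, steps):
    """Run each configured step in order, feeding each output to the next."""
    for step in steps:
        data = step(data)
    return data

result = run_pipeline("Name,Value\nalpha,1\nbeta,2\n", [extract, transform, load])
```

Because the step list is just data, the same driver can be reconfigured per data set, which is the flexibility the abstract emphasizes.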
- Date
- 2013
- Authors
- AF Hart, CA Mattmann, P Ramirez, R Verma, PA Zimdars, S Park, A Estrada, A Sumarlidason, Y Gil, V Ratnakar, D Krum, T Phan, A Meena
- Journal
- AGU Fall Meeting Abstracts
- Volume
- 2013
- Pages
- IN43C-07