Publications
Constructing Flexible, Configurable, ETL Pipelines for the Analysis of "Big Data" with Apache OODT
Abstract
A plethora of open source technologies for manipulating, transforming, querying, and visualizing 'big data' have blossomed and matured in the last few years, driven in large part by recognition of the tremendous value that can be derived by leveraging data mining and visualization techniques on large data sets. One facet of many of these tools is that input data must often be prepared into a particular format (e.g., JSON, CSV), or loaded into a particular storage technology (e.g., HDFS) before analysis can take place. This process, commonly known as Extract-Transform-Load, or ETL, often involves multiple well-defined steps that must be executed in a particular order, and the approach taken for a particular data set is generally sensitive to the quantity and quality of the input data, as well as the structure and complexity of the desired output. When working with very large, heterogeneous, unstructured or semi-structured …
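The extract-transform-load pattern the abstract describes can be sketched as a minimal, configurable pipeline of ordered steps. This is an illustrative sketch only, not Apache OODT's actual API; the `extract`, `transform`, `load`, and `run_pipeline` names and the sample CSV-to-JSON flow are assumptions chosen for the example.

```python
# Minimal sketch of a configurable ETL pipeline: well-defined steps
# executed in a fixed order, as described in the abstract.
# Names here are hypothetical and not part of Apache OODT's API.
import csv
import io
import json

def extract(raw_csv):
    """Extract: parse raw CSV text into row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: normalize field names and cast value types."""
    return [{"name": r["Name"].strip(), "value": int(r["Value"])} for r in rows]

def load(records):
    """Load: serialize to JSON (a stand-in for a storage backend such as HDFS)."""
    return json.dumps(records)

def run_pipeline(data, steps):
    """Run each configured step in order, feeding each output to the next."""
    for step in steps:
        data = step(data)
    return data

result = run_pipeline("Name,Value\nalpha,1\nbeta,2\n", [extract, transform, load])
```

Because the step list is just data, the same driver can be reconfigured per data set, which is the flexibility the abstract emphasizes.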
- Date
- 2013
- Authors
- AF Hart, CA Mattmann, P Ramirez, R Verma, PA Zimdars, S Park, A Estrada, A Sumarlidason, Y Gil, V Ratnakar, D Krum, T Phan, A Meena
- Journal
- AGU Fall Meeting Abstracts
- Volume
- 2013
- Pages
- IN43C-07