Efficient Processing of Streaming Data using Multiple Abstractions
Abdul Qadeer and John Heidemann

Citation

Abdul Qadeer and John Heidemann. Efficient Processing of Streaming Data using Multiple Abstractions. Proceedings of the IEEE International Conference on Cloud Computing (Virtual, Sep. 2021), 157–167. [DOI] [PDF] [alt PDF]

Abstract

Large websites and distributed systems employ sophisticated analytics to evaluate successes to celebrate and problems to be addressed. As analytics grow, different teams often require different frameworks, with dozens of packages supporting streaming and batch processing, SQL and NoSQL. Bringing multiple frameworks to bear on a large, changing dataset often creates challenges where data transitions between them; these impedance mismatches can create brittle glue logic and performance problems that consume developer time. We propose Plumb, a meta-framework that can bridge three different abstractions to meet the needs of a large class of applications in a common workflow. Large-block streaming (Block-Streaming) is suitable for single-pass applications that care about temporal and spatial locality. Windowed-Streaming allows applications to process data in groups and to apply many reductions. Stateful-Streaming enables applications to keep long-term state and always-on behavior. We show that it is possible to bridge abstractions with a common, high-level workflow specification, while the system transitions data between batch processing and block- and record-level streaming as required. The challenge in bridging abstractions is to minimize latency while allowing applications to select between sequential and parallel operation, handling out-of-order data delivery and component failures, and providing clear semantics in the face of missing data. We demonstrate these abstractions by evaluating a 10-stage workflow of DNS analytics that has been in production use with Plumb for 2 years, comparing it to a brittle hand-built system that has run for more than 3 years.
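
As a rough illustration of how the three abstractions named in the abstract differ, the following Python sketch contrasts them on a toy stream of DNS query names. It is not Plumb's API: the function names, the `StatefulCounter` class, and the sample data are all hypothetical, chosen only to show block-at-a-time, window-at-a-time, and record-at-a-time-with-state processing side by side.

```python
from collections import Counter
from itertools import islice


def block_streaming(blocks, process_block):
    """Single pass over large blocks: each block is processed once, in arrival order."""
    for block in blocks:
        yield process_block(block)


def windowed_streaming(records, window_size, reduce_window):
    """Group records into fixed-size windows and apply a reduction to each window."""
    it = iter(records)
    while True:
        window = list(islice(it, window_size))
        if not window:
            return
        yield reduce_window(window)


class StatefulCounter:
    """Record-level processing that keeps long-term state across the whole stream."""

    def __init__(self):
        self.counts = Counter()

    def update(self, record):
        self.counts[record] += 1
        return self.counts[record]


# Hypothetical sample data: two blocks of DNS query names.
blocks = [["a.example", "b.example"], ["a.example"]]
records = [name for block in blocks for name in block]

block_sizes = list(block_streaming(blocks, len))
unique_per_window = list(windowed_streaming(records, 2, lambda w: len(set(w))))
state = StatefulCounter()
running_counts = [state.update(name) for name in records]
```

In this sketch the block stage sees whole blocks, the windowed stage reduces every two records to one value, and the stateful stage carries its counter across all records; a real workflow would additionally need to handle out-of-order delivery and failures, which this example ignores.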

Bibtex Citation

@inproceedings{Qadeer21b,
  author = {Qadeer, Abdul and Heidemann, John},
  title = {Efficient Processing of Streaming Data using Multiple Abstractions},
  booktitle = {Proceedings of the IEEE International Conference on Cloud Computing},
  year = {2021},
  sortdate = {2021-09-05},
  project = {ant, lacanic, gawseed},
  jsubject = {network_big_data},
  pages = {157--167},
  note = {Special paper award},
  month = sep,
  address = {Virtual},
  publisher = {IEEE},
  jlocation = {johnh: pafile},
  keywords = {big data, hadoop, plumb, DNS, streaming data,
                    data processing, workflow},
  url = {https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.html},
  pdfurl = {https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.pdf},
  doi = {https://doi.org/10.1109/CLOUD53861.2021.00029},
  blogurl = {https://ant.isi.edu/blog/?p=1760}
}
Copyright © by John Heidemann