Publications

Parallel Clustering of High-Dimensional Social Media Data Streams

Abstract

We introduce Cloud DIKW (Data, Information, Knowledge, Wisdom) as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. In this context, recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. However, due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts in meeting the constraints of real-time social media stream clustering through parallelization in Cloud DIKW. Specifically, we focus on two system-level issues. Firstly, most stream processing engines such as Apache Storm organize …

Date
October 18, 2025
Authors
Xiaoming Gao, Emilio Ferrara, Judy Qiu
Conference
CCGrid 2015: 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Pages
323-332
Publisher
IEEE/ACM