In today’s information society, we are inundated with natural language text data (e.g., news, social media posts, and research papers). A grand challenge for text mining and natural language processing (NLP) researchers is to develop effective and scalable methods that digest such massive unstructured text corpora and turn them into structures from which actionable knowledge can be generated according to users’ needs. My research focuses on minimizing the human effort required to structure massive text data while retaining high-quality results. In this talk, we will go through a series of weakly supervised and unsupervised text mining methods, with a focus on phrase mining and document classification. Specifically, we will try to answer the following two questions: (1) Can we extract domain-specific, emerging, infrequent phrases from massive text data without any human annotation? (2) Can we classify a large collection of documents using only the natural-language class names?

Jingbo Shang is an Assistant Professor in the Computer Science and Engineering Department and the Halicioglu Data Science Institute at the University of California, San Diego. He obtained his Ph.D. from the University of Illinois at Urbana-Champaign in 2019 and his B.E. from Shanghai Jiao Tong University in 2014. His research focuses on data mining, natural language processing, and machine learning methods that require minimal human effort, and on their applications. His research has been recognized with many prestigious awards, including the Grand Prize of the Yelp Dataset Challenge in 2015, a Google Ph.D. Fellowship in Structured Data and Database Management in 2017, the SIGKDD Dissertation Award Runner-up in 2020, and a Google Research Scholar award in 2021.

Host: Muhao Chen, POC: Pete Zamar

