Abstract:
Natural language processing (NLP) tools have become ubiquitous for data analysis in digital environments such as the Web and social media. Popular applications include clustering, sequence labeling, and machine translation, to name a few. Unfortunately, the majority of existing toolkits rely on supervised learning to train models using labeled data. This poses several challenges: labeled data is not readily available in all languages or domains, and building an NLP system from scratch for a new domain (or language, user, etc.) requires significant human effort, which is both time-consuming and expensive. Moreover, scaling this strategy to the Web is infeasible.
Recent advances in unsupervised algorithms have demonstrated promising results on several NLP tasks without using any labeled data. Despite their utility, however, scalable unsupervised algorithms rarely provide probabilistic representations of the data, which are useful for making predictions on unseen data or for integration as components of a larger model or pipeline. In addition, these methods often favor simple model descriptions (e.g., the k-means algorithm for clustering) at the expense of rich statistical models. This leads to rapidly diminishing returns when applying these methods to increasing amounts of data. Instead, we need to design algorithms that scale elegantly to large data as well as complex models.
In this talk, I will present our recent work on scalable probabilistic learning with Bayesian inference. We present a novel algorithm for fitting mixtures of exponential families, a class that generalizes several models typically used in NLP and other areas. A major contribution of our work is a novel sampling method that uses locality-sensitive hashing to generate proposals at high throughput during sampling. Using "clustering" as an example application, I will describe our approach and show that it scales elegantly to large numbers of clusters, achieving a speedup of several orders of magnitude over existing toolkits while maintaining high clustering quality. In addition, we prove probabilistic error guarantees for the new sampling algorithm. This is joint work with Amr Ahmed and Alex Smola. Lastly, I will briefly mention some ongoing work on large-scale unsupervised learning for other NLP applications such as machine translation.
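To make the proposal idea concrete, below is a minimal illustrative sketch in Python (not the authors' actual algorithm or implementation): random-hyperplane locality-sensitive hashing is used to shortlist candidate clusters so that each sampling step only scores a handful of "nearby" clusters rather than all of them. The function names, the Gaussian scoring, and the parameter choices are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative sketch only: LSH-based proposal generation for cluster assignment.
rng = np.random.default_rng(0)

def lsh_signature(v, hyperplanes):
    """Bit signature of a vector under random-hyperplane LSH."""
    return tuple((hyperplanes @ v) > 0)

def build_index(cluster_means, hyperplanes):
    """Hash each cluster mean into a bucket keyed by its LSH signature."""
    index = {}
    for k, mu in enumerate(cluster_means):
        index.setdefault(lsh_signature(mu, hyperplanes), []).append(k)
    return index

def propose_cluster(x, cluster_means, index, hyperplanes):
    """Propose a cluster for point x from its LSH bucket (fallback: all clusters)."""
    candidates = index.get(lsh_signature(x, hyperplanes), [])
    if not candidates:                      # empty bucket: consider every cluster
        candidates = list(range(len(cluster_means)))
    # Score only the shortlisted clusters (Gaussian log-likelihoods, illustrative).
    scores = np.array([-0.5 * np.sum((x - cluster_means[k]) ** 2) for k in candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy usage: 1000 cluster means in 32 dimensions, 16 hash bits.
d, K = 32, 1000
means = rng.normal(size=(K, d))
planes = rng.normal(size=(16, d))
index = build_index(means, planes)
x = means[42] + 0.01 * rng.normal(size=d)
print(propose_cluster(x, means, index, planes))  # very likely proposes cluster 42
```

In this toy setting the cost of a proposal depends on the bucket size rather than on the total number of clusters, which is the kind of per-step saving that makes sampling with very large numbers of clusters tractable.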
Bio:
Sujith Ravi is a Research Scientist at Google. He completed his PhD at the University of Southern California / Information Sciences Institute and joined Yahoo! Research in Santa Clara as a Research Scientist before joining Google in Mountain View in 2012. His main research interests span various problems and theory related to the fields of Natural Language Processing (NLP) and Machine Learning. He is specifically interested in large-scale unsupervised and semi-supervised methods and their applications to structured prediction problems in NLP, information extraction, user modeling in social media, graph optimization algorithms for summarizing noisy data, computational decipherment, and computational advertising. His work has been covered by outlets such as New Scientist and ACM TechNews. For more information, you can visit his personal page (http://www.sravi.org).