Publications
Reducing Label Cost by Combining Feature Labels and Crowdsourcing
Abstract
Decreasing technology costs, increasing computational power, and ubiquitous network connectivity are contributing to an unprecedented increase in the amount of publicly available data. Yet this surge of data has not been accompanied by a complementary increase in annotation. This lack of annotated data complicates data mining tasks in which supervised learning is preferred or required. In response, researchers have proposed many approaches to cheaply construct training sets. One approach, referred to as feature labels (McCallum & Nigam, 1999), chooses features that strongly correlate with the label space and uses instances containing those features as labeled data for training a classifier. These high-precision examples help bootstrap the learning process. Another technique, crowdsourcing, exploits our ever-increasing connectivity to request annotation from a broader community (who may or may not be domain experts), thereby refining and expanding the labeled data. Combining these techniques provides a means to obtain supervision from large, unlabeled data sources. In this paper, we investigate using active learning to combine these approaches in a unified framework, which we call active bootstrapping. We show that this technique produces more reliable labels than either approach individually, resulting in a better classifier at mini-
- Date
- March 5, 2026
- Authors
- Jay Pujara, Ben London, Lise Getoor
- Journal
- ICML Workshop on Combining Learning Strategies to Reduce Label Cost
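
The feature-label idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `label_by_features`, the seed dictionary, and the example sentences are all hypothetical, and the rule for skipping documents whose features disagree is an assumption made here for simplicity. It shows only how instances containing a labeled feature might be collected into a bootstrap training set.

```python
def label_by_features(documents, feature_labels):
    """Build a bootstrap training set from labeled features.

    documents: list of strings.
    feature_labels: dict mapping a token to a class label
        (a hypothetical, hand-picked seed list).
    Returns (document, label) pairs. Documents with no labeled feature,
    or with features that point to different labels, are left out
    (an assumption of this sketch, not a rule taken from the paper).
    """
    labeled = []
    for doc in documents:
        tokens = set(doc.lower().split())
        labels = {feature_labels[t] for t in tokens if t in feature_labels}
        if len(labels) == 1:
            labeled.append((doc, labels.pop()))
    return labeled


# Hypothetical example data.
docs = [
    "The goalie blocked the puck",
    "The pitcher threw a curveball",
    "The puck hit the pitcher",
    "Tickets go on sale Friday",
]
seeds = {
    "puck": "hockey",
    "goalie": "hockey",
    "pitcher": "baseball",
    "curveball": "baseball",
}

bootstrap_set = label_by_features(docs, seeds)
for doc, label in bootstrap_set:
    print(label, "<-", doc)
```

In this sketch the third sentence is skipped because its features disagree, and the fourth because it contains no labeled feature. Ambiguous or uncovered instances of this kind are plausible candidates to send to crowd annotators, which is the gap the abstract proposes to fill with active learning.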