A noise audit of human-labeled benchmarks for machine commonsense reasoning
Abstract
With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing a system's predictions against human labels (or a single 'ground truth'). However, much recent work in psychology has suggested that most tasks involving significant human judgment can exhibit non-trivial degrees of noise. In his book, Kahneman suggests that noise may be a much more significant component of inaccuracy than bias, which has been studied more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale …
- Date: April 14, 2024
- Authors: Mayank Kejriwal, Henrique Santos, Ke Shen, Alice M. Mulvehill, Deborah L. McGuinness
- Journal: Scientific Reports
- Volume: 14
- Issue: 1
- Pages: 8609
- Publisher: Nature Publishing Group UK