Publications

A noise audit of human-labeled benchmarks for machine commonsense reasoning

Abstract

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing a system's predictions against human labels (or a single ‘ground truth’). However, recent work in psychology suggests that most tasks involving significant human judgment exhibit non-trivial noise. In his book Noise, Kahneman argues that noise may be a far more significant source of inaccuracy than bias, which has been studied much more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale …
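To make the notion of label noise concrete, the sketch below computes a simple pairwise-disagreement score over multiply-annotated items. This is an illustrative measure only, not the audit methodology of the paper; the function name, data layout, and example labels are all hypothetical.

```python
from collections import Counter

def label_noise(annotations):
    """Mean pairwise annotator disagreement across items (0 = perfect agreement).

    `annotations` maps an item id to the list of labels assigned by
    independent annotators. For each item, disagreement is the fraction
    of annotator pairs whose labels differ; the mean over items is a
    crude noise score for the benchmark.
    """
    scores = []
    for labels in annotations.values():
        n = len(labels)
        if n < 2:
            continue  # need at least two annotators to measure disagreement
        pairs = n * (n - 1) / 2
        # Number of agreeing pairs: sum over labels of C(count, 2).
        agree = sum(c * (c - 1) / 2 for c in Counter(labels).values())
        scores.append(1 - agree / pairs)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical labels for three commonsense questions, five annotators each.
example = {
    "q1": ["yes", "yes", "yes", "yes", "yes"],  # unanimous: no noise
    "q2": ["yes", "yes", "yes", "no", "no"],    # 3-2 split
    "q3": ["yes", "no", "yes", "no", "yes"],    # 3-2 split
}
print(round(label_noise(example), 3))  # → 0.4
```

A score of 0 indicates unanimous labeling on every item; values approaching 1 indicate labels that look close to random, which is the regime where treating a single human label as ground truth becomes questionable.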

Date
April 14, 2024
Authors
Mayank Kejriwal, Henrique Santos, Ke Shen, Alice M Mulvehill, Deborah L McGuinness
Journal
Scientific Reports
Volume
14
Issue
1
Pages
8609
Publisher
Nature Publishing Group UK