Publications

A noise audit of human-labeled benchmarks for machine commonsense reasoning

Abstract

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing a system's predictions against human labels (or a single ‘ground truth’). However, recent work in psychology suggests that most tasks involving significant human judgment exhibit non-trivial noise. In his book Noise, Kahneman argues that noise may be a far more significant source of inaccuracy than bias, which has been studied much more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale …
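To make the notion of label noise concrete, the sketch below computes a simple pairwise-disagreement score over multiply-annotated items. This is an illustrative measure only, not the audit methodology of the paper; the function name, data layout, and example labels are all hypothetical.

```python
from collections import Counter

def label_noise(annotations):
    """Mean pairwise annotator disagreement across items (0 = perfect agreement).

    `annotations` maps an item id to the list of labels assigned by
    independent annotators. For each item, disagreement is the fraction
    of annotator pairs whose labels differ; the mean over items is a
    crude noise score for the benchmark.
    """
    scores = []
    for labels in annotations.values():
        n = len(labels)
        if n < 2:
            continue  # need at least two annotators to measure disagreement
        pairs = n * (n - 1) / 2
        # Number of agreeing pairs: sum over labels of C(count, 2).
        agree = sum(c * (c - 1) / 2 for c in Counter(labels).values())
        scores.append(1 - agree / pairs)
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical labels for three commonsense questions, five annotators each.
example = {
    "q1": ["yes", "yes", "yes", "yes", "yes"],  # unanimous: no noise
    "q2": ["yes", "yes", "yes", "no", "no"],    # 3-2 split
    "q3": ["yes", "no", "yes", "no", "yes"],    # 3-2 split
}
print(round(label_noise(example), 3))  # → 0.4
```

A score of 0 indicates unanimous labeling on every item; values approaching 1 indicate labels that look close to random, which is the regime where treating a single human label as ground truth becomes questionable.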

Date
April 14, 2024
Authors
Mayank Kejriwal, Henrique Santos, Ke Shen, Alice M Mulvehill, Deborah L McGuinness
Journal
Scientific Reports
Volume
14
Issue
1
Pages
8609
Publisher
Nature Publishing Group UK