Publications

On Causal and Anticausal LLM-based Data Synthesis

Abstract

While Large Language Models (LLMs) have been increasingly used to generate synthetic data for various downstream tasks, researchers have largely overlooked the causal direction of the data synthesis process. A natural causal direction consists of two steps: diverse raw data are generated first and subsequently annotated for downstream tasks. However, most LLM-based methods adopt an anticausal direction: embedding label information in the prompt to force LLMs to generate targeted data. This reversal raises a critical question: how does the direction of data synthesis affect the quality and utility of the synthetic data? In this work, we empirically study the impact of causal and anticausal data synthesis. To do so, we first design simple yet effective prompting strategies to control the causal direction of LLM-based data synthesis. Using GPT-5 as the data generator, we construct synthetic datasets for three distinct machine …
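The two synthesis directions contrasted in the abstract can be sketched as prompt templates. This is a minimal illustrative sketch, not the paper's actual implementation: the function names, prompt wording, and task/label parameters are all assumptions.

```python
# Hypothetical sketch of causal vs. anticausal prompting directions.
# All names and prompt wording are illustrative assumptions, not the
# paper's actual prompts.

def causal_prompts(task: str, labels: list[str]) -> tuple[str, str]:
    """Causal direction: two steps. First elicit diverse raw text
    without any label information, then annotate it in a second call."""
    generation = (
        f"Write one diverse, realistic piece of text for the task: {task}."
    )
    annotation = (
        f"Assign exactly one label from {labels} to the text below "
        f"for the task: {task}.\n"
        "Text: {text}\n"
        "Label:"
    )
    return generation, annotation


def anticausal_prompt(task: str, label: str) -> str:
    """Anticausal direction: one step. The target label is embedded in
    the prompt, so the LLM generates text conditioned on the label."""
    return (
        f"Write one realistic piece of text for the task: {task} "
        f"that should be labeled '{label}'."
    )
```

Under this sketch, the causal direction keeps label information out of the generation prompt entirely, while the anticausal prompt conditions generation on the label from the start.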

Date: 2026
Authors: Bohan Jiang, Pingchuan Ma, Zhuoyu Shi, Fred Morstatter, Adrienne Raglin, Huan Liu
Book: Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining
Pages: 1160-1164