When prompt perturbations break your A/B test: A valid statistical test for generative surveying

ArXiv CS.AI | 2026-05-28 | Authors: Hayden Helm, Carey Priebe

Abstract

arXiv:2605.27463v1 (cross-listed). Generative surveying, in which collections of LLM-based personas provide feedback on messages, has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design, so conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations of the prompt in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and the Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model, and we formally characterize the conditions under which the standard tests fail.
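The abstract does not spell out the proposed test, so the following is only a minimal sketch of the general idea behind a permutation test that respects grouping, not the authors' procedure. It assumes, hypothetically, that paired score differences (message A minus message B) are arranged in clusters of observations that share a source of dependence, and that under the null each cluster's differences are jointly symmetric about zero. The function name `cluster_sign_flip_test` and the data layout are illustrative assumptions.

```python
import random
from statistics import fmean


def cluster_sign_flip_test(diffs, n_resamples=5000, seed=0):
    """Monte Carlo sign-flip permutation test at the cluster level.

    diffs: list of clusters, each a list of paired differences (A minus B).
    ASSUMPTION (not from the paper): observations within a cluster may be
    dependent, so the sign of a whole cluster is flipped jointly instead of
    flipping each observation independently.
    Returns (observed_statistic, two_sided_p_value).
    """
    rng = random.Random(seed)
    # Collapse each cluster to its mean difference.
    cluster_means = [fmean(row) for row in diffs]
    observed = fmean(cluster_means)

    exceed = 0
    for _ in range(n_resamples):
        # Flip the sign of each cluster mean with probability 1/2.
        flipped = [m if rng.random() < 0.5 else -m for m in cluster_means]
        # Small tolerance guards against floating-point ties.
        if abs(fmean(flipped)) >= abs(observed) - 1e-12:
            exceed += 1

    # The +1 correction counts the observed arrangement itself,
    # so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_resamples + 1)
```

Flipping signs per cluster keeps the effective sample size at the number of clusters. A sign test or Wilcoxon signed-rank test applied to the pooled observations would instead treat every observation as independent, which is one way such tests can overstate the evidence when responses are correlated.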
