Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

ArXiv CS.CL | 2026-06-05 | NEWS | en
Authors: Govind Ramesh, Yao Dou, Wei Xu

Abstract

arXiv:2606.05486v1 (announce type: new)

Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such as logits or generated tokens. We introduce PRIG, a gradient attribution method that uses a probe logit to attribute latent ambiguity to token positions. Specifically, PRIG trains a linear probe to distinguish clear prompts from ambiguous prompts and attributes the probe score to earlier token representations in the residual stream. To enable token-level evaluation, we construct synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, and complement them with a human-written gold benchmark. In this setting, PRIG localizes ambiguous spans substantially better than gradient attribution baselines, achieving 0.
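
The idea of attributing a probe logit back to token positions can be illustrated with a small, self-contained sketch. This is not the paper's PRIG implementation: the toy "model" (mean-pool the token vectors, then apply tanh), the probe weights `w` and `b`, the token vectors, and all function names are hypothetical, and the attribution rule used here is plain gradient-times-input, chosen only because it is easy to verify by hand. The probe is assumed to be already trained.

```python
import math

def final_state(tokens):
    """Toy stand-in for a model: mean-pool token vectors, then apply tanh."""
    n_tokens, dim = len(tokens), len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / n_tokens for i in range(dim)]
    return [math.tanh(p) for p in pooled]

def probe_logit(tokens, w, b):
    """Linear probe on the final state; higher means 'more ambiguous' (assumed)."""
    h = final_state(tokens)
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def token_attributions(tokens, w, b):
    """Gradient-times-input of the probe logit with respect to each token vector."""
    n_tokens = len(tokens)
    h = final_state(tokens)
    # d(logit)/d(tokens[t][i]) = w[i] * (1 - h[i]^2) / n_tokens, the same for every t
    grad = [wi * (1.0 - hi * hi) / n_tokens for wi, hi in zip(w, h)]
    # Dot the shared gradient with each token vector to get one score per position
    return [sum(g * x for g, x in zip(grad, tok)) for tok in tokens]

# Hypothetical probe weights and early-layer token representations
w = [2.0, -1.0]
b = 0.0
tokens = [[0.1, 0.0], [1.5, -0.5], [0.0, 0.2]]

scores = token_attributions(tokens, w, b)
top = max(range(len(scores)), key=scores.__getitem__)
print("per-token attribution:", [round(s, 3) for s in scores])
print("most ambiguous position (toy):", top)
```

In this toy setting the position whose representation aligns most with the probe direction receives the largest score. With a real model, the analytic gradient would be replaced by automatic differentiation through the layers between the chosen residual-stream position and the probe.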
