Combating Data Laundering in LLM Training

ArXiv CS.AI, 2026-05-29. Authors: Muxing Li, Zesheng Ye, Sharon Li, Feng Liu

Abstract

arXiv:2604.01904v2. Data rights owners can detect unauthorized data use in large language model (LLM) training by querying the model with proprietary samples. Often, superior performance on a sample (e.g., higher confidence or lower loss) relative to data the model was not trained on suggests that the sample was part of the training corpus, since LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile under data laundering: the practice of transforming the stylistic form of proprietary data while preserving its critical information, in order to obfuscate data provenance. When an LLM is trained exclusively on such laundered variants, it no longer performs better on the originals, erasing the signals that standard detection methods rely on.
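The loss-based signal described in the abstract can be illustrated with a minimal sketch. It assumes per-sample losses have already been obtained by querying the model; the function names (`auc_lower_loss`, `loss_gap`) and all loss values are hypothetical and chosen only for illustration, not taken from the paper. The score is the probability that a proprietary sample has lower loss than a reference sample known to be unseen: values near 1.0 suggest membership, values near 0.5 mean no detectable signal, which is the situation the abstract attributes to laundering.

```python
from statistics import mean


def auc_lower_loss(candidate_losses, reference_losses):
    """Probability that a random candidate sample has lower loss than a
    random reference (unseen) sample. Ties count as half a win."""
    wins = 0.0
    for c in candidate_losses:
        for r in reference_losses:
            if c < r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(candidate_losses) * len(reference_losses))


def loss_gap(candidate_losses, reference_losses):
    """Mean reference loss minus mean candidate loss (positive = candidate fits better)."""
    return mean(reference_losses) - mean(candidate_losses)


# Toy, made-up loss values (assumption: not measured on any real model).
reference = [2.9, 3.1, 3.0, 3.2]        # data known to be outside training
seen_originals = [1.8, 2.0, 2.1, 1.9]   # originals the model was trained on
laundered_case = [3.0, 3.1, 2.9, 3.2]   # originals when only laundered variants were trained on

print(auc_lower_loss(seen_originals, reference))   # 1.0: clear membership signal
print(auc_lower_loss(laundered_case, reference))   # 0.5: signal erased
```

In the second case the originals are indistinguishable from unseen data under this score, which is why a detector built only on loss or confidence gaps fails once the training data has been laundered.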
