Combating Data Laundering in LLM Training (Event)

Type: PRODUCT_LAUNCH | Date: 2026-05-29 | Impact: MEDIUM

arXiv:2604.01904v2. Announce Type: replace-cross.

Abstract: Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, superior performance (e.g., higher confidence or lower loss) on a sample relative to the untrained data implies it was part of the training corpus, as LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile unde
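
The loss-comparison idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's method. It assumes the per-sample losses have already been obtained by querying the model, and it swaps in a plain Welch-style t statistic as the comparison. The function names `loss_gap_score` and `flag_possible_training_use` and the default threshold of 2.0 are hypothetical choices made only for this illustration.

```python
from math import sqrt
from statistics import mean, stdev


def loss_gap_score(candidate_losses, reference_losses):
    """Welch-style t statistic for the gap in mean loss.

    Positive values mean the candidate (proprietary) samples have a lower
    mean loss than the reference samples assumed to be unseen by the model.
    """
    n_c, n_r = len(candidate_losses), len(reference_losses)
    if n_c < 2 or n_r < 2:
        raise ValueError("need at least two losses in each group")

    gap = mean(reference_losses) - mean(candidate_losses)
    std_err = sqrt(
        stdev(candidate_losses) ** 2 / n_c + stdev(reference_losses) ** 2 / n_r
    )
    if std_err == 0:
        # Degenerate case: no variance in either group.
        return float("inf") if gap > 0 else 0.0
    return gap / std_err


def flag_possible_training_use(candidate_losses, reference_losses, threshold=2.0):
    """Return True when the loss gap exceeds the (arbitrary, assumed) threshold."""
    return loss_gap_score(candidate_losses, reference_losses) > threshold


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real model.
    proprietary = [1.0, 1.1, 0.9, 1.0]
    unseen = [2.0, 2.1, 1.9, 2.0]
    print(flag_possible_training_use(proprietary, unseen))
```

A flag from this kind of check is only a hint of membership, and the abstract itself notes that such detection can become fragile.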