Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data (Event)

BREAKTHROUGH | 2026-06-01 | Impact: HIGH

arXiv:2601.19936v2 | Announce Type: replace-cross

Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlatio
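
The quantity the abstract singles out, the gap between the target token and the model's top-1 prediction, can be illustrated with a minimal sketch. This is not the paper's scoring rule, which the truncated abstract does not give. The function names (`top1_gaps`, `mean_largest_gaps`), the fraction parameter `k`, and the choice to average the largest gaps are all assumptions made for illustration, loosely modelled on the "K%" in the title.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def top1_gaps(logits_per_position, target_ids):
    """Per-token gap: log p(top-1 token) - log p(actual target token).

    The gap is 0 when the target is the model's top-1 prediction and
    grows as the target falls further behind the top-1 token.
    """
    gaps = []
    for logits, target in zip(logits_per_position, target_ids):
        logp = log_softmax(logits)
        gaps.append(max(logp) - logp[target])
    return gaps


def mean_largest_gaps(gaps, k=0.2):
    """Hypothetical aggregate (an assumption, not the paper's formula):
    the negated mean of the k fraction of largest gaps.

    Values closer to 0 mean that even the worst-predicted tokens sit
    near the top-1 prediction.
    """
    n = max(1, math.ceil(len(gaps) * k))
    largest = sorted(gaps, reverse=True)[:n]
    return -sum(largest) / n


# Toy example: two positions over a three-token vocabulary.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_targets = [0, 2]  # first target is the top-1 token, second is not
toy_gaps = top1_gaps(toy_logits, toy_targets)
toy_score = mean_largest_gaps(toy_gaps, k=0.5)
```

In practice the logits would come from a forward pass of the language model over the candidate text. A plain token-likelihood score would use only `logp[target]`, whereas the gap also accounts for how confident the model was in its own top choice at that position.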