Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Event: PRODUCT_LAUNCH | 2026-06-01 | Impact: MEDIUM

arXiv:2601.19936v2 | Announce Type: replace-cross

Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the gap between the target token and the model's top-1 prediction, as well as local correlatio