LLMSurgeon: Diagnosing Data Mixture of Large Language Models (Event)

PRODUCT_LAUNCH | 2026-05-29 | Impact: MEDIUM

arXiv:2605.30348v1 | Announce Type: new

Abstract: The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize Data Mixture Surgery (DMS): given only generated text from a target LLM, estimate the