Not All Synthetic Data Is Yours to Learn From (Event)

PRODUCT_LAUNCH | 2026-06-01 | Impact: MEDIUM

arXiv:2605.31126v1. Announce Type: new. Abstract: Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities alrea