KletterMix: Climbing Toward High-Quality German Pretraining Data [Event]

BREAKTHROUGH | 2026-06-03 | Impact: HIGH

arXiv:2606.03773v1 | Announce Type: new

Abstract: High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining an