KletterMix: Climbing Toward High-Quality German Pretraining Data 文章

ArXiv CS.CL2026-06-03NEWSen作者: Maurice Kraus, Ruben H\"arle, Sebastian Sztwiertnia, Abbas Goher Khan, Mehdi Ali, Michael Fromm, Kristian Kersting

摘要

arXiv:2606.03773v1 Announce Type: new Abstract: High-quality pretraining data is a central ingredient in modern language models, but German-language resources remain far less developed than their English counterparts: they are often smaller, less carefully curated, weakly documented, and rarely validated through controlled training experiments. We introduce KletterMix, a high-quality German corpus for language model pretraining and annealing, designed as a reusable dataset artifact for the natural language processing and modeling community. KletterMix is built by translating a state-of-the-art English pretraining corpus into German while preserving document boundaries, metadata, source structure, and topical diversity. This construction yields a German corpus with the scale and diversity of a modern pretraining dataset, while enabling direct comparison to its English source.

相关公司

暂无数据

相关人物

暂无数据

相关技术

暂无数据