Pretraining Language Models on Historical Text 事件

PRODUCT_LAUNCH2026-06-03影响: MEDIUM

Pretraining Language Models on Historical Text arXiv:2606.02991v1 Announce Type: new Abstract: We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historic