Pretraining Language Models on Historical Text 文章

ArXiv CS.CL2026-06-03NEWSen作者: Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

摘要

arXiv:2606.02991v1 Announce Type: new Abstract: We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instructing tuning, a post-training framework that constraints responses to remain directly grounded in historical source documents. Using this framework we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct.