Understanding Data Temporality Impact on Large Language Models Pre-training

ArXiv CS.CL | 2026-05-26 | NEWS | en | Authors: Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave

Details

Source site: ArXiv CS.CL
Authors: Hippolyte Pilchen, Romain Fabre, Franck Signe Talla, Patrick Perez, Edouard Grave
Article type: NEWS
Language: en
Publication date: 2026-05-26

Abstract

arXiv:2605.22769v2 (announce type: replace)

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training.
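
The contrast between shuffled and temporally ordered pre-training data can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not describe. The `Doc` type, the `"YYYY-WW"` snapshot labels, and the choice to shuffle documents inside each snapshot are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    snapshot: str  # hypothetical label such as "2019-35" (year, zero-padded week)
    text: str


def shuffled_order(docs, seed=0):
    """Baseline: every document is mixed together regardless of snapshot."""
    rng = random.Random(seed)
    out = list(docs)
    rng.shuffle(out)
    return out


def temporal_order(docs, seed=0):
    """Oldest snapshot first; shuffling within a snapshot is an assumption."""
    rng = random.Random(seed)
    by_snapshot = {}
    for doc in docs:
        by_snapshot.setdefault(doc.snapshot, []).append(doc)
    out = []
    # Zero-padded "YYYY-WW" labels sort chronologically as plain strings.
    for snapshot in sorted(by_snapshot):
        group = by_snapshot[snapshot]
        rng.shuffle(group)
        out.extend(group)
    return out


docs = [
    Doc("2021-04", "a"),
    Doc("2019-35", "b"),
    Doc("2020-10", "c"),
    Doc("2019-35", "d"),
]
ordered = temporal_order(docs)
mixed = shuffled_order(docs)
```

In the temporally ordered stream the model sees each snapshot only after all earlier ones, whereas the shuffled baseline exposes it to every time period throughout training.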
