Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm


Details

Source: ArXiv CS.CL
Authors: Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang
Article type: NEWS
Language: en
Published: 2026-06-16

Abstract

arXiv:2602.11543v3 (announce type: replace)

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing.
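To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how training only a subset of experts shrinks per-node training state. It is an illustration, not the paper's accounting: the model sizes, the helper names (`trainable_params`, `training_state_bytes`), and the 12 bytes per parameter (fp32 gradient plus two fp32 Adam moments) are all assumptions. It also counts only gradient and optimizer state, since the abstract does not say whether a node keeps frozen copies of the experts it does not train.

```python
def trainable_params(shared: int, per_expert: int, experts_trained: int) -> int:
    """Parameters that need gradients and optimizer state on one node."""
    return shared + per_expert * experts_trained


def training_state_bytes(n_params: int, bytes_per_param: int = 12) -> int:
    """Gradient + Adam moment memory, assuming fp32 (4 B grad + 4 B m + 4 B v)."""
    return n_params * bytes_per_param


# Hypothetical sizes chosen only for illustration (not taken from the paper).
SHARED = 500_000_000       # non-expert parameters (attention, embeddings, router)
PER_EXPERT = 100_000_000   # parameters belonging to one expert
NUM_EXPERTS = 16           # experts in the full MoE model
EXPERTS_PER_NODE = 4       # experts a single node trains

full = training_state_bytes(trainable_params(SHARED, PER_EXPERT, NUM_EXPERTS))
subset = training_state_bytes(trainable_params(SHARED, PER_EXPERT, EXPERTS_PER_NODE))
print(f"full: {full / 1e9:.1f} GB, subset: {subset / 1e9:.1f} GB")
```

Under these assumed numbers, the per-node gradient and optimizer state drops from about 25.2 GB to about 10.8 GB. The saving grows with the fraction of parameters that live in experts.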
