ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training 文章

ArXiv CS.AI2026-05-26NEWSen作者: Minghao Li, Alicia Golden, Samuel Hsia, Michael Kuchnik, Adi Gangidi, Xu Zhang, Ashmitha Jeevaraj Shetty, Zachary DeVito, Weiwei Chu, Dong He, Haoci Zhang, Yuchen Hao, Ruoming Pang, James Hongyi Zeng, Ying Zhang, Minlan Yu, Carole-Jean Wu

摘要

arXiv:2605.24326v1 Announce Type: cross Abstract: The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. We refer to such paradigm as "scale-across" training. As infrastructure expands, the system design space becomes increasingly intricate, encompassing new model architectures, hardware heterogeneity, and evolving communication patterns. Drawing from Meta's production experience, we highlight the complexities of deploying training jobs across a few data centers housing hundreds of thousands of GPUs. To accelerate exploration of the large design space and to enable efficient training for frontier model development, we conduct in-depth characterization of three key design dimensions: parallelism placement, parallelism scheduling, and network layer technologies.