From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs 文章
摘要
arXiv:2605.09370v2 Announce Type: replace-cross Abstract: Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however, remains scarce. This technical report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The cluster operates within a cross-organizational environment in which five parties (SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data) share a unified monitoring pipeline. This arrangement enabled joint diagnosis of a 60-node-scale storage I/O bottleneck that did not appear at 2-4-node scale, a production-scale phenomenon no single team could isolate alone.
相关事件查看全部 (1)
相关公司查看全部 (3)
相关人物
暂无数据