Model Parallelism With Subnetwork Data Parallelism

ArXiv CS.AI, 2026-06-02. Authors: Vaibhav Singh, Zafir Khalid, Pietro Cagnasso, Edouard Oyallon, Eugene Belilovsky

Abstract

arXiv:2507.09029v5 (announce type: replace-cross). Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies, neuron-level and block-level, applied across both transformers and CNNs. In experiments spanning pre-training of a 1B-parameter LLaMA on FineWeb to ResNet-18 on CIFAR, SDP reduces per-device memory usage by 28%-60% while maintaining or improving performance under FLOP-matched settings.
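The difference between the two masking regimes can be illustrated with a small NumPy sketch. It is not the authors' implementation: the toy two-layer network, the three hand-picked neuron-level masks, the helper names `grads` and `aggregate`, and the rule of averaging each gradient entry over the workers that hold it are all assumptions made for illustration. For simplicity every worker sees the same input, so the comparison with the dense gradient is exact rather than in expectation.

```python
import numpy as np

def grads(W1, w2, x, t, fwd_mask, bwd_mask):
    """Gradients of 0.5*(y - t)^2 for y = w2 . relu(W1 x), with per-neuron masks."""
    pre = W1 @ x
    h = np.maximum(pre, 0.0) * fwd_mask        # forward masking drops hidden units
    err = w2 @ h - t
    g_w2 = err * h * bwd_mask                  # backward masking drops gradient entries
    g_pre = err * w2 * fwd_mask * (pre > 0)
    g_W1 = np.outer(g_pre * bwd_mask, x)
    return g_W1, g_w2

def aggregate(per_worker, masks):
    """Average each gradient entry over the workers that hold it (assumed rule)."""
    coverage = masks.sum(axis=0)               # workers per hidden neuron
    g_W1 = sum(g[0] for g in per_worker) / coverage[:, None]
    g_w2 = sum(g[1] for g in per_worker) / coverage
    return g_W1, g_w2

rng = np.random.default_rng(0)
W1 = rng.uniform(0.1, 1.0, size=(6, 4))        # positive values keep every ReLU active
w2 = rng.uniform(0.1, 1.0, size=6)
x = rng.uniform(0.1, 1.0, size=4)
t = 1.0

# Hypothetical neuron-level subnetworks: each worker holds 4 of 6 hidden neurons.
masks = np.array([[1, 1, 1, 1, 0, 0],
                  [0, 0, 1, 1, 1, 1],
                  [1, 1, 0, 0, 1, 1]], dtype=float)
ones = np.ones(6)

full = grads(W1, w2, x, t, ones, ones)
bwd = aggregate([grads(W1, w2, x, t, ones, m) for m in masks], masks)
fwd = aggregate([grads(W1, w2, x, t, m, m) for m in masks], masks)

bwd_matches = np.allclose(bwd[0], full[0]) and np.allclose(bwd[1], full[1])
fwd_matches = np.allclose(fwd[0], full[0]) and np.allclose(fwd[1], full[1])
print("backward masking recovers dense gradient:", bwd_matches)
print("forward masking recovers dense gradient:", fwd_matches)
```

In this toy setting the backward-masked gradients recombine into the dense gradient because the forward pass is untouched, while the forward-masked ones do not, since each worker's prediction is computed from a reduced network. That is consistent with the abstract's description of backward masking as retaining unbiased gradients and forward masking as trading exactness for efficiency and regularization.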
