Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment 文章

ArXiv CS.AI2026-05-26NEWSen作者: Muhammad Azlan Qazi, Alexandros Iosifidis, Qi Zhang

摘要

arXiv:2605.25682v1 Announce Type: cross Abstract: Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical benefits on real hardware remain unclear: prior work relies largely on simulations that overlook hardware-specific communication overheads. We present a hardware prototype study on NVIDIA Jetson Orin Nano devices connected over WiFi. Our key finding is that the dominant bottleneck is not just network bandwidth but also the CPU-GPU staging during communication. Because Jetson's integrated GPU architecture lacks the PCIe/NVLink pathway that NCCL requires, all inter-device data communication should be routed through GLOO and staged in CPU memory; an overhead that scales with communication data volume and makes full-tensor exchange slower than single-device inference across the batch sizes for medium sized models such as ViT.