When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

ArXiv CS.AI | 2026-05-28 | Authors: Pu Li, Jiawen Qi, Qinyu Chen

Abstract

arXiv:2605.27435v1 (announce type: cross)

Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (by up to 1.6x), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05-1.2x). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading.
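To make the stage-level reversal concrete, the following minimal Python sketch combines the two reported ratios into end-to-end latency for three backend placements. It is an illustration only, not the authors' methodology: the function name, the baseline CPU timings, and the `switch_overhead_s` parameter are hypothetical, and the defaults use the upper ends of the reported ranges (1.6x and 1.2x), which the abstract states as bounds rather than typical values.

```python
def compare_placements(cpu_prefill_s, cpu_decode_s,
                       prefill_cpu_advantage=1.6,
                       decode_npu_speedup=1.2,
                       switch_overhead_s=0.0):
    """Estimate end-to-end latency (seconds) for three backend placements.

    cpu_prefill_s, cpu_decode_s: hypothetical CPU stage timings.
    prefill_cpu_advantage: how many times faster the CPU is in Prefill
        (the abstract reports up to 1.6x).
    decode_npu_speedup: how many times faster the NPU is in Decode
        (the abstract reports 1.05-1.2x).
    switch_overhead_s: assumed cost of moving between backends; the
        abstract says such overhead exists but gives no figure here.
    """
    # Derive NPU stage timings from the CPU timings and the two ratios.
    npu_prefill_s = cpu_prefill_s * prefill_cpu_advantage
    npu_decode_s = cpu_decode_s / decode_npu_speedup

    return {
        "cpu_only": cpu_prefill_s + cpu_decode_s,
        "npu_only": npu_prefill_s + npu_decode_s,
        # Stage-aware placement: Prefill on CPU, Decode on NPU.
        "cpu_prefill_npu_decode": cpu_prefill_s + npu_decode_s + switch_overhead_s,
    }


if __name__ == "__main__":
    # Hypothetical workload: 2 s of CPU Prefill, 6 s of CPU Decode.
    for name, seconds in compare_placements(2.0, 6.0).items():
        print(f"{name}: {seconds:.2f} s")
```

With these assumed timings, running everything on the NPU is slightly slower than the CPU-only baseline, because the Prefill penalty outweighs the small Decode gain. The stage-aware placement comes out ahead only while `switch_overhead_s` stays below the Decode saving, which is consistent with the abstract's point that scheduling overhead and cross-backend fallback erode the benefit of offloading.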
