Zamba2-VL Technical Report

arXiv cs.CV, 2026-06-02. Authors: Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

Abstract

arXiv:2606.00390v1. We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.
