CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

ArXiv CS.CV | 2026-06-16 | NEWS | en | Authors: Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli

Details

Source site: ArXiv CS.CV
Authors: Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli
Article type: NEWS
Language: en
Publication date: 2026-06-16

Abstract

arXiv:2601.08010v2 (announce type: replace)

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model.
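The abstract gives no implementation details, so the following is only a minimal sketch of what an iterative sample, verify, and aggregate loop could look like. Everything in it is an assumption made for illustration: the function name `aggregate_iteratively`, the `Trajectory` shape, the `sample` and `verify_step` callables (stand-ins for a vision-language model and a visual verifier), and the final majority vote, which is a simple placeholder rather than the paper's aggregation rule.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Assumed shape: (reasoning steps, final answer). Not taken from the paper.
Trajectory = Tuple[List[str], str]


def aggregate_iteratively(
    sample: Callable[[List[Trajectory]], Trajectory],
    verify_step: Callable[[str], bool],
    num_candidates: int = 4,
    num_rounds: int = 3,
) -> Tuple[str, List[Trajectory]]:
    """Toy loop: sample candidates, drop unverified steps, feed survivors back.

    `sample` is a hypothetical stand-in for a vision-language model call that
    may condition on the previous round's filtered trajectories.
    `verify_step` is a hypothetical stand-in for a visual verification check.
    """
    if num_candidates < 1 or num_rounds < 1:
        raise ValueError("num_candidates and num_rounds must be at least 1")

    context: List[Trajectory] = []
    for _ in range(num_rounds):
        # Draw several candidate trajectories, conditioned on the last round.
        candidates = [sample(context) for _ in range(num_candidates)]
        # Keep only the steps that pass the verification check.
        context = [
            ([step for step in steps if verify_step(step)], answer)
            for steps, answer in candidates
        ]

    # Placeholder aggregation (an assumption): majority vote over final answers.
    votes = Counter(answer for _, answer in context)
    return votes.most_common(1)[0][0], context
```

In this sketch the two callables are injected, so the loop can be exercised with deterministic stubs before any real model is wired in. How the actual method merges trajectories and performs verification is not specified in the abstract.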
