Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

arXiv cs.CV | 2026-06-02 | Authors: Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang

Abstract

arXiv:2606.01558v1

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence.
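To make "limited direct visual-token access" concrete, the sketch below shows one simple way such access could be quantified: the share of each generated token's attention mass that lands on image-token positions. This is an illustrative assumption, not the metric or method used by the authors. The function name `visual_attention_share`, the zero-padded row layout, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence


def visual_attention_share(
    attn_rows: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> List[float]:
    """Fraction of each generated token's attention mass on visual positions.

    attn_rows: one row per generated token; each row holds that token's
        attention weights over all context positions (assumed here to be
        zero-padded to a common length).
    is_visual: True where a context position holds an image token.
    """
    shares = []
    for row in attn_rows:
        if len(row) != len(is_visual):
            raise ValueError("each attention row must match the mask length")
        total = sum(row)
        visual = sum(w for w, v in zip(row, is_visual) if v)
        # Guard against an all-zero row instead of dividing by zero.
        shares.append(visual / total if total > 0 else 0.0)
    return shares


if __name__ == "__main__":
    # Toy context: two image tokens followed by three text tokens.
    mask = [True, True, False, False, False]
    # Hypothetical attention rows for three successive rationale tokens.
    rows = [
        [0.4, 0.3, 0.3, 0.0, 0.0],
        [0.1, 0.1, 0.3, 0.5, 0.0],
        [0.05, 0.05, 0.2, 0.3, 0.4],
    ]
    for step, share in enumerate(visual_attention_share(rows, mask)):
        print(f"rationale token {step}: visual share = {share:.2f}")
```

In this toy input the visual share falls as the rationale grows, which is the kind of pattern the phrase "limited direct visual-token access during rationale generation" suggests. Real measurements would need attention weights extracted from an actual model and some choice of how to aggregate over layers and heads.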
