On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training


Details

Source: ArXiv CS.CV
Authors: Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng
Article type: NEWS
Language: en
Published: 2026-05-29

Abstract

arXiv:2605.29496v1 Announce Type: cross Abstract: Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2.
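
The token-imbalance argument in the abstract can be illustrated with a small sketch. The abstract does not specify the authors' dynamic reweighting scheme, so the code below is not their method. It swaps in a simpler, static segment-balanced average, and the function names, segment labels, and loss values are all hypothetical. It only shows why a plain token-mean loss gives a short perception span less weight than a long reasoning span.

```python
from collections import defaultdict


def token_mean_loss(nll):
    """Standard SFT reduction: average the per-token negative log-likelihood."""
    return sum(nll) / len(nll)


def segment_balanced_loss(nll, segments):
    """Average the per-segment mean losses so each segment type counts equally.

    This is an illustrative stand-in, not the paper's dynamic reweighting.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for loss, seg in zip(nll, segments):
        totals[seg] += loss
        counts[seg] += 1
    per_segment = [totals[s] / counts[s] for s in totals]
    return sum(per_segment) / len(per_segment)


# Hypothetical chain of thought: 2 perception tokens followed by 8 reasoning tokens.
nll = [2.0, 2.0] + [0.5] * 8
segments = ["perception"] * 2 + ["reasoning"] * 8

plain = token_mean_loss(nll)                     # perception carries 2/10 of the weight
balanced = segment_balanced_loss(nll, segments)  # perception carries 1/2 of the weight
```

In this toy case the perception tokens hold 20% of the total weight under the token mean and 50% under the balanced average, so errors in the perception span contribute more to the gradient after rebalancing.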
