Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

arXiv cs.CV | 2026-05-27 | Authors: Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

Abstract

arXiv:2605.11651v4 (announce type: replace)

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by producing intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. When distilling such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to use visual evidence throughout its reasoning trace, because long think-answer traces suffer from visual forgetting. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. With those textual cues masked, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation.
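The abstract does not say how salience is measured or how the mask is applied, so the following is only a minimal sketch of the general idea under stated assumptions: per-token saliency scores are assumed to be given, and masking is assumed to mean replacing the most salient tokens in the reasoning prefix with a placeholder. The function name `mask_salient_prefix`, the `mask_ratio` parameter, and the `<mask>` token are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def mask_salient_prefix(
    tokens: Sequence[str],
    saliency: Sequence[float],
    prefix_len: int,
    mask_ratio: float = 0.5,
    mask_token: str = "<mask>",
) -> List[str]:
    """Replace the most salient tokens in the first `prefix_len` positions.

    Illustrative assumption only: the paper's actual saliency measure and
    masking mechanism are not described in the abstract.
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")

    # Clamp the prefix to the sequence length.
    prefix_len = max(0, min(prefix_len, len(tokens)))
    # Number of prefix tokens to hide (rounded down).
    k = int(prefix_len * mask_ratio)

    # Rank prefix positions by descending saliency; ties go to the earlier token.
    ranked = sorted(range(prefix_len), key=lambda i: (-saliency[i], i))
    to_mask = set(ranked[:k])

    # Tokens outside the prefix are never masked.
    return [mask_token if i in to_mask else tok for i, tok in enumerate(tokens)]


reasoning = ["the", "chart", "shows", "a", "peak", "so", "answer", "B"]
scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.3, 0.6, 0.7]
masked = mask_salient_prefix(reasoning, scores, prefix_len=5, mask_ratio=0.4)
print(masked)
```

In this toy example, the two highest-scoring tokens within the first five positions are hidden, so a student conditioned on the masked prefix could not recover them from text alone and would have to look elsewhere, for instance at the image.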