DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding
Event: PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM
arXiv:2605.26656v1 Announce Type: new
Abstract: Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevita
Related coverage (1)
DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding
ArXiv CS.CV, 2026-05-27