MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

arXiv cs.CV | 2026-05-26 | Authors: Shristi Das Biswas, Kaushik Roy

Abstract

arXiv:2605.26004v1

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets are highly redundant, include many samples with low visual dependency, and cover multimodal reasoning behaviors very unevenly. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons.
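
The abstract names three per-sample signals but does not give their formulas, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes the per-token answer log-probabilities, the answer-to-visual attention weights, and the feed-forward activations have already been collected from forward passes. All function names are hypothetical, and the specific choices (mean log-likelihood difference, one minus normalized entropy, a top-k index set) are illustrative assumptions.

```python
import math
from typing import FrozenSet, Sequence


def multimodal_gain(logp_with_image: Sequence[float],
                    logp_text_only: Sequence[float]) -> float:
    """Mean per-token log-likelihood improvement from adding the image.

    Assumption: "likelihood improvement" is read as an average difference
    of answer-token log-probabilities with and without visual input.
    """
    if len(logp_with_image) != len(logp_text_only) or not logp_with_image:
        raise ValueError("need two equal-length, non-empty sequences")
    diffs = [a - b for a, b in zip(logp_with_image, logp_text_only)]
    return sum(diffs) / len(diffs)


def attention_sharpness(weights: Sequence[float]) -> float:
    """One minus the normalized entropy of attention over visual tokens.

    Assumption: "sharpness" is modeled as low entropy; 0.0 means uniform
    attention, 1.0 means all mass on a single visual token.
    """
    total = sum(weights)
    n = len(weights)
    if n < 2 or total <= 0:
        return 0.0
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(n)


def bridging_relevance(attn: Sequence[Sequence[float]]) -> float:
    """Average sharpness over answer tokens (one row per answer token)."""
    if not attn:
        return 0.0
    return sum(attention_sharpness(row) for row in attn) / len(attn)


def skill_signature(activations: Sequence[float], k: int) -> FrozenSet[int]:
    """Indices of the k most strongly activated feed-forward neurons."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return frozenset(order[:k])
```

Under these assumptions, a sample whose answer becomes more likely once the image is provided gets a positive gain, a sample whose answer tokens attend to a few visual tokens gets a bridging score near 1, and two samples with overlapping signatures can be treated as exercising similar computation. How MAGIC combines the signals into a selection rule is not stated in the abstract and is not modeled here.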
