Bounded-Compute Multimodal Regression for Product-Rating Prediction

arXiv cs.CV, 2026-05-28
Authors: William Leach, Ru He, Sizhuo Ma, Yizhen Jia, Min Cao, Jian Wang, Rick Cao

Abstract

arXiv:2605.27737v1 (announce type: new)

Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation.
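The head replacement described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract says only that a two-layer MLP is fed by pooled decoder states, so the masked mean pooling, the ReLU activation, the layer sizes, and the names `masked_mean_pool`, `RegressionHead`, and `truncate_tokens` are all assumptions made for the example. The point it shows is that the rating comes from a single forward pass over a fixed-length input, with no token-by-token decoding.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average decoder states over valid positions (assumed pooling choice).

    hidden: (batch, seq, dim) final-layer decoder states.
    mask:   (batch, seq) with 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # avoid divide-by-zero
    return summed / counts

class RegressionHead:
    """Hypothetical two-layer MLP mapping a pooled state to one scalar."""

    def __init__(self, dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, dim ** -0.5, size=(dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, hidden_dim ** -0.5, size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, pooled):
        h = np.maximum(pooled @ self.w1 + self.b1, 0.0)  # ReLU (assumed)
        return (h @ self.w2 + self.b2)[:, 0]             # (batch,) ratings

def truncate_tokens(token_ids, max_len):
    """Cap metadata tokens at a fixed length so compute stays bounded."""
    return list(token_ids)[:max_len]
```

In use, `hidden` would come from the backbone's last decoder layer and the head would be trained with a regression loss; both steps are outside this sketch.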
