Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

ArXiv CS.CV | 2026-06-02 | Authors: Bishr Omer Abdelrahman Adam, Xu Li

Abstract

arXiv:2606.02002v1. Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits, and how to weight their contributions per input image, remain unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.