VGGSounder: Audio-Visual Evaluations for Foundation Models

Event: PRODUCT_LAUNCH | 2026-06-04 | Impact: MEDIUM

arXiv:2508.08237v4 (announce type: replace-cross)

Abstract: The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities.