JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Event: PRODUCT_LAUNCH | Date: 2026-05-26 | Impact: MEDIUM

arXiv:2602.18527v2 (announce type: replace)

Abstract: Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a fram