Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

arXiv cs.CV | 2026-05-27 | Authors: Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

Abstract

arXiv:2605.26381v1 (announce type: new)

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment, and evaluate four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities.
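Two ideas in the abstract can be illustrated with a minimal NumPy sketch: appending a footprint mask as a fourth channel (RGB-M), and using cross-attention from a fixed set of latents so that any number of view tokens maps to a fixed-size output. This is not the authors' implementation. The function names `make_rgbm` and `cross_attend`, the array sizes, and the five tokens per view are all assumptions made for illustration; the attention here is single-head with no learned projections, and random arrays stand in for DINOv2 patch features.

```python
import numpy as np


def make_rgbm(rgb, footprint_mask):
    """Append a footprint mask as a fourth channel: (H, W, 3) -> (H, W, 4).

    Hypothetical helper; the image is left uncropped, so the mask acts only
    as an extra input signal rather than removing pixels.
    """
    if rgb.shape[:2] != footprint_mask.shape:
        raise ValueError("mask and image must share height and width")
    mask = footprint_mask.astype(rgb.dtype)[..., None]  # (H, W, 1)
    return np.concatenate([rgb, mask], axis=-1)


def cross_attend(latents, tokens):
    """Single-head cross-attention without learned projections (assumption).

    latents: (L, D) queries; tokens: (N, D) keys and values, N may vary.
    Returns an (L, D) array regardless of N.
    """
    d = latents.shape[-1]
    scores = latents @ tokens.T / np.sqrt(d)           # (L, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over N
    return weights @ tokens                            # (L, D)


rng = np.random.default_rng(0)

# RGB-M: an 8x8 image with a 4x4 footprint in the middle.
rgb = rng.random((8, 8, 3), dtype=np.float32)
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True
rgbm = make_rgbm(rgb, mask)

# Variable number of views: the output shape depends only on the latents.
latents = rng.standard_normal((4, 16))
fused_shapes = []
for n_views in (1, 3, 8):
    tokens = rng.standard_normal((n_views * 5, 16))  # 5 tokens per view (made up)
    fused_shapes.append(cross_attend(latents, tokens).shape)
```

Because the softmax normalizes over however many tokens are present, no padding or pooling to a fixed view count is needed in this sketch; a real Perceiver IO model would add learned query, key, and value projections and multiple attention layers.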