UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

arXiv cs.CV, 2026-06-02. Authors: Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

Abstract

arXiv:2510.13774v2

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily rely on task-specific models, while recent generic models for spatial representations often support only a limited set of modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and point-of-interest (POI) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations.
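
The abstract does not spell out how SMF works, so the following is only an illustrative sketch of one plausible reading: each location has one embedding per modality, a random non-empty subset of modalities is kept, and the kept embeddings are fused. All names (`MODALITIES`, `sample_modalities`, `fuse`) are hypothetical, the embeddings are toy values rather than encoder outputs, and mean pooling stands in for the Transformer-based fusion module described above.

```python
import random

# Hypothetical modality names, taken from the inputs listed in the abstract.
MODALITIES = ["street_view", "remote_sensing", "map", "poi"]


def sample_modalities(available, rng):
    """Pick a random non-empty subset of the available modalities.

    Assumption: this is one way to read "stochastic" fusion, not the
    authors' confirmed procedure.
    """
    k = rng.randint(1, len(available))  # subset size, inclusive on both ends
    return sorted(rng.sample(available, k))


def fuse(embeddings, kept):
    """Average the embeddings of the kept modalities.

    Mean pooling is a stand-in for the Transformer-based fusion module.
    """
    dim = len(next(iter(embeddings.values())))
    return [sum(embeddings[m][i] for m in kept) / len(kept) for i in range(dim)]


# Toy 4-dimensional embeddings, one per modality (not real encoder outputs).
embeddings = {m: [float(i + j) for j in range(4)] for i, m in enumerate(MODALITIES)}

rng = random.Random(0)  # seeded for reproducibility
kept = sample_modalities(MODALITIES, rng)
fused = fuse(embeddings, kept)
```

Because `fuse` accepts any non-empty subset, the same function can produce a representation for a location where only some modalities are available, which is the practical appeal of training with random subsets.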