Compositional Semantics for Open Vocabulary Spatio-semantic Representations

arXiv cs.CV, 2026-05-26. Authors: Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda

Abstract

arXiv:2310.04981v2 (announce type: replace)

Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and that the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating separability of related and unrelated semantics. We prove that z* is discoverable from visual appearance and singular descriptions by iterative gradient descent. We experimentally verify our findings on four embedding spaces including CLIP and SBERT. Our results show that z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics for ideal uniformly distributed high-dimensional embeddings.
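The centroid and gradient-descent claims can be illustrated with a small numerical sketch. The abstract does not state the objective being optimized, so the code below assumes unit-norm embeddings and takes the mean cosine similarity between z* and the members of Z as the objective; under that assumption the normalized centroid is the maximizer. The function names are hypothetical, and random unit vectors stand in for real CLIP or SBERT embeddings.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit L2 norm along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def centroid_embedding(Z):
    """Normalized centroid of the unit-norm row vectors in Z (shape n x d)."""
    return normalize(Z.mean(axis=0))


def mean_similarity(z, Z):
    """Mean cosine similarity between unit vector z and the unit rows of Z."""
    return float((Z @ z).mean())


def gradient_ascent_embedding(Z, steps=500, lr=0.5, seed=0):
    """Projected gradient ascent on mean_similarity (assumed objective)."""
    rng = np.random.default_rng(seed)
    z = normalize(rng.standard_normal(Z.shape[1]))  # random starting point
    grad = Z.mean(axis=0)  # gradient of the objective; it does not depend on z
    for _ in range(steps):
        z = normalize(z + lr * grad)  # ascent step, then project to unit sphere
    return z


# Stand-in data: 5 "semantics" in a 64-dimensional embedding space.
rng = np.random.default_rng(42)
Z = normalize(rng.standard_normal((5, 64)))
unrelated = normalize(rng.standard_normal((100, 64)))

z_star = centroid_embedding(Z)
z_gd = gradient_ascent_embedding(Z)

print("max |z_star - z_gd|:", np.abs(z_star - z_gd).max())
print("mean similarity to members of Z:", mean_similarity(z_star, Z))
print("mean similarity to unrelated vectors:", mean_similarity(z_star, unrelated))
```

In this toy setting the iterative solution coincides with the closed-form centroid, and z* is on average more similar to the members of Z than to unrelated vectors. It does not reproduce the paper's probabilistic separability bound or its experiments on learned embedding spaces.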