Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

arXiv cs.AI | 2026-05-29 | Author: Jiayi Fang

Abstract

arXiv:2605.28865v1 (cross-listed). What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry -- direction accuracy 0.677 ± 0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192 ± 0.047 versus 0.029 for random encoders (a 6.6x improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r = -0.61, p = 0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta=0.
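The abstract does not say how "position RSA" is computed. As a rough illustration only, the sketch below uses the common representational similarity analysis recipe: take all pairwise distances between latent codes, take all pairwise distances between the matching physical positions, and report the Spearman rank correlation between the two lists. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
import random
from itertools import combinations


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def position_rsa(latents, positions):
    """Assumed RSA score: rank agreement between latent and physical distances."""
    return spearman(pairwise_distances(latents), pairwise_distances(positions))


# Toy data (an assumption for illustration): 2-D positions, plus latents that
# either embed the positions with noise or ignore them entirely.
random.seed(0)
positions = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(40)]
structured = [
    (x + random.gauss(0, 1), y + random.gauss(0, 1), random.gauss(0, 1))
    for x, y in positions
]
unstructured = [tuple(random.gauss(0, 1) for _ in range(3)) for _ in positions]

print("structured latents:  ", round(position_rsa(structured, positions), 3))
print("unstructured latents:", round(position_rsa(unstructured, positions), 3))
```

A score near 1 means that points far apart in the environment are also far apart in latent space. A score near 0 means the latent geometry carries little positional structure, which is the role the randomly initialized encoder plays as a baseline in the abstract.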
