X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

ArXiv CS.CV | 2026-05-26 | NEWS | en

Authors: Baolu Li (Victor), Jingyu Qian (Victor), Rui Guo (Victor), Yilun Chen (Victor), Hanpeng Liu (Victor), Yuan Lin (Victor), Junhong Zhou (Victor), Ruixin Liu (Victor), Willow Yang (Victor), Yutong Zheng (Victor), Zhenli Zhang (Victor), Tenglong (Victor), Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu

Abstract

arXiv:2605.24892v1 (announce type: new)

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA models to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: (1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; and (2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control.
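The two challenges named in the abstract can be illustrated with a toy experiment. The sketch below is not part of X-Foresight; it assumes a synthetic one-dimensional "video" (a slowly drifting sinusoid) and a hypothetical copy-last-frame baseline, both introduced here purely for illustration. On such temporally redundant data, repeating the current frame is already a near-perfect next-frame predictor, while the same shortcut fails at a longer horizon.

```python
import math


def make_video(num_frames=200, width=16, step=0.02):
    """Synthetic 'video': a smooth sinusoidal pattern that drifts slowly over time.

    Each frame is a list of `width` pixel values; `step` is the phase shift
    per frame (an illustrative assumption, chosen so adjacent frames are
    nearly identical).
    """
    return [
        [math.sin(2 * math.pi * (x / width) + step * t) for x in range(width)]
        for t in range(num_frames)
    ]


def copy_last_frame_mse(video, horizon):
    """Mean squared error of predicting frame t+horizon by repeating frame t."""
    total, count = 0.0, 0
    for t in range(len(video) - horizon):
        for current, future in zip(video[t], video[t + horizon]):
            total += (current - future) ** 2
            count += 1
    return total / count


video = make_video()
dense_error = copy_last_frame_mse(video, horizon=1)   # next-frame prediction
long_error = copy_last_frame_mse(video, horizon=50)   # long-horizon prediction

print(f"copy-last-frame MSE at horizon 1:  {dense_error:.6f}")
print(f"copy-last-frame MSE at horizon 50: {long_error:.6f}")
```

In this toy setting the horizon-1 error is tiny, so a model trained only on dense next-frame targets has little incentive to learn more than extrapolation, whereas the long-horizon target is where non-trivial dynamics would have to be modeled.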