Abstract
arXiv:2603.10422v2

World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to the VLA policy without pixel-space supervision. World2Act operates in two stages: 1) it induces a shared video-action latent space by contrastively aligning WM-dynamics latents with action embeddings, and 2) it post-trains the VLA by guiding policy action representations toward WM-imagined dynamics rather than decoded pixels. Built on GR00T-N1.6, World2Act delivers absolute success-rate gains of up to +2.5% on simulation benchmarks (RoboCasa, LIBERO, Bridge-SIMPLER) and +6.
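The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss functions, so the symmetric InfoNCE objective for stage 1 and the cosine-distance guidance term for stage 2 are assumptions, as are the function names `symmetric_info_nce` and `latent_guidance_loss` and the `temperature` parameter. The sketch uses plain Python lists in place of real WM latents and action embeddings.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against zero vectors)."""
    norm = max(math.sqrt(sum(x * x for x in v)), eps)
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(dyn_latents, act_embeds, temperature=0.1):
    """Stage 1 (assumed form): contrastively align paired latents.

    dyn_latents[i] and act_embeds[i] are treated as a positive pair;
    every other combination in the batch is a negative.
    """
    z = [l2_normalize(v) for v in dyn_latents]
    a = [l2_normalize(v) for v in act_embeds]
    n = len(z)
    # Cosine-similarity logits between every dynamics/action pair.
    logits = [
        [sum(p * q for p, q in zip(z[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Dynamics -> action direction: the correct "class" is the diagonal.
    loss_d2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Action -> dynamics direction uses the transposed logits.
    logits_t = [list(col) for col in zip(*logits)]
    loss_a2d = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_d2a + loss_a2d)


def latent_guidance_loss(policy_repr, imagined_dyn):
    """Stage 2 (assumed form): cosine distance between a policy action
    representation and a WM-imagined dynamics latent in the shared space.
    No pixels are decoded at any point."""
    p = l2_normalize(policy_repr)
    d = l2_normalize(imagined_dyn)
    return 1.0 - sum(x * y for x, y in zip(p, d))


if __name__ == "__main__":
    dyn = [[1.0, 0.0], [0.0, 1.0]]
    act_aligned = [[2.0, 0.0], [0.0, 3.0]]
    act_shuffled = [[0.0, 3.0], [2.0, 0.0]]
    print("aligned loss: ", symmetric_info_nce(dyn, act_aligned))
    print("shuffled loss:", symmetric_info_nce(dyn, act_shuffled))
    print("guidance loss:", latent_guidance_loss([1.0, 0.0], [0.0, 1.0]))
```

In this toy setting, correctly paired latents yield a much lower contrastive loss than shuffled pairs, which is the property a shared video-action latent space would rely on. In practice both terms would be computed on batched tensors in a deep learning framework and backpropagated through the relevant encoders.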