Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

ArXiv CS.CV | 2026-05-28 | Authors: Marcus Fechner, Hamza Adnan, Constantin C. Lüth, Matthew T. Jackson, Alexey Zakharov, J. Marius Zöllner

Abstract

arXiv:2602.02259v2. Latent action models (LAMs) offer a promising path to pre-training embodied agents on large amounts of action-free video. They infer latent actions between consecutive observations that can later be decoded to ground-truth actions using a small number of labels. However, recent work has shown that this recipe fails in the presence of action-correlated visual distractors common in real-world video, such as dynamic backgrounds, camera shake, or other moving objects. In these scenarios, the standard reconstruction objective drives latent actions to encode exogenous motion instead of agent-controlled dynamics, resulting in policies that underperform when fine-tuned. We observe, however, that endogenous and exogenous factors are typically spatially separated in pixel space: control-relevant change is concentrated on the agent, while distractor motion occurs elsewhere.
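
The spatial-separation observation can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a hand-written binary agent mask and two tiny synthetic frames, and all names (`squared_change`, `agent_mask`, the frame layout) are hypothetical. It only shows why an unmasked pixel-change objective can be dominated by a distractor, while restricting it to the agent region isolates the agent-controlled change.

```python
def squared_change(frame_a, frame_b, mask=None):
    """Sum of squared pixel differences between two frames.

    If `mask` is given, only pixels where mask == 1 contribute.
    """
    total = 0.0
    for r, (row_a, row_b) in enumerate(zip(frame_a, frame_b)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            weight = 1.0 if mask is None else float(mask[r][c])
            total += weight * (b - a) ** 2
    return total


def blank(height, width):
    """Return a height x width frame of zeros."""
    return [[0.0] * width for _ in range(height)]


H, W = 4, 8
frame_t, frame_t1 = blank(H, W), blank(H, W)

# Agent (assumed to live in the left half): one pixel moves right by one.
frame_t[1][1] = 1.0
frame_t1[1][2] = 1.0

# Distractor (right half): a 2x2 block shifts right by one column.
for r in (2, 3):
    for c in (5, 6):
        frame_t[r][c] = 1.0
    for c in (6, 7):
        frame_t1[r][c] = 1.0

# Hypothetical hand-specified mask: 1 on the left half, 0 elsewhere.
agent_mask = [[1 if c < 4 else 0 for c in range(W)] for _ in range(H)]

full_change = squared_change(frame_t, frame_t1)
agent_change = squared_change(frame_t, frame_t1, agent_mask)
distractor_share = (full_change - agent_change) / full_change

print(full_change, agent_change, round(distractor_share, 2))
```

In this toy setup, two thirds of the total pixel change comes from the distractor, so an objective computed over the whole frame mostly rewards explaining exogenous motion; the masked version sees only the agent's change.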
