INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models 文章

ArXiv CS.AI2026-05-26NEWSen作者: Ulas Berk Karli, Ziyao Shangguan, Tesca FItzgerald

摘要

arXiv:2510.01389v2 Announce Type: replace-cross Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present \textbf{INSIGHT}, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we extract per-token \emph{entropy}, \emph{log-probability}, and Dirichlet-based estimates of \emph{aleatoric and epistemic uncertainty}, and train compact transformer classifiers to map these sequences to help triggers. We explore supervision regimes for strong or weak supervision, and extensively compare them across in-distribution and out-of-distribution tasks.