AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue 文章

ArXiv CS.CL2026-05-26NEWSen作者: Jihyung Park, Saleh Afroogh, Junfeng Jiao

摘要

arXiv:2605.23974v1 Announce Type: new Abstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据