EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

arXiv cs.CV, 2026-06-02. Authors: Jianlin Xiang, Yanshan Li, Linhui Dai

Abstract

arXiv:2606.01601v1. Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance.
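The idea of reading each object query's cross-attention as a per-instance saliency map can be illustrated with a toy sketch. This is not the EIVE reformulation itself, which the abstract does not spell out. It assumes plain single-head scaled dot-product attention, and the function name `query_saliency_maps`, the toy keys and queries, and the 2x2 feature map are all hypothetical. It only shows why such maps come from one forward pass, with no gradients or repeated inference.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_saliency_maps(queries, keys, height, width):
    """Return one height x width attention map per object query.

    queries: list of d-dimensional vectors, one per object query.
    keys:    list of height * width d-dimensional vectors, i.e. the
             encoder feature map flattened in row-major order.
    Assumption: single-head scaled dot-product attention (illustrative only).
    """
    if len(keys) != height * width:
        raise ValueError("expected height * width key vectors")
    d = len(keys[0])
    maps = []
    for q in queries:
        # Similarity of this query to every spatial location.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Fold the flat attention weights back into a 2D map.
        maps.append(
            [weights[r * width:(r + 1) * width] for r in range(height)]
        )
    return maps


# Toy data (hypothetical): a 2x2 feature map and two object queries.
keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, -1.0]]
queries = [[3.0, 0.0], [0.0, 3.0]]
maps = query_saliency_maps(queries, keys, height=2, width=2)
```

Each query yields its own map whose weights sum to one over spatial locations, so two predicted instances in the same image get two distinct explanations at no extra inference cost.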