TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

ArXiv CS.CL | 2026-05-29 | Authors: Girish A. Koushik, Helen Treharne, Diptesh Kanojia

Abstract

arXiv:2601.11178v2

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision.
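As a rough illustration of the difference between a binary flag and the structured evidence the abstract describes (timestamps and target identities), the sketch below shows one possible output record. It is not the authors' schema: the class names `FlaggedSegment` and `StructuredVerdict` and all of their fields are hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlaggedSegment:
    """One piece of moderation evidence (hypothetical fields, not from the paper)."""
    start_s: float   # segment start, in seconds
    end_s: float     # segment end, in seconds
    target: str      # identity or group the content targets
    modality: str    # e.g. "audio", "visual", or "text"


@dataclass
class StructuredVerdict:
    """A structured result from which a binary label can still be derived."""
    segments: List[FlaggedSegment] = field(default_factory=list)

    def is_hateful(self) -> bool:
        # Any flagged segment means the video as a whole is flagged.
        return len(self.segments) > 0

    def targets(self) -> List[str]:
        # Distinct targets, in order of first appearance.
        seen: List[str] = []
        for seg in self.segments:
            if seg.target not in seen:
                seen.append(seg.target)
        return seen
```

A human moderator could jump straight to each `start_s` in the video, whereas a plain classifier would return only the equivalent of `is_hateful()`.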
