When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

arXiv cs.AI | 2026-06-06 | Author: Suraj Babu Thimma Krishnaram

Abstract

arXiv:2606.05654v1 (announce type: cross). Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern.
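The workflow metrics named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the threshold router `route`, its cutoff values, the metric definitions (the paper may define them differently), and the toy scores, which are invented and are not the paper's data.

```python
from typing import Sequence


def route(score: float, flag_t: float = 0.8, review_t: float = 0.5) -> str:
    """Hypothetical router: map a hate score to an action via two thresholds."""
    if score >= flag_t:
        return "FLAG"
    if score >= review_t:
        return "REVIEW"
    return "ALLOW"


def flip_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of paired items whose action differs between the two forms."""
    assert len(a) == len(b) and len(a) > 0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def action_rate(actions: Sequence[str], action: str) -> float:
    """Fraction of items routed to the given action."""
    return sum(x == action for x in actions) / len(actions)


def non_hate_false_flag_rate(actions: Sequence[str], is_hate: Sequence[bool]) -> float:
    """Assumed definition: share of non-hateful items routed to FLAG."""
    non_hate = [act for act, hate in zip(actions, is_hate) if not hate]
    return sum(act == "FLAG" for act in non_hate) / len(non_hate)


# Invented toy scores for the same eight items in two surface forms.
clean_scores = [0.10, 0.30, 0.90, 0.60, 0.20, 0.85, 0.05, 0.15]
mixed_scores = [0.12, 0.55, 0.88, 0.65, 0.82, 0.70, 0.10, 0.20]
is_hate = [False, False, True, True, False, True, False, False]

clean_actions = [route(s) for s in clean_scores]
mixed_actions = [route(s) for s in mixed_scores]

print("flip rate:", flip_rate(clean_actions, mixed_actions))
print("review rate (clean):", action_rate(clean_actions, "REVIEW"))
print("review rate (mixed):", action_rate(mixed_actions, "REVIEW"))
print("non-hate false-flag (clean):", non_hate_false_flag_rate(clean_actions, is_hate))
print("non-hate false-flag (mixed):", non_hate_false_flag_rate(mixed_actions, is_hate))
```

Because the pairing holds the underlying content fixed, any flip in this sketch comes from a score crossing a threshold when only the surface form changes, which is the kind of instability the paired flip rate is meant to capture.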
