Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails 文章

ArXiv CS.CL2026-06-05NEWSen作者: Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

摘要

arXiv:2606.05936v1 Announce Type: new Abstract: Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech.

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails 文章

摘要

相关事件查看全部 (1)

相关公司

相关人物

相关产品查看全部 (1)

相关技术