Data filtering methods for training language models 文章

ArXiv CS.CL2026-05-29NEWSen作者: Egor Shevchenko, Elena Bruches

摘要

arXiv:2605.29807v1 Announce Type: new Abstract: Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples.

相关事件查看全部 (1)

Data filtering methods for training language models
2026-05-29PRODUCT_LAUNCH影响: MEDIUM

相关公司

暂无数据

相关人物

暂无数据