BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation

PRODUCT_LAUNCH | 2026-05-27 | Impact: MEDIUM

arXiv:2605.27050v1

Abstract: We present BhashaSetu, a linguistically enriched English-Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources inclu