HardMTBench: Stress-Testing Chinese-English Translation on Knowledge-Intensive Domains

Event type: PRODUCT_LAUNCH · Date: 2026-05-28 · Impact: MEDIUM

arXiv:2605.28315v1 (announce type: new)

Abstract: General-purpose machine translation benchmarks such as FLORES-200 have reached a saturation regime on Chinese-English pairs, where modern large language models cluster within a narrow band of high scores. Across 22 systems, FLORES-200 zh-en GEMBA scores fall in a 7.87-point range with a standard deviation of 2.29, which compresses the separation between systems
