Abstract
arXiv:2512.14561v2

Despite the growing promise of large language models (LLMs) in automated essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLM-generated scores and human ratings. Agreement levels varied substantially both across and within studies, with reported values spanning a wide range. Overall, the findings suggest that LLM-human agreement is highly context-dependent. Implications, challenges, and directions for future research are discussed.
Related events
Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis
2026-05-27 | PRODUCT_LAUNCH | Impact: MEDIUM