GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

arXiv cs.CL, 2026-05-28. Authors: Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat

Abstract

arXiv:2605.27866v1

Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than for direct classification.
