Temporal Stability and Few-Shot Prompting in Math Task Assessment 文章

ArXiv CS.AI2026-05-29NEWSen作者: Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

摘要

arXiv:2605.30151v1 Announce Type: new Abstract: As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein \& Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category).