A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction

ArXiv CS.CL | 2026-06-16 | NEWS | en | Authors: Cameron Morin, Matti Marttinen Larsson

Details

Source site
ArXiv CS.CL
Authors
Cameron Morin, Matti Marttinen Larsson
Article type
NEWS
Language
en
Publication date
2026-06-16

Abstract

arXiv:2510.12306v3 (announce type: replace)

As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline's accessibility and effectiveness through a diachronic case study of variation in the English evaluative consider construction (consider X as/to be/Ø Y). We annotate 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures.
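The batch-processing and post-hoc validation phases named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the label set, the function names, and the regex-based `heuristic_annotate` are all hypothetical. The regex function stands in for an LLM call so that the example runs offline; in a real pipeline it would be replaced by a function that sends each concordance line and a prompt to a model.

```python
import re
from typing import Callable, List

# Hypothetical label set for the complement of evaluative "consider":
# "as", "to be", "zero" (the Ø variant), and "other" (non-evaluative uses).


def heuristic_annotate(line: str) -> str:
    """Stand-in for an LLM call (assumption): label by surface pattern only."""
    match = re.search(r"\bconsider(?:s|ed|ing)?\b(.*)", line, flags=re.IGNORECASE)
    if match is None:
        return "other"
    tail = match.group(1)
    if re.search(r"\bto be\b", tail):
        return "to be"
    if re.search(r"\bas\b", tail):
        return "as"
    # A surface heuristic cannot tell Ø complements from non-evaluative uses.
    return "zero"


def annotate_in_batches(
    lines: List[str], annotate: Callable[[str], str], batch_size: int = 2
) -> List[str]:
    """Annotate concordance lines batch by batch, preserving input order."""
    labels: List[str] = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        labels.extend(annotate(line) for line in batch)
    return labels


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Share of predicted labels that match a manually annotated gold sample."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Invented example lines, not taken from COHA.
sample = [
    "They consider him as a friend.",
    "She considered the plan to be risky.",
    "We consider this proposal sound.",
    "I will consider it tomorrow.",
]
gold = ["as", "to be", "zero", "other"]

pred = annotate_in_batches(sample, heuristic_annotate)
print(pred)                  # ['as', 'to be', 'zero', 'zero']
print(accuracy(pred, gold))  # 0.75
```

The stand-in mislabels the last line, a non-evaluative use of "consider", which is the kind of error a validation step against a gold sample is meant to surface. As a rough check on the reported scale, 143,933 lines in under 60 hours implies a throughput of more than about 2,400 lines per hour.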