详细信息
- 来源站点
- ArXiv CS.CL
- 作者
- Nicol\'as Benjam\'in Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin
- 文章类型
- NEWS
- 语言
- en
- 发布日期
- 2026-05-28
摘要
arXiv:2605.28313v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought to approximate expert pairwise comparisons of argument quality across three dimensions-logical, rhetorical, and dialectic-and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen's $\kappa$ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477).