The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

arXiv cs.CL | 2026-05-26 | Author: Adil Amin

Abstract

arXiv:2605.18840v2

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual (the $h$-field), which diagnoses capability emphasis from just two public benchmark scores. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but the degree of cooperation varies systematically: per-lab coupling slopes span a factor of $5$ (Google $1.15$ vs. DeepSeek $0.23$), and labs pivot. DeepSeek reversed from reasoning-rich to coding-first ($\Delta h = 15.9$ pp), while Anthropic oscillates between coding excursions and recovery.