The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

PRODUCT_LAUNCH | 2026-05-26 | Impact: MEDIUM

arXiv:2605.18840v2 Announce Type: replace-cross

Abstract: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual ($h$-field) that diagnoses
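
To make the idea of a "population trend plus per-release residual" concrete, here is a minimal sketch that assumes the trend is an ordinary least-squares line fitted across paired scores. This is an illustrative assumption, not the paper's actual $h$-field definition, which the excerpt does not give. The release names and scores are synthetic placeholders, and `Release`, `fit_trend`, and `residuals` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class Release:
    name: str         # hypothetical release label
    swe_bench: float  # synthetic placeholder score (percent)
    gpqa: float       # synthetic placeholder score (percent)


def fit_trend(releases):
    """Fit gpqa = slope * swe_bench + intercept by ordinary least squares."""
    n = len(releases)
    mean_x = sum(r.swe_bench for r in releases) / n
    mean_y = sum(r.gpqa for r in releases) / n
    sxx = sum((r.swe_bench - mean_x) ** 2 for r in releases)
    sxy = sum((r.swe_bench - mean_x) * (r.gpqa - mean_y) for r in releases)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(releases, slope, intercept):
    """Per-release deviation of the GPQA score from the fitted trend line."""
    return {
        r.name: r.gpqa - (slope * r.swe_bench + intercept)
        for r in releases
    }


# Synthetic data for illustration only; these are not real benchmark results.
releases = [
    Release("release_a", 40.0, 60.0),
    Release("release_b", 50.0, 65.0),
    Release("release_c", 60.0, 76.0),
    Release("release_d", 70.0, 79.0),
]

slope, intercept = fit_trend(releases)
resid = residuals(releases, slope, intercept)

for name, value in resid.items():
    print(f"{name}: residual {value:+.2f}")
```

Under this assumption, a positive residual marks a release whose GPQA score sits above what the population trend predicts from its SWE-bench score, and a negative residual marks one that sits below it.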