Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

PRODUCT_LAUNCH · 2026-06-02 · Impact: MEDIUM

arXiv:2606.00920v1 · Announce Type: cross

Abstract: Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points, and the gap is largest precisely for mid-performing systems. We investigate this accuracy-stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Stan
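
To make the gap described in the abstract concrete, here is a minimal Python sketch. The abstract is truncated and does not define either metric, so the definitions here are assumptions: "run-level pass rate" is taken to be the fraction of all individual runs that pass, and "retry-free coverage" is taken to be the fraction of tasks that pass on every repeated run. The function names and the sample data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def run_level_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that passed (assumed definition)."""
    total_runs = sum(len(runs) for runs in results.values())
    passed_runs = sum(sum(runs) for runs in results.values())
    return passed_runs / total_runs


def retry_free_coverage(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that passed on every run (assumed definition)."""
    stable_tasks = sum(all(runs) for runs in results.values())
    return stable_tasks / len(results)


# Hypothetical outcomes: 4 tasks, 5 repeated runs each (True = run passed).
results = {
    "task_a": [True, True, True, True, True],       # always passes
    "task_b": [True, True, False, True, True],      # unstable
    "task_c": [True, False, True, True, False],     # unstable
    "task_d": [False, False, False, False, False],  # always fails
}

pass_rate = run_level_pass_rate(results)
coverage = retry_free_coverage(results)
gap_points = (pass_rate - coverage) * 100

print(f"run-level pass rate: {pass_rate:.2f}")
print(f"retry-free coverage: {coverage:.2f}")
print(f"gap: {gap_points:.1f} percentage points")
```

Under these assumed definitions, tasks that pass only some of the time count toward the run-level pass rate but not toward retry-free coverage, so the first number can only be greater than or equal to the second. The sketch assumes every task has at least one recorded run.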
