When Mean CE Fails: Median CE Can Better Track Language Model Quality

ArXiv CS.AI | 2026-05-26 | Authors: Hao Guo, Simon Dennis, Rivaan Patil, Kevin Shabahang

Abstract

arXiv:2605.24667v1 (announce type: new)

Mean cross-entropy (CE) is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than mean CE does. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution.
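To make the mean-versus-median distinction concrete, here is a minimal sketch in Python. It is not the authors' code: the helper names `per_token_ce` and `summarize` are hypothetical, and the token probabilities are invented purely for illustration. It assumes per-token CE is the negative log-probability (in nats) that the model assigns to the correct token, and it reports the mean, the median, and one tail percentile of that distribution.

```python
import math
import statistics


def per_token_ce(correct_token_probs):
    """Per-token cross-entropy in nats: -log p(correct token)."""
    return [-math.log(p) for p in correct_token_probs]


def summarize(ce_values):
    """Mean, median, and 95th percentile of a per-token CE sample."""
    # With n=100, statistics.quantiles returns the 99 cut points p1..p99,
    # so index 94 is the 95th percentile.
    cuts = statistics.quantiles(ce_values, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ce_values),
        "median": statistics.median(ce_values),
        "p95": cuts[94],
    }


# Hypothetical checkpoints (made-up probabilities, NOT data from the paper).
# "early": the model is uniformly unsure about every token.
early_stats = summarize(per_token_ce([0.5] * 100))

# "late": the model is confident and right on 90% of tokens (the bulk),
# but confidently wrong on the remaining 10% (the tail).
late_stats = summarize(per_token_ce([0.9] * 90 + [1e-6] * 10))

for name, stats in (("early", early_stats), ("late", late_stats)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

In this toy sample the small, very high-loss tail pulls the mean up between the two checkpoints even though the median falls, which is the kind of divergence between bulk and tail behavior that the abstract describes.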