详细信息
- 来源站点
- ArXiv CS.CL
- 作者
- June M. Liu, Shaomian Zheng, He Cao, Dingnan Jin, Qing Cui, Jun Zhou
- 文章类型
- NEWS
- 语言
- en
- 发布日期
- 2026-06-11
摘要
arXiv:2606.11643v1 Announce Type: new Abstract: Large language models often remain sensitive to answer format: a question solved correctly in one form may fail in another semantically equivalent form. To study this gap, we define cross-format robustness as the extent to which a model answers the same underlying question consistently across formats. We then compare full-format training with FormatMix, which expands only a subset of training items into multiple equivalent formats using either random or targeted selection. Across GLM4 and Llama-3.1, multi-format supervision consistently improves both task performance and cross-format robustness, whereas Multiple-choice question (MCQ)-only supervision alone brings little benefit and can even reduce robustness. We further find that expanding only about 30% of the training set into multiple formats often recovers most of the gain from full-format training, and this effect appears across the model families and sizes we study.