Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

arXiv cs.CL | 2026-06-05 | Authors: Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

Abstract

arXiv:2606.05988v1 (announce type: cross)

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x, with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers.
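The "8.6-21.0% of their original character length" figure is a character-level compression ratio. As a minimal sketch of how such a ratio could be computed, the Python below measures it per trace and over a whole corpus. The function names and the toy strings are hypothetical, and whether the paper averages per-trace ratios or divides corpus totals is not stated in the abstract, so both variants are shown as assumptions rather than as the authors' method.

```python
from typing import Iterable, Tuple


def char_compression_ratio(raw: str, compressed: str) -> float:
    """Fraction of the raw trace's characters kept after compression."""
    if not raw:
        raise ValueError("raw trace must be non-empty")
    return len(compressed) / len(raw)


def corpus_compression_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total compressed characters divided by total raw characters.

    Assumption: a corpus-level ratio of totals; a mean of per-trace
    ratios would weight short traces more heavily.
    """
    total_raw = 0
    total_compressed = 0
    for raw, compressed in pairs:
        total_raw += len(raw)
        total_compressed += len(compressed)
    if total_raw == 0:
        raise ValueError("corpus contains no raw characters")
    return total_compressed / total_raw


if __name__ == "__main__":
    # Toy stand-ins for (raw trace, compressed trace); not data from the paper.
    toy_pairs = [("x" * 1000, "x" * 100), ("y" * 200, "y" * 50)]
    for raw, compressed in toy_pairs:
        print(f"per-trace ratio: {char_compression_ratio(raw, compressed):.3f}")
    print(f"corpus ratio: {corpus_compression_ratio(toy_pairs):.3f}")
```

On the toy pairs the two views differ (per-trace ratios of 0.100 and 0.250 against a corpus ratio of 0.125), which is why the aggregation choice matters when comparing a reported range such as 8.6-21.0% across compressors.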