Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

ArXiv CS.AI | 2026-05-27 | NEWS | en | Authors: Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

Details

Source site
ArXiv CS.AI
Authors
Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica
Article type
NEWS
Language
en
Publication date
2026-05-27

Abstract

arXiv:2602.01518v2 (announce type: replace)

Despite the importance of Top-k and Top-p in model sampling, implementing them efficiently for large vocabularies remains a significant challenge. Existing approaches often rely either on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic methods that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on pivot-based truncation and selection. Qrita uses pivot-based search for both Top-k and Top-p, with two key techniques: (1) Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and (2) quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output.
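To make the two techniques concrete, here is a minimal pure-Python sketch of how sort-free Top-k and Top-p selection of this kind could look. It is an illustration under stated assumptions, not the authors' implementation: the abstract gives no algorithmic details, so the function names (`pivot_threshold`, `sigma_truncate`, `select`, `top_k`, `top_p`), the fixed cutoff parameter `z`, the fall-back to the full vocabulary when truncation removes too much, and the lowest-index tie-breaking rule are all hypothetical choices. It also assumes finite logits.

```python
import math
import statistics


def pivot_threshold(values, weights, target):
    """Largest data value t whose upper tail {v >= t} carries weight >= target.

    Assumes 0 < target <= sum(weights) and non-negative weights.
    """
    def mass_gt(x):
        # One reduction pass over the candidates (no sorting involved).
        return sum(w for v, w in zip(values, weights) if v > x)

    lo, hi = min(values), max(values)  # t always stays inside [lo, hi]
    while lo < hi:
        step = (hi - lo) / 4.0
        below = above = None
        # Three pivots split [lo, hi] into four parts (quaternary search).
        for pivot in (lo + step, lo + 2.0 * step, lo + 3.0 * step):
            if mass_gt(pivot) >= target:
                below = pivot   # t lies strictly above this pivot
            else:
                above = pivot   # t lies at or below this pivot
                break
        # Snap both bounds to actual data values so the loop ends exactly.
        if below is not None:
            lo = min(v for v in values if v > below)
        if above is not None:
            hi = max(v for v in values if v <= above)
    return lo


def sigma_truncate(logits, z):
    """Indices whose logit is at least mean + z * std (assumed cutoff rule)."""
    cutoff = statistics.fmean(logits) + z * statistics.pstdev(logits)
    return [i for i, v in enumerate(logits) if v >= cutoff]


def select(logits, weights, target, z=2.0):
    idx = sigma_truncate(logits, z)
    if sum(weights[i] for i in idx) < target:
        idx = range(len(logits))  # truncation removed too much: fall back
    vals = [logits[i] for i in idx]
    wts = [weights[i] for i in idx]
    t = pivot_threshold(vals, wts, target)
    keep = [i for i in idx if logits[i] > t]
    mass = sum(weights[i] for i in keep)
    for i in idx:  # duplicates of t: admit lowest indices first
        if mass >= target:
            break
        if logits[i] == t:
            keep.append(i)
            mass += weights[i]
    return sorted(keep)


def top_k(logits, k, z=2.0):
    # Unit weights: the tail "mass" is simply a count.
    return select(logits, [1] * len(logits), k, z)


def top_p(logits, p, z=2.0):
    m = max(logits)
    weights = [math.exp(v - m) for v in logits]  # unnormalised softmax
    return select(logits, weights, p * sum(weights), z)
```

For example, `top_k([1.0, 3.0, 2.0, 3.0, 0.5, 3.0], 2)` returns `[1, 3]`: three tokens tie at the threshold value 3.0, and the tie-breaking rule deterministically admits the two with the lowest indices. In this sketch every pivot test is a plain reduction over the surviving candidates, which is the kind of operation that maps well to GPUs, and using three pivots per round shrinks the search interval to a quarter instead of a half.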