$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval 文章

ArXiv CS.AI2026-06-03NEWSen作者: Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

摘要

arXiv:2601.20844v3 Announce Type: replace-cross Abstract: This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors so that every subset of size at most $k$ is exactly retrieved by score comparison. Our result shows MED is $\Theta(k)$, independent of $m$, for inner product, Euclidean distance, and cosine similarity. We then consider Robust MED (RMED), where all vectors are unit normed and an $\epsilon$ gap of scores is required. We derive the $m$-dependent feasibility ceiling $\epsilon_\star(m,k)=m/\sqrt{k(m-1)(m-k)}$, which approaches $1/\sqrt{k}$ when $m\gg k$, and a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulation on synthetic top-$2$ retrieval with cyclic polytope and centroid query optimization confirmed our theoretical claims.

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据