How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

arXiv cs.CV | 2026-05-29 | Authors: Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

Abstract

arXiv:2605.29448v1 (announce type: cross)

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, among many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. Finally, we develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing the cost of marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries.
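To make the objects named in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation. It computes the Vendi Score in its standard form (the exponential of the von Neumann entropy of the eigenvalues of $K/n$), a log-determinant objective of the kind used with DPPs, and a naive greedy selector that re-evaluates the objective from scratch on every query. The cosine-similarity kernel, the function names (`vendi_score`, `logdet_value`, `naive_greedy`), and the specific form $\log\det(I + X_S X_S^\top)$ are assumptions chosen for illustration. The paper's exact objective definitions and its secular-equation updates are not reproduced here.

```python
import numpy as np


def vendi_score(X):
    """Vendi Score of the rows of X under a cosine-similarity kernel (assumed)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit rows, so K_ii = 1
    n = X.shape[0]
    K = X @ X.T
    lam = np.linalg.eigvalsh(K / n)       # eigenvalues of K/n sum to 1
    lam = lam[lam > 1e-12]                # treat 0 * log(0) as 0
    entropy = -np.sum(lam * np.log(lam))  # von Neumann entropy
    return float(np.exp(entropy))


def logdet_value(X, subset):
    """Illustrative DPP-style objective: log det(I + X_S X_S^T)."""
    if len(subset) == 0:
        return 0.0
    Xs = X[list(subset)]
    _, logdet = np.linalg.slogdet(np.eye(len(subset)) + Xs @ Xs.T)
    return float(logdet)


def naive_greedy(X, k, objective):
    """Baseline greedy selection: every marginal gain is a full oracle query."""
    selected = []
    for _ in range(k):
        base = objective(X, selected)
        remaining = [i for i in range(len(X)) if i not in selected]
        gains = [objective(X, selected + [i]) - base for i in remaining]
        selected.append(remaining[int(np.argmax(gains))])
    return selected


# Usage: four orthogonal points are maximally diverse; four copies are not.
diverse = vendi_score(np.eye(4))         # approximately 4.0
redundant = vendi_score(np.ones((4, 3)))  # approximately 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
picked = naive_greedy(X, 3, logdet_value)
```

Each call to `objective` inside `naive_greedy` rebuilds and factorizes a matrix from scratch. That is the kind of repeated oracle cost that the abstract's secular-equation-based updates are designed to avoid.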