VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio

ArXiv CS.AI | 2026-06-02 | Authors: Maris Basha, Anja Zai, Sabine Stoll, Richard Hahnloser

Abstract

arXiv:2512.10120v2

General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings, with no parameters updated and no labels used (a label-free PCA whitening is fit per subset to correct anisotropy). VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds, isolating content representation from source separation (polyphonic mixtures are out of scope). We evaluate embeddings with Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation, calibrated by lift over an empirical permutation baseline.
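As a rough illustration of the evaluation described above, the following sketch whitens a set of frozen embeddings without using labels, computes Precision@k, and compares it against a label-permutation baseline. It is not the authors' implementation. The function names, the cosine similarity metric, the toy data, and the reading of "lift" as a simple ratio are all assumptions made here for illustration. GSR is omitted because the abstract does not define it.

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Label-free PCA whitening: center, rotate onto principal axes, scale to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    return Xc @ eigvecs / np.sqrt(eigvals + eps)

def precision_at_k(X, labels, k=5):
    """Mean fraction of each point's k nearest neighbours that share its label.
    Cosine similarity is an assumption; the abstract does not name a metric."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)           # a point is not its own neighbour
    nn = np.argsort(-sim, axis=1)[:, :k]     # indices of the k most similar points
    return float((labels[nn] == labels[:, None]).mean())

def permutation_baseline(X, labels, k=5, n_perm=20, seed=0):
    """Empirical chance level: Precision@k averaged over random label shuffles."""
    rng = np.random.default_rng(seed)
    scores = [precision_at_k(X, rng.permutation(labels), k) for _ in range(n_perm)]
    return float(np.mean(scores))

# Toy stand-in for frozen embeddings: three tight clusters with anisotropic noise.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 40)
X = centers[labels] + rng.normal(size=(120, 2)) * np.array([2.0, 0.2])

X_white = pca_whiten(X)                      # labels are used only for scoring below
p_at_k = precision_at_k(X_white, labels, k=5)
baseline = permutation_baseline(X_white, labels, k=5)
lift = p_at_k / baseline                     # assumption: lift taken as a ratio

print(f"P@5={p_at_k:.3f}  baseline={baseline:.3f}  lift={lift:.2f}")
```

Nothing in the sketch is fit to the labels: the whitening uses only the embeddings, and the labels enter only when neighbours are scored, which mirrors the training-free setup the abstract describes.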
