ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

ArXiv CS.AI · 2026-06-03 · Authors: Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

Abstract

arXiv:2604.23099v2

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded.
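To make the two framings in the abstract concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a finite pool of candidate inputs embedded on [0, 1], a zero-mean GP with a fixed RBF kernel standing in for the pre-trained prior, a toy `severity` function in place of a real model and rater, and a plain maximum-variance rule for choosing the next input. All function names, hyperparameters, and the threshold are hypothetical. The BQ estimate is the average of the GP posterior mean over the pool, and failure discovery ranks inputs by the posterior probability that the score exceeds a threshold.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(pool, x_obs, y_obs, noise=1e-4):
    """Posterior mean and covariance of a zero-mean GP over the whole pool."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_po = rbf_kernel(pool, x_obs)
    solved = np.linalg.solve(k_oo, k_po.T)  # K_oo^{-1} K_op
    mean = solved.T @ y_obs
    cov = rbf_kernel(pool, pool) - k_po @ solved
    return mean, cov

def bq_estimate(mean, cov):
    """Average score over the pool (uniform weights) and its posterior variance."""
    n = len(mean)
    return float(mean.mean()), max(float(cov.sum()) / n**2, 0.0)

def exceedance_probability(mean, cov, threshold):
    """P(f(x) > threshold) per pool input under the Gaussian posterior."""
    std = np.sqrt(np.clip(np.diag(cov), 1e-12, None))
    z = (mean - threshold) / std
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])

# Hypothetical setup: 50 candidate test inputs and a toy severity score in [0, 1].
pool = np.linspace(0.0, 1.0, 50)

def severity(x):
    return 0.5 + 0.5 * np.sin(6.0 * x)

observed = [0]  # start from a single evaluated input
mean, cov = gp_posterior(pool, pool[observed], severity(pool[observed]))
_, variance_start = bq_estimate(mean, cov)

# Actively evaluate 9 more inputs, each time the one the GP is least sure about.
for _ in range(9):
    var = np.diag(cov).copy()
    var[observed] = -np.inf  # never re-select an evaluated input
    observed.append(int(np.argmax(var)))
    mean, cov = gp_posterior(pool, pool[observed], severity(pool[observed]))

estimate, variance = bq_estimate(mean, cov)
true_mean = float(severity(pool).mean())

# Superlevel-set view: which inputs most likely have severity above 0.9?
p_fail = exceedance_probability(mean, cov, threshold=0.9)
likely_failures = pool[np.argsort(p_fail)[::-1][:5]]

print(f"BQ estimate {estimate:.3f} (true pool mean {true_mean:.3f}), "
      f"posterior variance {variance:.2e}")
print("inputs most likely above threshold:", np.round(likely_failures, 2))
```

Only 10 of the 50 inputs are scored, yet the estimate covers the whole pool because the GP carries information between nearby inputs. The maximum-variance rule is one simple choice; a rule that targets the variance of the integral itself, or one driven by `p_fail` when hunting failures, would slot into the same loop.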