VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild [Event]

PRODUCT_LAUNCH | 2026-05-28 | Impact: MEDIUM

arXiv:2605.27882v1 (Announce Type: new)

Abstract: LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refin