Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm 文章

ArXiv CS.CL2026-06-10NEWSen作者: Yuming (Rapheal), Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

摘要

arXiv:2605.27914v2 Announce Type: replace Abstract: Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving;

相关公司

暂无数据

相关人物

暂无数据

相关产品

暂无数据