Abstract
arXiv:2605.25955v1. Large language models (LLMs) face two problems in the evaluation of creative capability. First, existing benchmarks (e.g., Story Cloze Test, HellaSwag) use multiple-choice recognition paradigms, so they measure a model's ability to discriminate among narrative continuations rather than its ability to generate creatively. Second, rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or on natural-language model outputs, and cannot provide an objective, automated scoring mechanism. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze.
Related events
QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability
2026-05-26 | PRODUCT_LAUNCH | Impact: MEDIUM