How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation (Paper)

Published 2016 · Citations: 908

Topics: Topic Modeling; Speech and dialogue systems; Natural Language Processing Techniques

Abstract

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.

Authors (4 of 5 shown)

Joëlle Pineau

Laurent Charlin

Iulian Vlad Serban

Ryan Lowe
