Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO 文章

ArXiv CS.CL2026-06-05NEWSen作者: Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad

详细信息

来源站点
ArXiv CS.CL
作者
Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad
文章类型
NEWS
语言
en
发布日期
2026-06-05

摘要

arXiv:2606.05174v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes.

相关事件

暂无数据

相关公司

暂无数据

相关人物

暂无数据