Abstract
arXiv:2604.17943v2

RAG-based question answering (QA) in specialist domains faces a cold-start problem: there are no evaluation benchmarks and no labeled data for post-training. We present DoRA (Domain-oriented RAG Assessment), a benchmark construction and evaluation framework that requires only a small set of specialist-domain documents. DoRA systematically generates synthetic QA training and evaluation datasets with auditable evidence across five domain-specific intents. To mitigate same-pipeline circularity, DoRA's training and test splits are generated by different LLM families (Claude Sonnet for training; GPT-4o for test) and are drawn from disjoint seed-document corpora. Instantiated on 40 English-language defense-related documents, DoRA yields ~6.6K curated instances. Compared against 8 LLM baselines over a benchmark of 1,259 samples, a LoRA-adapted Llama3.