Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

ArXiv CS.CL | 2026-05-28 | Author: Syed Huma Shah (Duke University)

Abstract

arXiv:2605.27494v1 (Announce Type: cross). Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems (RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe.