RAG Citation Accuracy Test Plan Builder

Design a test plan for retrieval-augmented generation citation accuracy, source grounding, quote fidelity, chunk mapping, regressions, and release gates.

Prompt Template

You are an AI application QA lead. Build a RAG citation accuracy test plan for an application that answers from retrieved sources.

Application context: [product, users, risk level]
Knowledge sources: [docs, PDFs, database records, tickets, policies, help center, contracts]
Retrieval stack: [vector DB, hybrid search, reranker, embeddings model, chunking strategy, metadata filters]
Generation stack: [LLM, prompt style, citation format, streaming, tool calls]
Citation requirements: [document title, URL, page, paragraph, chunk ID, quote, footnote, inline citation]
Known failure modes: [wrong source, unsupported claim, stale doc, quote mismatch, citation to irrelevant chunk]
Test data available: [golden questions, expert answers, source labels, production logs, synthetic cases]
User workflows: [search answer, support reply, policy Q&A, contract review, research assistant]
Compliance or audit needs: [legal, medical, financial, enterprise, internal policy]
Release process: [CI, manual QA, eval harness, staging, canary]
Success thresholds: [acceptable grounding score, zero-tolerance failures, human review rules]

Create:
1. Citation accuracy dimensions with precise pass/fail criteria.
2. Golden test set design covering easy, hard, ambiguous, stale, conflicting, and no-answer cases.
3. Source-grounding rubric for claim support, citation relevance, quote fidelity, metadata correctness, and refusal behavior.
4. Retrieval diagnostics for recall, ranking, chunk quality, metadata filters, and source freshness.
5. Generation diagnostics for unsupported synthesis, citation formatting, hallucinated quotes, and overconfident answers.
6. Automated eval ideas plus human review workflow for borderline cases.
7. Regression test plan for chunking, embedding model, reranker, prompt, and knowledge-base updates.
8. Release gates and severity levels that block deployment.
9. Dashboard spec for citation precision, unsupported-claim rate, no-answer accuracy, source age, and reviewer agreement.
10. Example test cases and expected results table.

Be strict. A cited answer is not correct unless the cited source actually supports the claim.

Example Output

Citation Eval Dimensions

| Dimension | Pass Criteria | Blocker Example |
|---|---|---|
| Claim support | Every material claim is supported by cited source text | Answer says refunds are 60 days but cited policy says 30 days |
| Citation relevance | Citation points to the exact document section used | Citation points to general FAQ, not the pricing clause |
| Quote fidelity | Quoted text matches source text exactly or is clearly paraphrased | Model invents a quote with quotation marks |
| No-answer behavior | Refuses or asks for more context when sources do not support an answer | Fabricates policy from similar article |
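
The quote fidelity row lends itself to an automated check. The following is a minimal sketch, assuming the answer marks quotes with double quotation marks and that the cited chunk's text is available as a plain string; the function names (`normalize`, `unsupported_quotes`) and the sample strings are hypothetical and not part of any specific eval framework.

```python
import re

# Map curly quotes to straight quotes so typographic differences do not fail the check.
_CURLY = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def normalize(text: str) -> str:
    """Unify quote characters and collapse runs of whitespace."""
    for curly, straight in _CURLY.items():
        text = text.replace(curly, straight)
    return re.sub(r"\s+", " ", text).strip()


def unsupported_quotes(answer: str, source_text: str) -> list[str]:
    """Return every double-quoted span in the answer that is not found verbatim in the source."""
    source = normalize(source_text)
    quotes = re.findall(r'"([^"]+)"', normalize(answer))
    return [q for q in quotes if q.strip() not in source]


# Hypothetical sample data for illustration only.
source = "Refunds are available within 30 days of purchase."
good_answer = 'The policy states "within 30 days of purchase".'
bad_answer = 'The policy states "within 60 days of purchase".'

print(unsupported_quotes(good_answer, source))
print(unsupported_quotes(bad_answer, source))
```

A non-empty result would be treated as a quote fidelity failure. Exact substring matching is deliberately strict; paraphrases need a separate claim-support check, since this sketch only covers text presented as a direct quote.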

Golden Case

Question: Can enterprise customers export audit logs for 7 years?

Expected: Answer yes only if the retrieved enterprise security addendum confirms 7-year audit log retention and export. Cite the document title, section, page, and chunk ID. If only standard-plan docs are retrieved, state that the available sources do not confirm it.
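
A golden case like this one can be stored as a structured record so the expected behavior is checked mechanically. This is a rough sketch under stated assumptions: the `GoldenCase` class, its field names, and the document title "Enterprise Security Addendum" are illustrative placeholders, and real retrieval results would likely carry more metadata than a list of titles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    question: str
    required_source: str  # document that must be retrieved before the system may answer
    citation_fields: tuple[str, ...] = ("title", "section", "page", "chunk_id")


def expected_behavior(case: GoldenCase, retrieved_titles: list[str]) -> str:
    """Answer only when the required source was retrieved; otherwise expect a no-answer."""
    if case.required_source in retrieved_titles:
        return "answer_with_citation"
    return "no_answer"


def missing_citation_fields(case: GoldenCase, citation: dict[str, str]) -> list[str]:
    """List the required citation fields that are absent or empty."""
    return [f for f in case.citation_fields if not citation.get(f)]


# Hypothetical golden case mirroring the example above.
case = GoldenCase(
    question="Can enterprise customers export audit logs for 7 years?",
    required_source="Enterprise Security Addendum",
)

print(expected_behavior(case, ["Standard Plan Docs"]))
print(missing_citation_fields(case, {"title": "Enterprise Security Addendum", "page": "4"}))
```

Retrieving the required document is a necessary condition here, not a sufficient one: a reviewer or a claim-support check still has to confirm that the cited section states the 7-year figure.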

Release Gate

Zero high-severity unsupported claims in regulated workflows and citation precision above 95% on the golden set before production rollout.
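
A gate of this shape can be expressed as a small function in an eval harness. The sketch below assumes each golden-set result records how many citations were checked, how many were judged supported, and how many high-severity unsupported claims were found; the `CaseResult` fields and the `release_gate` name are hypothetical, and "above 95%" is read as strictly greater than 0.95.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    citations_total: int
    citations_supported: int
    high_severity_unsupported: int
    regulated: bool  # True when the case belongs to a regulated workflow


def release_gate(results: list[CaseResult], min_precision: float = 0.95) -> dict:
    """Compute citation precision and count blockers, then decide pass or fail."""
    total = sum(r.citations_total for r in results)
    supported = sum(r.citations_supported for r in results)
    # With no citations to evaluate, treat precision as 0 so the gate fails closed.
    precision = supported / total if total else 0.0
    blockers = sum(r.high_severity_unsupported for r in results if r.regulated)
    passed = precision > min_precision and blockers == 0
    return {"precision": precision, "blockers": blockers, "passed": passed}


# Hypothetical results for illustration only.
results = [
    CaseResult("refund-window", 20, 20, 0, regulated=False),
    CaseResult("audit-log-retention", 20, 19, 0, regulated=True),
]

print(release_gate(results))
```

Failing closed on an empty result set is a design choice in this sketch: a golden-set run that produced no citations should not be able to pass the gate by default.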

Tips for Best Results

💡 Provide examples of wrong citations you have seen so the test plan targets real failure modes.
💡 Evaluate retrieval and generation separately before blaming the model.
💡 Include no-answer cases; RAG systems often fail by citing weakly related sources.
💡 Require citation-to-claim checks, not just citation formatting checks.
