RAG Evaluation
Measuring retrieval-augmented generation systems across both the retrieval and generation stages.
RAG evaluation splits into two halves: retrieval quality (did we fetch the right context?) and generation quality (did we use that context faithfully?).
Retrieval is usually scored with recall@k, precision@k, MRR, and NDCG against a labeled dataset that maps each query to its relevant chunks. Generation is scored on faithfulness, answer relevance, and context utilization, typically via LLM-as-a-judge with grounded rubrics.
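To make the retrieval metrics concrete, here is a minimal sketch in plain Python. It assumes binary relevance labels (a chunk is either relevant or not), unique chunk IDs in each ranked list, and hypothetical function names and IDs such as "c1" chosen purely for illustration. It does not show how Ragas, DeepEval, or TruLens compute these scores.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved.

    MRR is the mean of this value over all queries in the dataset.
    """
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    # The ideal ranking places every relevant chunk first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# One labeled example: ranked retriever output and the gold chunk IDs.
retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}

print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(precision_at_k(retrieved, relevant, k=3))  # one hit in three results
print(reciprocal_rank(retrieved, relevant))      # 0.5
print(ndcg_at_k(retrieved, relevant, k=3))
```

In practice each function would be averaged over every query in the labeled set. Graded relevance labels, if available, would replace the binary gain of 1.0 in the NDCG sum.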
Tools like Ragas, DeepEval, and TruLens automate this, but the dataset matters more than the tooling: 50–200 carefully curated Q/A pairs beat 10,000 noisy ones.
Go deeper
Read the full pillar guide on LLM Evaluation or compare evaluation tools in the Tool Comparison Hub.