What is LLM evaluation?
LLM evaluation is the discipline of measuring whether a language-model-powered system — a prompt, RAG pipeline, agent, or fine-tuned model — does what it's supposed to do, safely and reliably, before and after it ships. It blends classic ML evaluation (datasets, metrics, regression) with new techniques unique to generative systems: reference-free scoring, LLM-as-a-judge, faithfulness checks, and adversarial probing.
Why it matters
Without evals, every prompt change is a guess. Teams that ship LLM features without an eval harness either move slowly (manual spot-checks) or break things in production (hallucinations, regressions, silent quality drops). A modern eval stack does for AI features what unit tests, integration tests, and observability do for conventional software.
Core metrics
Pick metrics that match the task. Most production stacks combine several categories:
- Reference-based: BLEU, ROUGE, exact-match, METEOR — fast, deterministic, surface-level.
- Semantic: BERTScore, embedding cosine similarity — captures paraphrasing.
- Reference-free: LLM-as-a-judge with grounded rubrics — handles open-ended tasks.
- Task-specific: Faithfulness, citation accuracy, tool-call correctness, JSON validity.
- Safety: Toxicity, PII leak, jailbreak resistance, bias probes.
See the BLEU, ROUGE, and perplexity glossary entries.
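To make the categories above concrete, here is a minimal sketch of three cheap deterministic checks: exact match, a unigram-overlap F1, and a JSON validity check. The overlap score is a simplified stand-in for what ROUGE-1 measures (no stemming, no official tokenizer), not a replacement for a real ROUGE implementation. The function names and the normalization rule are assumptions made for illustration.

```python
import json
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and keep only alphanumeric tokens (an assumed, simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings normalize to the same token sequence."""
    return normalize(prediction) == normalize(reference)


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall, with clipped counts."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def is_valid_json(output: str, required_keys: set[str]) -> bool:
    """True when the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)
```

For example, `unigram_f1("the cat sat", "the cat")` has precision 2/3 and recall 1, which gives an F1 of about 0.8. Checks like these are fast enough to run on every example, which is why they often sit alongside slower semantic or judge-based metrics.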
LLM-as-a-judge
An evaluator LLM scores outputs against a rubric you write. Done well, judges scale qualitative review, and they stay calibrated as long as you regularly check their verdicts against a fresh sample of human labels (weekly, for example). Done badly, they amplify position bias, verbosity bias, and self-preference. Best practices: structured JSON output, chain-of-thought reasoning before the verdict, pairwise comparisons, randomized answer order, and a small human-labeled calibration set.
Read more: LLM-as-a-judge glossary entry.
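Several of the practices above (pairwise comparison, randomized order, reasoning before the verdict, structured JSON output) can be sketched in a few lines. This is an illustration under stated assumptions, not any particular framework's API: `call_judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its raw text reply, and the prompt wording is only a placeholder.

```python
import json
import random
from typing import Callable, Optional

# Placeholder prompt; "reasoning" comes before "winner" so the judge
# explains itself before committing to a verdict.
JUDGE_PROMPT = """Compare two answers to the same question using the rubric.
Question: {question}
Rubric: {rubric}
Answer A: {a}
Answer B: {b}
Think step by step, then reply with JSON only, in this shape:
{{"reasoning": "<your comparison>", "winner": "A" | "B" | "tie"}}"""


def pairwise_judge(
    question: str,
    candidate: str,
    baseline: str,
    rubric: str,
    call_judge: Callable[[str], str],  # hypothetical: prompt in, raw reply out
    rng: Optional[random.Random] = None,
) -> str:
    """Return 'candidate', 'baseline', 'tie', or 'invalid'."""
    rng = rng or random.Random()
    # Randomize which answer is shown as "A" to reduce position bias.
    swapped = rng.random() < 0.5
    a, b = (baseline, candidate) if swapped else (candidate, baseline)
    raw = call_judge(JUDGE_PROMPT.format(question=question, rubric=rubric, a=a, b=b))
    try:
        winner = json.loads(raw)["winner"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "invalid"  # malformed replies are tracked, not guessed at
    if winner == "tie":
        return "tie"
    if winner not in ("A", "B"):
        return "invalid"
    # Map the positional label back to the answer it referred to.
    candidate_label = "B" if swapped else "A"
    return "candidate" if winner == candidate_label else "baseline"
```

Counting `invalid` results separately is a deliberate choice here: a rising rate of unparseable verdicts is itself a signal that the judge prompt or model needs attention. Running the same function over the human-labeled calibration set gives an agreement rate to track over time.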
Evaluating RAG systems
RAG evaluation splits into retrieval (recall@k, MRR, NDCG) and generation (faithfulness, answer relevance, context utilization). A carefully labeled set of 50–200 queries is usually more useful than a noisy set of 10,000. Track both halves on every change: a generation regression often traces back to a retrieval regression introduced a few commits earlier.
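The retrieval metrics named above are small enough to write by hand. The sketch below assumes binary relevance labels (a document is either relevant or not), document IDs as strings, and no duplicate IDs in the ranked list; graded relevance would need a different gain term in the NDCG computation. MRR is the mean of `reciprocal_rank` across all queries.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def mean(values: list[float]) -> float:
    """Average per-query scores; for reciprocal ranks this is MRR."""
    return sum(values) / len(values) if values else 0.0
```

With `retrieved = ["d3", "d1", "d7"]` and `relevant = {"d1", "d2"}`, recall@3 is 0.5 (one of two relevant documents found) and the reciprocal rank is 0.5 (first hit at rank 2).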
Regression testing & CI
Treat your eval set like a test suite. On every prompt change, model swap, or retriever tweak, run the harness, compare scores to baseline, and block merges that regress critical metrics. Frameworks like Promptfoo, DeepEval, and Braintrust make this a single CLI command.
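The "compare scores to baseline and block merges" step can be illustrated without any framework. The sketch below is not how Promptfoo, DeepEval, or Braintrust implement it; it assumes scores are plain dictionaries of metric name to value, that higher is better for every metric, and that `max_drop` (a hypothetical name) lists the critical metrics with the drop you are willing to tolerate for each.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: dict[str, float],
) -> list[str]:
    """Return one message per critical metric that regressed beyond its tolerance.

    Only metrics listed in max_drop are checked; higher scores are assumed better.
    """
    failures = []
    for metric, allowed in max_drop.items():
        if metric not in baseline or metric not in current:
            # A missing critical metric is treated as a failure, not skipped.
            failures.append(f"{metric}: missing from baseline or current scores")
            continue
        drop = baseline[metric] - current[metric]
        if drop > allowed:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} "
                f"(drop {drop:.3f} exceeds {allowed:.3f})"
            )
    return failures
```

In a CI job, a wrapper script would load both score files, call `find_regressions`, print the messages, and exit with a non-zero status when the list is non-empty so the merge is blocked. A non-zero tolerance is useful for judge-based metrics, which tend to vary slightly between runs.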
Production monitoring
Offline evals catch known failures. Production monitoring catches the unknown ones: drift, prompt-injection attempts, cost spikes, and quality regressions on live traffic. Sample live traces, judge them automatically, and alert on metric drops.
Tools & frameworks
Common picks: Promptfoo and DeepEval for offline evals, Ragas for RAG-specific metrics, Langfuse and LangSmith for tracing and online evals, Braintrust for full eval-ops, and Helicone for cost and latency observability.
Compare them side-by-side in the Tool Comparison Hub.