LLM Evaluation

The process of systematically measuring the quality, safety, and reliability of large language model outputs.

LLM evaluation is how teams decide whether a model (or a prompt, retrieval pipeline, or agent built on top of one) is good enough to ship, and whether it stays good once it's in production.

A modern eval stack combines reference-based metrics (BLEU, ROUGE, exact match), reference-free metrics (LLM-as-a-judge, embedding similarity), task-specific checks (faithfulness, citation accuracy, tool-call correctness), and human review on a small golden set.
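As a rough illustration of the reference-based end of that stack, the sketch below implements exact match plus a token-level F1 score in plain Python. Token F1 is used here as a simple stand-in for overlap metrics and is not BLEU or ROUGE; the function names and the normalization rules (lowercasing, stripping punctuation) are assumptions made for this example, not a standard.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split on whitespace (assumed rules)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict and suits short, closed-form answers; token F1 gives partial credit when the output contains the reference plus extra words. Neither says anything about faithfulness or safety, which is why the other layers of the stack exist.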

The goal is not a single score. The goal is a feedback loop: a labeled eval dataset, an automated harness, regression tracking on every change, and online monitoring once you ship.
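The first three parts of that loop can be sketched in a few lines of Python. Everything named here (`EvalCase`, `run_eval`, `check_regression`, the 0.02 tolerance, and the toy model) is hypothetical and chosen for illustration; a real harness would call an actual model and would likely use richer scoring than case-insensitive string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One labeled example: an input prompt and the expected answer."""
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output matches the label."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(
        model(case.prompt).strip().lower() == case.expected.strip().lower()
        for case in cases
    )
    return passed / len(cases)


def check_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the score is no more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


# Toy golden set and stand-in model, for illustration only.
GOLDEN_SET = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
]


def toy_model(prompt: str) -> str:
    """Hypothetical model: a lookup table instead of a real LLM call."""
    return {"2+2": "4", "capital of France": "paris"}.get(prompt, "")


if __name__ == "__main__":
    score = run_eval(toy_model, GOLDEN_SET)
    print(f"score={score:.2f} ok={check_regression(score, baseline=0.95)}")
```

Running `check_regression` on every prompt, model, or pipeline change is what turns a one-off score into regression tracking; online monitoring would apply the same kind of checks to sampled production traffic.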

Go deeper

Read the full pillar guide on LLM Evaluation or compare evaluation tools in the Tool Comparison Hub.