Evaluation

LLM-as-a-Judge

LLM-as-a-judge means using a capable LLM to score the outputs of another LLM against a rubric.

Judges make qualitative evaluation scalable. They can grade helpfulness, faithfulness, tone, instruction-following, or anything else you can describe in a rubric.

Best practices: ground the judge in the source context, require structured JSON output, calibrate its scores against human labels, and have it reason step by step (chain-of-thought) before it gives a verdict.
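The practices above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the prompt wording, the 1-5 score scale, and the names `build_prompt`, `parse_verdict`, `judge`, and `agreement_rate` are all assumptions made for the example, and `call_model` stands in for whatever client function sends a prompt to your judge model and returns its text.

```python
import json
from typing import Callable, Sequence

# Hypothetical prompt template: includes the source context (grounding) and
# asks for reasoning before the score (chain-of-thought before the verdict).
JUDGE_PROMPT = """You are grading an answer against a rubric.

Source context:
{context}

Question:
{question}

Answer to grade:
{answer}

Rubric:
{rubric}

First reason step by step, then give a verdict.
Respond with JSON only, with "reasoning" before "score":
{{"reasoning": "<your reasoning>", "score": <integer 1-5>}}"""


def build_prompt(context: str, question: str, answer: str, rubric: str) -> str:
    """Fill the template so the judge sees the source context it must grade against."""
    return JUDGE_PROMPT.format(
        context=context, question=question, answer=answer, rubric=rubric
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and reject scores outside the assumed 1-5 scale."""
    data = json.loads(raw)
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"reasoning": str(data.get("reasoning", "")), "score": score}


def judge(call_model: Callable[[str], str], context: str, question: str,
          answer: str, rubric: str) -> dict:
    """Run one judgment. `call_model` is a placeholder for your own model client."""
    return parse_verdict(call_model(build_prompt(context, question, answer, rubric)))


def agreement_rate(judge_scores: Sequence[int], human_scores: Sequence[int],
                   tolerance: int = 0) -> float:
    """Fraction of items where the judge is within `tolerance` of the human label."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)
```

Exact-match agreement is only one simple calibration check. On an ordinal scale, a chance-corrected statistic such as Cohen's kappa may be a better fit, depending on how the human labels were collected.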

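Known biases: position bias (favoring an answer because of where it appears in the prompt), verbosity bias (favoring longer answers), and self-preference (favoring outputs from the judge's own model). Mitigate with pairwise comparison, randomized order, and multiple judges.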
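One way to combine these mitigations is sketched below, under stated assumptions: each judge is a hypothetical callable that takes a question and two answers and returns "first", "second", or "tie", and the names `compare_once` and `panel_verdict` are invented for this example. The order in which the answers are shown is randomized for every judge, and the panel's verdict is a simple majority vote.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge signature:
# (question, first_answer, second_answer) -> "first" | "second" | "tie"
PairwiseJudge = Callable[[str, str, str], str]


def compare_once(judge: PairwiseJudge, question: str, answer_a: str,
                 answer_b: str, rng: random.Random) -> str:
    """Return 'A', 'B', or 'tie', showing the two answers in random order."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Map the positional verdict back to the original labels.
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def panel_verdict(judges: Sequence[PairwiseJudge], question: str, answer_a: str,
                  answer_b: str, seed: Optional[int] = None) -> str:
    """Majority vote across several judges; a split at the top counts as a tie."""
    if not judges:
        raise ValueError("need at least one judge")
    rng = random.Random(seed)
    votes = Counter(
        compare_once(j, question, answer_a, answer_b, rng) for j in judges
    )
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]
```

Randomizing the order does not remove position bias from any single call. It spreads the bias evenly across the two answers so that it tends to cancel out over many comparisons. A stricter variant runs each pair in both orders and keeps only verdicts that agree.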
