LLM-as-a-Judge
LLM-as-a-judge means using a capable LLM to score the outputs of another LLM against a rubric.
Judges scale qualitative evaluation: helpfulness, faithfulness, tone, instruction-following — anything you can describe in a rubric.
Best practices: ground the judge with the source context, require structured JSON output, calibrate the judge's scores against human labels on a sample of items, and have the judge reason step by step before it gives the verdict.
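These practices can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the rubric text, the 1-5 scale, the JSON field names (`reasoning`, `score`), and the helper names are all assumptions made for the example, and the call to the judge model itself is left out because it depends on the provider.

```python
import json

# Hypothetical rubric and scale, chosen only for this example.
RUBRIC = "Score faithfulness from 1 (contradicts the context) to 5 (fully supported by the context)."


def build_judge_prompt(context: str, question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Ground the judge in the source context and ask for reasoning before the score."""
    return (
        "You are grading an answer against a rubric.\n"
        f"Rubric: {rubric}\n\n"
        f"Source context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer to grade:\n{answer}\n\n"
        'Respond with JSON only: {"reasoning": "<step-by-step reasoning>", "score": <integer 1-5>}\n'
        "Write the reasoning first, then the score."
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply and reject anything that does not match the assumed schema."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    reasoning = data.get("reasoning")
    score = data.get("score")
    if not isinstance(reasoning, str):
        raise ValueError("missing or non-string reasoning")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {"reasoning": reasoning, "score": score}


def agreement_rate(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of items where judge and human gave the same score (a crude calibration check)."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)


# A canned reply stands in for the judge model's output.
raw_reply = '{"reasoning": "The answer restates the context.", "score": 4}'
print(parse_verdict(raw_reply)["score"])             # 4
print(agreement_rate([4, 3, 5, 2], [4, 2, 5, 2]))    # 0.75
```

Exact-match agreement is the simplest possible calibration measure. A chance-corrected statistic or a rank correlation is usually more informative when scores cluster around a few values.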
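Known biases: position bias (favoring an answer because of where it appears in the prompt), verbosity bias (favoring longer answers), and self-preference (favoring outputs that resemble the judge model's own). Mitigate with pairwise comparison, randomized answer order, and multiple judges.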
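The three mitigations can be combined: show each judge the pair in a random order, map its pick back to the original labels, and take a majority vote. The sketch below assumes a hypothetical judge interface (a callable that receives two answers and returns "first" or "second"). The stub judge `prefers_paris` stands in for a real model call and exists only to make the example runnable.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge interface: takes (first_answer, second_answer),
# returns the string "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_vote(
    answer_a: str,
    answer_b: str,
    judges: Sequence[Judge],
    rng: Optional[random.Random] = None,
) -> str:
    """Return "A", "B", or "tie" by majority vote over judges, with random presentation order."""
    if not judges:
        raise ValueError("need at least one judge")
    rng = rng or random.Random()
    votes: Counter = Counter()
    for judge in judges:
        # Flip a coin per judge so no answer is systematically shown first.
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        pick = judge(first, second)
        if pick not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {pick!r}")
        # Map the positional pick back to the original labels.
        picked_first = pick == "first"
        votes["B" if picked_first == swapped else "A"] += 1
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"


def prefers_paris(first: str, second: str) -> str:
    """Stub judge for the demo: picks whichever answer mentions Paris."""
    return "first" if "Paris" in first else "second"


result = pairwise_vote(
    "Paris is the capital of France.",
    "Lyon is the capital of France.",
    judges=[prefers_paris, prefers_paris, prefers_paris],
    rng=random.Random(0),
)
print(result)  # A
```

Because the stub judge decides on content rather than position, the result is the same for any random seed. A judge with position bias would instead see its votes split across the two labels as the order flips, which is how randomization keeps that bias from consistently favoring one answer.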