Metrics

Perplexity

A measure of how 'surprised' a language model is by a sequence of tokens. Lower is better.

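Perplexity is the exponential of the average per-token cross-entropy loss on a held-out corpus. It is useful for comparing base models on the same dataset, as long as they share a tokenizer, because per-token values are not comparable across different tokenizations.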
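The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes you already have the natural-log probability the model assigned to each held-out token, and the function name `perplexity` and the sample values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given per-token natural-log probabilities.

    Assumes each entry is log p(token | preceding context) as scored
    by the model on held-out text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Cross-entropy: average negative log-likelihood per token, in nats.
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is the exponential of that average.
    return math.exp(cross_entropy)


# Hypothetical example: the model gives every token probability 0.25,
# so it is as uncertain as a uniform choice among 4 options.
uniform_log_probs = [math.log(0.25)] * 4
print(perplexity(uniform_log_probs))  # approximately 4
```

One way to read the result: a perplexity of about 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens at each step.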

It does not measure task quality. A model can have low perplexity and still hallucinate or fail to follow instructions.
