Metrics

BLEU Score

An n-gram precision metric originally built for machine translation, comparing model output to one or more reference texts.

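BLEU (Bilingual Evaluation Understudy) measures how many word n-grams (sequences of 1 to 4 words) in the candidate also appear in the reference. The four n-gram precisions are combined as a geometric mean and multiplied by a brevity penalty, which lowers the score of outputs shorter than the reference.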
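To make the calculation concrete, here is a minimal sketch of sentence-level BLEU. It assumes whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing; the function names `ngram_counts` and `bleu` are illustrative and not taken from any library. Production tools add tokenization rules, smoothing, and corpus-level aggregation, so their numbers may differ from this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: whitespace tokens, uniform weights, no smoothing."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)

        # Clip each candidate n-gram count by the highest count seen in any
        # single reference, so repeating a matching word earns no extra credit.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)

        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0:
            # Without smoothing, one empty precision zeroes the geometric mean.
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))

    return bp * math.exp(sum(log_precisions) / max_n)
```

For example, `bleu("the cat sat on the", ["the cat sat on the mat"])` has perfect n-gram precision at every order, so its score comes entirely from the brevity penalty, exp(1 - 6/5).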

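Scores range from 0 to 1 (often reported as 0–100), and higher means more overlap with the reference. BLEU is fast and reproducible, but it rewards surface-level word overlap, so it penalizes valid paraphrases and cannot tell whether an output is semantically correct.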
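The surface-overlap limitation can be illustrated with BLEU's simplest component alone, clipped unigram precision. The sketch below is not full BLEU; the helper name `unigram_precision` and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision, the n=1 component of BLEU (whitespace tokens)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate word counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[token]) for token, count in cand.items())
    return overlap / total


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"   # same meaning, different words
wrong_meaning = "the dog sat on the mat"      # one word changed, different meaning

print(f"paraphrase:    {unigram_precision(paraphrase, reference):.2f}")
print(f"wrong meaning: {unigram_precision(wrong_meaning, reference):.2f}")
```

The paraphrase shares only "the" with the reference and scores 1/6, while the sentence about a dog scores 5/6 despite describing a different animal.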

Useful as a regression signal for translation and summarization; pair it with a semantic metric (BERTScore, LLM-as-a-judge) for modern LLM tasks.

Go deeper

Read the full pillar guide on LLM Evaluation or compare evaluation tools in the Tool Comparison Hub.