
The AI & LLM Glossary

Practitioner-grade definitions for AI evaluation, RAG, safety, and monitoring — written for engineers shipping real systems.

Evaluation

The process of systematically measuring the quality, safety, and reliability of large language model outputs.

Using a capable LLM to score outputs of another LLM against a rubric.
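A minimal sketch of the judge pattern described above, assuming a 1 to 5 accuracy rubric and a `Score: <n>` reply format. The names `RUBRIC`, `build_judge_prompt`, `parse_score`, `judge`, and `fake_model` are all hypothetical; `call_model` stands in for whatever client you use to reach the judge model.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; real rubrics are usually more detailed.
RUBRIC = "Score the answer from 1 (poor) to 5 (excellent) for factual accuracy."


def build_judge_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Assemble the prompt sent to the judge model."""
    return (
        f"{rubric}\n\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n\n"
        "Reply with 'Score: <n>' followed by a one-sentence justification."
    )


def parse_score(reply: str, lo: int = 1, hi: int = 5) -> Optional[int]:
    """Extract the integer score; return None if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if not match:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Optional[int]:
    """Run one judgment. `call_model` is any function mapping prompt -> reply."""
    return parse_score(call_model(build_judge_prompt(question, answer)))


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model so the sketch runs offline."""
    return "Score: 4. Mostly accurate but omits a caveat."


score = judge("What is the capital of France?", "Paris.", fake_model)
```

Returning `None` for unparseable or out-of-range replies keeps judge failures visible instead of silently folding them into the average.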

Measuring retrieval-augmented generation systems across both the retrieval and generation stages.
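For the retrieval stage, two commonly used measures are recall@k and reciprocal rank. The sketch below assumes you have ranked document IDs per query and a labeled set of relevant IDs; the function names and sample IDs are illustrative. The generation stage needs separate scoring (for example, grading whether the answer is supported by the retrieved context), which is not shown here.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def reciprocal_rank(retrieved: Sequence[str], relevant: Iterable[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


# Illustrative data: ranked results for one query, plus its labeled relevant docs.
retrieved_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}

r_at_2 = recall_at_k(retrieved_ids, relevant_ids, k=2)  # only d1 is in the top 2
rr = reciprocal_rank(retrieved_ids, relevant_ids)       # first hit is at rank 2
```

Averaging `reciprocal_rank` over a query set gives mean reciprocal rank (MRR).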

An LLM output that is fluent and confident but factually wrong or unsupported by the provided context.

Adversarial testing of AI systems to surface safety, security, and reliability failures before users do.

An attack where untrusted input overrides or manipulates the model's original instructions.

An n-gram precision metric originally built for machine translation, comparing model output to one or more reference texts.
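The definition above matches the metric commonly known as BLEU. Its core is clipped ("modified") n-gram precision, sketched below for a single n-gram order on pre-tokenized input. Full BLEU also combines several n-gram orders and applies a brevity penalty, both omitted here; the function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: Sequence[str],
                       references: Sequence[Sequence[str]],
                       n: int) -> float:
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the most times it appears in any single reference.
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each candidate count so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]
p1 = modified_precision(candidate, references, n=1)  # 2 clipped matches out of 7
```

Clipping is what stops a degenerate output like the one above from scoring perfect unigram precision.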

A recall-oriented family of metrics that measures overlap between generated text and reference summaries.
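The definition above matches the metric family commonly known as ROUGE. A minimal sketch of the ROUGE-N recall variant on pre-tokenized text follows; other variants (such as the longest-common-subsequence form) and multi-reference handling are not shown, and the function names are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """Share of reference n-grams that also appear in the candidate."""
    ref_counts = Counter(ngrams(reference, n))
    if not ref_counts:
        return 0.0
    cand_counts = Counter(ngrams(candidate, n))
    # Each reference n-gram is matched at most as often as the candidate has it.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

r1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 reference unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 reference bigrams matched
```

The denominator counts reference n-grams, which is what makes this recall-oriented; dividing by candidate n-grams instead would give precision.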

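A measure of how 'surprised' a language model is by a sequence: the exponential of the average per-token negative log-likelihood. Lower values mean the model assigned higher probability to the text.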
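This measure is commonly called perplexity, computed as the exponential of the average negative log-likelihood per token. The sketch below assumes you already have natural-log probabilities for each token from your model; the function name and sample values are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is 4.
uniform_logprobs = [math.log(0.25)] * 4
ppl = perplexity(uniform_logprobs)
```

Perplexity values are only comparable between models that share a tokenizer, since the average is taken per token.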

Degradation in model performance over time as production data diverges from training or evaluation data.
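One way to watch for the divergence described above is to compare binned feature distributions between a baseline window and a production window. The sketch below uses the population stability index (PSI), which is a choice made for this example rather than something the definition prescribes. It measures input shift only, so it is a proxy for performance degradation rather than a direct measure of it. The function name, the `eps` floor, and the sample proportions are assumptions.

```python
import math
from typing import Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               eps: float = 1e-6) -> float:
    """PSI between two sets of bin proportions (each should sum to about 1)."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Floor empty bins at eps so the log and the division stay defined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


# Illustrative bin proportions for one feature.
baseline = [0.25, 0.25, 0.25, 0.25]
production = [0.10, 0.20, 0.30, 0.40]

psi_same = population_stability_index(baseline, baseline)
psi_shifted = population_stability_index(baseline, production)
```

Every term in the sum is non-negative, so PSI is zero only when the two distributions match bin for bin and grows as they move apart.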