Safety

Prompt Injection

An attack where untrusted input overrides or manipulates the model's original instructions.

Direct injection: a user types 'Ignore previous instructions...'. Indirect injection: hostile content lives in a fetched webpage, PDF, or email the model reads.

Defenses: input/output guardrails, instruction hierarchies, tool-call allowlists, content provenance tagging, and adversarial evals in CI.

Go deeper

Read the full pillar guide on LLM Evaluation or compare evaluation tools in the Tool Comparison Hub.