The problem
A B2B SaaS support agent was silently looping on a failing tool, multiplying spend without alerting.
Root cause
No per-trajectory cost cap, missing circuit breaker on tool errors, no replay observability.
Approach
Built trajectory-level evals, added per-run cost SLOs, instrumented OpenTelemetry traces, introduced replay harness.
Framework used
LangSmith traces + custom trajectory evals + cost SLOs
Results
- $40K/month spend recovered
- p95 trajectory length cut 4×
- Mean time to debug: hours → minutes
Lessons learned
- Agents need SLOs, not just metrics
- Replay is non-negotiable
- Cost is a quality signal