Stopping a $40K/month agent cost runaway

Trajectory evals and tool-call SLOs caught a recursive-tool-call pattern burning compute in production.

The problem

A B2B SaaS support agent was silently retrying a failing tool in a loop, multiplying spend without triggering any alert.

Root cause

The agent had no per-trajectory cost cap, no circuit breaker on tool errors, and no replay observability.
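
To make the first two gaps concrete, here is a minimal sketch of a per-trajectory cost cap combined with a circuit breaker on tool errors. It is an illustration only, not the code used in this engagement: the class name `TrajectoryGuard`, the thresholds, and the idea that the caller supplies an estimated `cost_usd` per call are all assumptions.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push the trajectory past its cost cap."""


class CircuitOpen(RuntimeError):
    """Raised when a tool has failed too many times in a row."""


@dataclass
class TrajectoryGuard:
    # Both thresholds are hypothetical defaults; tune them per workload.
    max_cost_usd: float = 0.50
    max_consecutive_errors: int = 3
    spent_usd: float = 0.0
    errors: dict = field(default_factory=dict)  # tool name -> consecutive failures

    def call_tool(self, name, tool, *args, cost_usd, **kwargs):
        # Circuit breaker: refuse to call a tool that keeps failing.
        if self.errors.get(name, 0) >= self.max_consecutive_errors:
            raise CircuitOpen(f"{name} failed {self.errors[name]} times in a row")
        # Cost cap: refuse any call that would exceed the trajectory budget.
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cap of ${self.max_cost_usd:.2f} would be exceeded")
        self.spent_usd += cost_usd  # failed calls still cost money
        try:
            result = tool(*args, **kwargs)
        except Exception:
            self.errors[name] = self.errors.get(name, 0) + 1
            raise
        self.errors[name] = 0  # a success closes the breaker again
        return result
```

One guard instance would be created per agent run, so a loop on a failing tool stops after a few attempts instead of continuing until someone notices the bill.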

Approach

We built trajectory-level evals, added per-run cost SLOs, instrumented the agent with OpenTelemetry traces, and introduced a replay harness.
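
As a rough illustration of what a trajectory-level eval with a per-run cost SLO can look like, the sketch below scores a whole run rather than a single response. The `ToolCall` record, the function names, and the thresholds are hypothetical; in practice the calls would be extracted from recorded traces.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict
    ok: bool
    cost_usd: float


def longest_repeat_run(calls):
    """Length of the longest run of consecutive identical (name, args) calls."""
    longest = run = 0
    prev = None
    for call in calls:
        key = (call.name, json.dumps(call.args, sort_keys=True))
        run = run + 1 if key == prev else 1
        longest = max(longest, run)
        prev = key
    return longest


def evaluate_trajectory(calls, max_repeat=3, max_cost_usd=0.50):
    """Score one run: total cost, loop detection, and cost SLO compliance."""
    total = sum(c.cost_usd for c in calls)
    repeat = longest_repeat_run(calls)
    return {
        "steps": len(calls),
        "total_cost_usd": total,
        "longest_repeat_run": repeat,
        "loop_detected": repeat > max_repeat,  # same call repeated back to back
        "cost_slo_met": total <= max_cost_usd,
    }
```

A run that issues the same failing call five times in a row would be flagged by `loop_detected` even if every individual step looked unremarkable, which is the kind of pattern a per-step metric tends to miss.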

Framework used

LangSmith traces + custom trajectory evals + cost SLOs

Results

  • $40K/month spend recovered
  • p95 trajectory length cut 4×
  • Mean time to debug: hours → minutes

Lessons learned

  • Agents need SLOs, not just metrics
  • Replay is non-negotiable
  • Cost is a quality signal
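
To illustrate the replay lesson, here is a minimal sketch of a record-and-replay wrapper around an agent's tools. It is an assumption-laden example, not the harness from this project: `RecordingTools` and `ReplayTools` are hypothetical names, and tools are assumed to be plain callables taking keyword arguments.

```python
class RecordingTools:
    """Wraps real tools and logs every call, including failures."""

    def __init__(self, tools):
        self._tools = tools  # mapping of tool name -> callable
        self.log = []

    def call(self, name, **args):
        entry = {"name": name, "args": args, "result": None, "error": None}
        self.log.append(entry)
        try:
            entry["result"] = self._tools[name](**args)
        except Exception as exc:
            entry["error"] = repr(exc)  # keep the failure so replay can reproduce it
            raise
        return entry["result"]


class ReplayTools:
    """Serves recorded results in order and fails loudly if the agent diverges."""

    def __init__(self, log):
        self._log = list(log)
        self._pos = 0

    def call(self, name, **args):
        if self._pos >= len(self._log):
            raise AssertionError("agent made more calls than were recorded")
        entry = self._log[self._pos]
        if (entry["name"], entry["args"]) != (name, args):
            raise AssertionError(f"diverged at step {self._pos}: got {name}({args})")
        self._pos += 1
        if entry["error"] is not None:
            raise RuntimeError(entry["error"])  # replay the recorded failure
        return entry["result"]
```

Running the agent against `ReplayTools` with a log captured in production reproduces the same tool results and errors without touching live systems, which is what makes a looping trajectory debuggable offline.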

Facing a similar problem?

Book an assessment with our team.