The problem
A customer-facing financial Q&A bot generated confidently wrong citations during pre-launch user acceptance testing (UAT).
Root cause
Single-pass dense retrieval with weak chunking returned topically similar but legally incorrect passages.
Approach
Instrumented retrieval, reranking, and generation independently. Built a 400-question golden set with retrieval ground truth.
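A golden set with retrieval ground truth lets retrieval be scored without involving the generator. The sketch below is one minimal way to do that, not the project's actual harness: the `GoldenItem` schema, the passage IDs, and the choice of recall@k and reciprocal rank as metrics are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenItem:
    """One golden-set entry (hypothetical schema, not from the source)."""
    question: str
    relevant_ids: frozenset[str]  # passage IDs a correct answer must cite


def recall_at_k(retrieved: list[str], relevant: frozenset[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: frozenset[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, passage_id in enumerate(retrieved, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_recall_at_k(
    golden: list[GoldenItem], runs: dict[str, list[str]], k: int
) -> float:
    """Average recall@k over the golden set.

    `runs` maps each question to the ranked passage IDs the retriever returned;
    a question with no recorded run scores 0.0.
    """
    if not golden:
        return 0.0
    scores = [recall_at_k(runs.get(g.question, []), g.relevant_ids, k) for g in golden]
    return sum(scores) / len(scores)
```

Running the same functions on the retriever's output and on the reranker's output gives a per-stage comparison on identical questions.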
Framework used
RAGAS + custom faithfulness judge + Promptfoo CI gating
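The CI gating idea can be sketched independently of any tool: compare aggregate eval scores against thresholds and fail the build on a regression. The code below is a generic illustration and does not use the Promptfoo or RAGAS APIs; the metric names and threshold values are hypothetical, since the source does not state the gates that were used.

```python
from __future__ import annotations

# Hypothetical thresholds for illustration only; the source does not state them.
# "min" means the score must be at least the limit, "max" at most the limit.
THRESHOLDS: dict[str, tuple[str, float]] = {
    "citation_accuracy": ("min", 0.95),
    "hallucination_rate": ("max", 0.05),
}


def check_gate(
    metrics: dict[str, float],
    thresholds: dict[str, tuple[str, float]] = THRESHOLDS,
) -> list[str]:
    """Return one human-readable message per violated or missing metric."""
    failures: list[str] = []
    for name, (kind, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval output")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value:.3f} is below minimum {limit:.3f}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value:.3f} is above maximum {limit:.3f}")
    return failures


def gate_exit_code(metrics: dict[str, float]) -> int:
    """Print failures and return a process exit code (0 = pass, 1 = block merge)."""
    failures = check_gate(metrics)
    for message in failures:
        print(f"GATE FAIL {message}")
    return 1 if failures else 0
```

In a pipeline, a wrapper script would load the eval run's scores, call `gate_exit_code`, and pass the result to `sys.exit` so that a failing gate blocks the merge.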
Results
- Hallucination rate 23% → 3%
- Citation accuracy 71% → 96%
- Launch unblocked in 6 weeks
Lessons learned
- Layer the eval (retrieval, reranking, generation) before tuning the model
- Reranking pays for itself
- Faithfulness needs a domain-tuned judge
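One way to read the first lesson is that every failed question should be attributed to the earliest stage that broke, so tuning effort goes where the errors originate. The sketch below assumes a hypothetical `Trace` record per question; the field names and the rule "a stage passes if at least one gold passage survives it" are illustrative choices, not the project's actual definitions.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Trace:
    """Per-question record captured by instrumentation (hypothetical schema)."""
    gold_ids: set[str]         # ground-truth passage IDs from the golden set
    retrieved_ids: list[str]   # candidate pool returned by the retriever
    reranked_ids: list[str]    # passages the reranker passed to the generator
    answer_faithful: bool      # verdict from a faithfulness judge


def failing_stage(trace: Trace) -> str:
    """Name the earliest stage that lost the gold evidence, or 'ok'."""
    if not trace.gold_ids & set(trace.retrieved_ids):
        return "retrieval"
    if not trace.gold_ids & set(trace.reranked_ids):
        return "reranking"
    if not trace.answer_faithful:
        return "generation"
    return "ok"


def stage_breakdown(traces: list[Trace]) -> Counter:
    """Count questions by the stage that failed first."""
    return Counter(failing_stage(t) for t in traces)
```

A breakdown dominated by "retrieval" points at chunking or the retriever, while one dominated by "generation" points at prompting or the model.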