RELIABILITY
Reliability Playbook for Production Teams
Operational defaults that keep latency predictable and spend under control at scale.
- Instrument every call with request IDs and model tags for traceability.
- Define SLOs for P95 latency and error budgets per product surface.
- Run canary traffic on new routes before shifting the production traffic mix.
- Maintain circuit breakers with bounded retries and jittered backoff.
- Review cost variance weekly and check it against anomalies in token throughput.
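
As an illustration of the circuit-breaker item above, here is a minimal Python sketch of bounded retries with jittered backoff behind a simple breaker. It is one possible shape under stated assumptions, not a reference implementation: the names `CircuitBreaker`, `backoff_delay`, and `call_with_retries`, and all thresholds and delays, are hypothetical and should be tuned per product surface.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, then rejects
    calls until `reset_after` seconds have passed (illustrative defaults)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, breaker, max_attempts=4, sleep=time.sleep):
    """Call `fn` at most `max_attempts` times, honoring the breaker."""
    last_error = None
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; call rejected")
        try:
            result = fn()
        except Exception as exc:  # in practice, catch only retryable errors
            breaker.record_failure()
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    raise last_error
```

The retry count is bounded by `max_attempts`, the jitter spreads retries so clients do not retry in lockstep, and the breaker stops further attempts once consecutive failures reach the threshold. Passing `clock` and `sleep` as parameters keeps the sketch testable without real waiting.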