RELIABILITY

Reliability Playbook for Production Teams

Operational defaults that keep latency predictable and spend under control at scale.

Instrument every call with request IDs and model tags for traceability.
Define a P95 latency SLO and an error budget for each product surface.
Run canary traffic on new routes before shifting production mix.
Maintain circuit breakers with bounded retries and jittered backoff.
Review cost variance weekly and check it against anomalies in token throughput.
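
As one illustration of the circuit-breaker item above, here is a minimal Python sketch of bounded retries with jittered exponential backoff behind a simple breaker. Every name in it (CircuitBreaker, CircuitOpenError, backoff_delay, call_with_retries) and every default (three attempts, 0.5 s base delay, 8 s cap, 30 s cool-down) is an assumption chosen for illustration, not a prescribed setting. The jitter shown is the "full jitter" variant, one common choice among several.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds (assumed default)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let a trial call through (half-open).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Opens the breaker, or re-opens it after a failed trial call.
            self.opened_at = self.clock()


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, breaker, max_attempts=3, sleep=time.sleep):
    """Call fn at most max_attempts times (assumed >= 1), backing off between tries."""
    last_error = None
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = fn()
        except Exception as exc:  # a real client would catch only retryable errors
            breaker.record_failure()
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    raise last_error
```

A caller would share one breaker per downstream route, for example call_with_retries(lambda: client.send(request), breaker), where client.send stands in for whatever call is being protected. The retry bound keeps worst-case latency finite, and the jitter spreads retries out so that clients recovering from the same outage do not all retry at the same moment.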