Most production incidents are detectable before users report them. A basic monitoring stack gives you that lead time.
Metrics you should track
- Gateway uptime and restart count.
- Channel-specific error rate.
- Provider latency and timeout rate.
- Pairing/auth failures by origin.
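
As a rough illustration of the channel-specific error rate, here is a minimal Python sketch that counts outcomes per channel. The class name `ChannelErrorRate` and the channel names are hypothetical; a real deployment would more likely export these counts to an existing metrics system than keep them in process memory.

```python
from collections import Counter


class ChannelErrorRate:
    """Tracks the fraction of failed requests per channel (illustrative only)."""

    def __init__(self):
        self._total = Counter()
        self._errors = Counter()

    def record(self, channel, ok):
        """Record one request outcome for `channel`."""
        self._total[channel] += 1
        if not ok:
            self._errors[channel] += 1

    def rate(self, channel):
        """Return the error rate in [0.0, 1.0]; 0.0 if nothing was recorded."""
        total = self._total[channel]
        return self._errors[channel] / total if total else 0.0


# Hypothetical channel name, used only for the example.
rates = ChannelErrorRate()
for ok in (True, True, True, False):
    rates.record("chat", ok)
print(rates.rate("chat"))
```

Keeping the counts per channel rather than as one global number is what lets you tell a single broken channel apart from a gateway-wide problem.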
Alerts to configure first
- More than 3 restarts in 15 minutes.
- Health endpoint failing for 2 consecutive checks.
- Sudden increase in connections closed as unauthorized (WebSocket close code 1008).
- Provider key/auth failures above a threshold set from your normal baseline.
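
The first two alert rules above can be sketched as small state machines. This is a minimal Python sketch, not a drop-in for any particular alerting tool; the class names `RestartAlert` and `HealthAlert` are hypothetical, and timestamps are assumed to be seconds supplied by the caller.

```python
from collections import deque


class RestartAlert:
    """Fires when more than `limit` restarts fall inside a sliding window."""

    def __init__(self, limit=3, window_seconds=15 * 60):
        self.limit = limit
        self.window_seconds = window_seconds
        self._restarts = deque()

    def record_restart(self, timestamp):
        """Record a restart at `timestamp` (seconds); return True if the alert fires."""
        self._restarts.append(timestamp)
        # Drop restarts that have aged out of the window.
        while self._restarts and timestamp - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        return len(self._restarts) > self.limit


class HealthAlert:
    """Fires once the health check has failed `threshold` times in a row."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self._consecutive_failures = 0

    def record_check(self, healthy):
        """Record one check result; return True if the alert fires."""
        if healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
        return self._consecutive_failures >= self.threshold
```

Requiring consecutive failures before alerting means a single dropped health check does not page anyone, which keeps the alert quiet unless the failure persists.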
Log strategy
- Keep structured logs with request IDs.
- Separate channel logs from core gateway logs.
- Store enough history for root-cause analysis.
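
One way to get structured logs with request IDs, and to keep channel logs apart from core gateway logs, is sketched below using Python's standard `logging` module. The logger names (`gateway.core`, `gateway.channel.example`) and the JSON field names are assumptions made for the example, and in-memory streams stand in for whatever separate files or sinks you actually use.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            # request_id is attached by the caller via `extra`.
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger with its own handler so its output goes to its own sink."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# In-memory streams stand in for separate log files or sinks.
core_stream, channel_stream = io.StringIO(), io.StringIO()
core_log = build_logger("gateway.core", core_stream)
channel_log = build_logger("gateway.channel.example", channel_stream)

# The same request ID appears in both logs, so entries can be correlated.
request_id = str(uuid.uuid4())
core_log.info("request accepted", extra={"request_id": request_id})
channel_log.warning("delivery failed", extra={"request_id": request_id})
```

Because both records carry the same `request_id`, you can follow one request across the core and channel logs even though they are stored separately.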
Incident workflow
- Confirm blast radius (single channel vs global).
- Roll back recent config changes.
- Re-validate auth tokens and origins.
- Re-run health checks and smoke tests.
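
The first step of the workflow, confirming blast radius, can be roughly automated if you already have per-channel error rates. The sketch below is one possible classification in Python; the function name `blast_radius`, the 5% threshold, and the channel names are all hypothetical and should be replaced with values that fit your own traffic.

```python
def blast_radius(error_rates, threshold=0.05):
    """Classify an incident from per-channel error rates (0.0 to 1.0).

    Returns a (label, failing_channels) tuple. The threshold is an
    assumed example value, not a recommendation.
    """
    failing = sorted(name for name, rate in error_rates.items() if rate > threshold)
    if not failing:
        return "none", failing
    if len(failing) == 1:
        return "single-channel", failing
    if len(failing) == len(error_rates):
        return "global", failing
    return "partial", failing


# Hypothetical channels and rates, used only for the example.
label, channels = blast_radius({"chat": 0.40, "email": 0.01, "sms": 0.00})
print(label, channels)
```

A single failing channel usually points at that channel's configuration or credentials, while a global result points back at the gateway itself or a recent shared config change.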
Good monitoring does not need to be complex. It needs to be specific, noisy only when required, and tied to clear recovery actions.