Have you ever had an outage and how did you respond to it?
Yes. We have had two outages in the past, described below:
Outage 1
- Cause: Our RabbitMQ queueing service was down. This caused our API endpoints to fail.
- Downtime: under 30 minutes.
- Immediate action: A hot-standby RabbitMQ service was spawned and all traffic was redirected to it.
- Long term action: A highly available RabbitMQ service was deployed.
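
The redirect-to-standby step described above can be illustrated on the client side with a small sketch. This is not our production code, only a minimal illustration under stated assumptions: the broker hostnames are hypothetical, and `fake_connect` stands in for whatever AMQP client library would actually open the connection.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], T],
) -> Tuple[str, T]:
    """Try each broker endpoint in order; return the first one that connects."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return endpoint, connect(endpoint)
        except OSError as exc:  # covers refused connections and timeouts
            errors.append(f"{endpoint}: {exc}")
    # Every endpoint failed, so surface all of the collected errors at once.
    raise ConnectionError("all brokers unreachable: " + "; ".join(errors))


# Hypothetical endpoints: primary first, hot standby second.
BROKERS = [
    "amqp://primary.example.internal:5672",
    "amqp://standby.example.internal:5672",
]


def fake_connect(endpoint: str) -> str:
    """Stand-in for a real client connection; simulates the primary being down."""
    if endpoint == BROKERS[0]:
        raise ConnectionRefusedError("primary is down")
    return f"connection to {endpoint}"


used, conn = connect_with_failover(BROKERS, fake_connect)
print(used)
```

With the primary simulated as down, the call falls through to the standby endpoint. A highly available deployment moves this concern into the broker tier itself, so clients are less dependent on a manual redirect.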
Outage 2
- Cause: Our internal admin services misbehaved and choked client API endpoints.
- Downtime: under 30 minutes.
- Immediate action: Admin services were temporarily disabled and client APIs were brought back up.
- Long term action: Both services were running on the same application servers, which made those servers a single point of failure. Admin services were isolated (decoupled) from the client API servers to avoid such scenarios in the future.

Our team is alerted instantly in such scenarios, and we have been able to resolve such problems fairly quickly (under 30 minutes). As highlighted above, we follow up such outages with a detailed post-mortem and put systems and checks in place to avoid them in the future.