Google Load Balancer Outage
From - Tuesday 17th July 2018, 20:14 UTC
Until - Tuesday 17th July 2018, 20:52 UTC
Tuesday 17th July 2018 (all times are UTC)
- 18:58 - Live 500 errors alert triggered
- 19:04 - 500 errors alert recovered
- 20:15 - Requests passing through load balancers to our APIs nosedive
- 20:18 - Live 500 errors alert triggered
- 20:18 - On-call engineer is paged; Slack alerts summon other engineers as well
- 20:18 - Google’s status page is initially reporting only an earlier issue with Stackdriver
- 20:18 - Getting 502s from Grafana, Prometheus, the Ravelin API & the Ravelin dashboard
- 20:19 - Pingdom alert arrives: api.ravelin.com is DOWN
- 20:20 - We can still SSH into the machines and they are running fine - just no one can talk to them
- 20:39 - Update statuspage.io with details of the outage
- 20:40 - Notify clients via Slack support channels as well
- 20:44 - Google acknowledges global outages of Stackdriver / App Engine / Cloud Networking
- 20:44 - Pingdom alert arrives: api.ravelin.com is UP
- 20:52 - Connectivity resumes and network traffic increases again
- 21:06 - Update statuspage.io that the issue is resolved and being monitored
- 21:06 - Notify clients via Slack again
- 21:42 - Update statuspage.io that everything is normal again
We are unable to measure the amount of data loss. The API was almost totally unavailable, though a small portion of traffic was still passing through to the platform.
The root cause of the unavailability of the Ravelin API and dashboard was a global outage of the Google load balancers in front of our GCP Compute Engine machines.
We are currently awaiting Google’s own post-mortem on this incident to help us understand how we might mitigate this type of failure in the future.
- Reduce the time between the initial alerts and updating the status page / notifying clients
- Investigate how we can make our platform more resilient to similar load balancer failures in the future (pending Google’s own post-mortem)
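During the incident, Pingdom’s external checks were the clearest signal that api.ravelin.com was unreachable even though the machines behind the load balancer were healthy. As an illustration of what such an outside-in probe does, here is a minimal sketch in Python (the timeout value and the status-code classification are assumptions for illustration, not Pingdom’s actual behaviour):

```python
# Minimal external HTTP health probe (illustrative sketch only).
import socket
import urllib.error
import urllib.request

def classify(status_code):
    # Treat 2xx/3xx as healthy; 4xx/5xx (e.g. the 500s and 502s
    # seen in this incident) as unhealthy. Thresholds are assumed.
    return "UP" if 200 <= status_code < 400 else "DOWN"

def probe(url, timeout=5):
    """Run a single HTTP check; return (state, status_code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status), resp.status
    except urllib.error.HTTPError as e:
        # Error statuses (such as a load balancer's 502) arrive here.
        return classify(e.code), e.code
    except (urllib.error.URLError, socket.timeout):
        # Connection-level failure: the machines may be running fine,
        # but nothing outside can reach them.
        return "DOWN", None
```

Run from outside GCP, a probe like this distinguishes “the API is down” from “the network path to the API is down”, which is exactly the distinction that mattered here: SSH into the machines worked while the load balancer returned 502s.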