Network connectivity
Incident Report for Ravelin

Google Load Balancer Outage

From - Tuesday 17th July 2018, 20:14 UTC
Until - Tuesday 17th July 2018, 20:52 UTC


Tuesday 17th July 2018 (all times are UTC)
  • 18:58 - Live 500s error alert triggered
  • 19:04 - 500s alert recovered
  • 20:15 - Requests passing through load balancers to our APIs nosedive
  • 20:18 - Live 500s error alert triggered
  • 20:18 - On-call engineer gets paged; Slack alerts summon other engineers as well
  • 20:18 - Initially Google status page is only reporting an earlier issue with Stackdriver metrics
  • 20:18 - Getting 502s from Grafana, Prometheus, Ravelin API & Ravelin dashboard
  • 20:19 - Pingdom DOWN alert arrives
  • 20:20 - Can still SSH into machines and they are running fine - just no one can talk to them
  • 20:39 - Status page updated with details of outage
  • 20:40 - Also notified clients via Slack support channels
  • 20:44 - Google acknowledge Stackdriver / AppEngine / Cloud Networking global outages
  • 20:44 - Pingdom UP alert arrives
  • 20:52 - Connectivity is resumed and network traffic increases again
  • 21:06 - Status page updated: issue is resolved and being monitored
  • 21:06 - Notify clients via Slack again
  • 21:42 - Status page updated: everything is normal again

Data Loss

We are unable to measure the amount of data loss. The API was almost totally unavailable, with only a small portion of traffic still passing through to the platform.

Root Cause

The root cause of the unavailability of the Ravelin APIs and dashboard was a global outage of Google’s load balancer in front of our GCP Compute machines. We are currently awaiting Google’s own post mortem on this incident to help us understand how we can mitigate this type of failure in the future.

Action Items

  • Reduce the time between initial alerts and updating the status page & notifying clients
  • Investigate how we can make our platform more resilient to similar load balancer failures in the future (pending Google’s own post mortem)
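
For reference, the gap that the first action item targets can be worked out from the timeline. The snippet below is only an illustrative sketch: the helper names (`at`, `minutes_between`) are hypothetical, and the timestamps are copied by hand from this report (all UTC, 17th July 2018), not pulled from any monitoring system.

```python
from datetime import datetime

# Assumption: every timestamp falls on the same day, as in the timeline above.
DAY = "2018-07-17"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timeline entries."""
    return int((at(end) - at(start)).total_seconds() // 60)


# 20:18 alert triggered, 20:39 status page update, 20:40 Slack notification.
alert_to_status_update = minutes_between("20:18", "20:39")
alert_to_client_notice = minutes_between("20:18", "20:40")

# Outage window as stated at the top of the report (20:14 to 20:52).
outage_duration = minutes_between("20:14", "20:52")

print(f"Alert to status page update: {alert_to_status_update} min")
print(f"Alert to client notification: {alert_to_client_notice} min")
print(f"Outage duration: {outage_duration} min")
```

By this reckoning, roughly 21 minutes passed between the second 500s alert and the first status page update, out of an outage window of about 38 minutes.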
Posted Jul 18, 2018 - 14:24 BST

This incident has been resolved
Posted Jul 17, 2018 - 21:44 BST
We are continuing to monitor for any further issues.
Posted Jul 17, 2018 - 21:06 BST
Our network connectivity issues appear to have been resolved by our cloud provider. We are continuing to monitor the platform carefully.
Posted Jul 17, 2018 - 20:59 BST
We are continuing to investigate this issue.
Posted Jul 17, 2018 - 20:44 BST
Clients are currently experiencing connectivity issues with our APIs and Dashboard. Our engineers are currently working on a resolution to this issue. Please check here for further updates.
Posted Jul 17, 2018 - 20:39 BST
This incident affected: API and Dashboard.