Network connectivity
Incident Report for Ravelin
Postmortem

Google Load Balancer Outage

From - Tuesday 17th July 2018, 20:14 UTC
Until - Tuesday 17th July 2018, 20:52 UTC

Timeline

Tuesday 17th July 2018 (all times are UTC)
  • 18:58 - Triggered live 500s errors alert
  • 19:04 - 500s alert recovered
  • 20:15 - Requests passing through load balancers to our APIs nosedive
  • 20:18 - Triggered live 500s errors alert
  • 20:18 - On-call engineer gets paged; Slack alerts summon other engineers as well
  • 20:18 - Google status page initially reports only an earlier issue with Stackdriver metrics
  • 20:18 - Getting 502s from Grafana, Prometheus, Ravelin API & Ravelin dashboard
  • 20:19 - Pingdom alert arrives: api.ravelin.com is DOWN
  • 20:20 - Can still SSH into machines and they are running fine - just no one can talk to them
  • 20:39 - Updated statuspage.io with details of outage
  • 20:40 - Also notified clients via Slack support channels
  • 20:44 - Google acknowledge Stackdriver / AppEngine / Cloud Networking global outages
  • 20:44 - Pingdom alert arrives: api.ravelin.com is UP
  • 20:52 - Connectivity is resumed and network traffic increases again
  • 21:06 - Update statuspage.io that issue is resolved and being monitored
  • 21:06 - Notify clients via Slack again
  • 21:42 - Update statuspage.io that everything is normal again.
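
The response intervals implied by the timeline can be worked out directly. The following is a minimal sketch that uses only timestamps listed above; the helper names (`at`, `minutes_between`) and the interval labels are illustrative assumptions, not part of any Ravelin tooling.

```python
from datetime import datetime

# All timestamps are copied from the timeline above (UTC, 17 July 2018).
DAY = "2018-07-17"


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two HH:MM timeline entries."""
    return int((at(end) - at(start)).total_seconds() // 60)


# Labels are descriptive only; the times come from the report itself.
intervals = {
    "outage (reported From/Until window)": minutes_between("20:14", "20:52"),
    "traffic drop to 500s alert": minutes_between("20:15", "20:18"),
    "500s alert to status page update": minutes_between("20:18", "20:39"),
    "500s alert to client notification": minutes_between("20:18", "20:40"),
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes} min")
```

On these figures, the gap between the 20:18 alert and the 20:39 status page update is the interval the first action item below aims to shorten.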

Data Loss

We are unable to measure the amount of data loss. The API was almost totally unavailable, although a small portion of traffic was still passing through to the platform.

Root Cause

The Ravelin APIs and dashboard were unavailable because of a global outage of Google’s load balancer in front of our GCP Compute machines (https://status.cloud.google.com/incident/cloud-networking/18012). We are currently awaiting Google’s own postmortem on this incident to help us understand how we can mitigate this type of failure in the future.

Action Items

  • Reduce the time between the initial alert and updating the status page & notifying clients
  • Investigate how we can make our platform more resilient to similar load balancer failures in the future (pending Google’s own postmortem)
Posted 9 months ago. Jul 18, 2018 - 14:24 BST

Resolved
This incident has been resolved
Posted 9 months ago. Jul 17, 2018 - 21:44 BST
Update
We are continuing to monitor for any further issues.
Posted 9 months ago. Jul 17, 2018 - 21:06 BST
Monitoring
Our network connectivity issues appear to have been resolved by our cloud provider. We are continuing to monitor the platform carefully.
Posted 9 months ago. Jul 17, 2018 - 20:59 BST
Update
We are continuing to investigate this issue.
Posted 9 months ago. Jul 17, 2018 - 20:44 BST
Investigating
Clients are currently experiencing connectivity issues with our APIs and Dashboard. Our engineers are currently working on a resolution to this issue. Please check here for further updates.
Posted 9 months ago. Jul 17, 2018 - 20:39 BST
This incident affected: API, API Vault, and Dashboard.