Elevated Error Rates

Resolved

There has been no indication of further issues since 8th October.

Tue, Oct 13, 2020, 09:33 AM

Affected components

No components marked as affected

Updates

Monitoring

The same as yesterday: 0540 on Thursday 8th October remains the last time that we exceeded 0.3% erroring requests per minute, when we peaked at 0.7%. We have had a further update indicating that our support ticket is being looked into, and in the meantime we have continued digging through our own logs and metrics in the hope of finding something amiss that would let us bring closure to the intermittent timeouts we experienced.

As we are yet to hear back from Google on our support ticket, we will keep this incident as Monitoring and post an update around the same time tomorrow, or sooner if more information becomes available.

Sat, Oct 10, 2020, 05:57 PM (2 days earlier)

Monitoring

0540 on Thursday 8th October remains the last time that we exceeded 0.3% erroring requests per minute, when we peaked at 0.7%. We still haven't heard back from Google on our support ticket since our last update. With a day and a half having passed, it seems justifiable to say that our APIs are all operational, but we will keep this incident open until we have something more conclusive to say.

Fri, Oct 9, 2020, 05:36 PM (1 day earlier)

Investigating

Overnight the on-call engineer was not paged, and while we saw a very small trickle of 500s and timeouts, things continued functioning. At 0630 and 1030 BST there were elevated response times, but the resulting uptick in 500s was very minor: our optimistic retries and failover to our secondary cluster handled those as they usually would.

The last time that the error rate exceeded 0.3% was at 0540 on Thursday 8th October, when we peaked at 0.7% over one minute.

There are still a couple of outstanding items on our to-do list as we continue to monitor these timeouts, and Google have updated our support ticket to let us know they have elevated the incident again and are continuing to investigate on their side. We're going to keep this issue open as Investigating to reflect this.
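As an illustration of the retry-then-fail-over behaviour mentioned in this update, here is a minimal sketch. It is not our production code: the function name `call_with_failover`, the `attempts` parameter and the use of `TimeoutError` are assumptions made for the example only.

```python
def call_with_failover(primary, secondary, attempts=2):
    """Try `primary` up to `attempts` times, then fall back to `secondary`.

    `primary` and `secondary` are hypothetical zero-argument callables that
    perform the same request against two different clusters.
    """
    for _ in range(attempts):
        try:
            return primary()
        except TimeoutError:
            # Assumption: only timeouts are retried; other errors propagate.
            continue
    # Every attempt against the primary timed out, so use the secondary.
    return secondary()
```

With this shape, a short burst of timeouts on the primary costs extra latency rather than a 500, which matches the elevated response times with only a minor uptick in errors described above.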

Fri, Oct 9, 2020, 09:56 AM (7 hours earlier)

Investigating

The last two times we marked this incident as resolved feel slightly rushed in hindsight. We're going to keep the incident open until we hear back from Google, and until we've had a longer period of time without recurrence. Overnight, the on-call engineer will be paged if the rate of 500s exceeds 2% for 5 minutes. Otherwise we will provide our next update in the morning, once we can assess what happened overnight. In the meantime we'll move this incident to Monitoring so you can see when we start digging into it again.

Please do continue to send us your observations of timeouts and error rates: this is all useful information in monitoring this issue.
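For readers who want to apply a similar rule to their own monitoring, the overnight paging condition above (500s above 2% for 5 minutes) can be sketched as follows. This assumes the rule means that each of the five most recent one-minute buckets is above the threshold; the function name `should_page` and the input shape are hypothetical.

```python
def should_page(minutes, threshold=0.02, sustained=5):
    """Decide whether to page, given per-minute request counts.

    `minutes` is a list of (total_requests, error_responses) tuples,
    oldest first, one tuple per minute.
    """
    if len(minutes) < sustained:
        return False
    for total, errors in minutes[-sustained:]:
        # A minute with no traffic, or at or below the threshold,
        # breaks the sustained run.
        if total == 0 or errors / total <= threshold:
            return False
    return True
```

Under this reading, a one-minute peak of 0.7% like the one at 0540 would not page anyone, while five consecutive minutes at 3% would.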

Thu, Oct 8, 2020, 05:23 PM (16 hours earlier)

Investigating

New request-routing logic was deployed at 1:11pm BST to steer traffic away from bad services should the incident recur. The next change will add more diagnostics around the gRPC connection lifecycles. Affected components have been updated to include pci.ravelin.com and the Ravelin Dashboard, as BigTable slow-downs can affect these too.

Thu, Oct 8, 2020, 12:32 PM (4 hours earlier)

Investigating

This is a continuation of the occasional bursts of 500s being returned by our API: https://status.ravelin.com/incidents/pkr07psskgf2. Since our last update we have observed bursts of 500s returned from the API at a rate of 0.4% per minute at 2245 BST last night, 0.55-0.76% per minute between 0545 and 055 BST this morning, and 0.035% at 1130 BST.

These 500s are the result of timeouts on our connection to Google BigTable, which we continue to monitor. We are working with Google to investigate the matter further. In the meantime, we are adapting our internal request-routing logic to better identify which services are experiencing elevated error rates so that we can steer traffic to healthy services. We expect this to mitigate the impact of these timeouts while we investigate.
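To give a rough idea of what steering traffic by error rate can look like, here is a minimal sketch. It is an illustration only, not our routing implementation: the `Backend` class, the `pick_backend` function, the 100-request window and the 5% cut-off are all assumptions chosen for the example.

```python
from collections import deque


class Backend:
    """Tracks the outcome of the most recent requests sent to one service."""

    def __init__(self, name, window=100):
        self.name = name
        # Each entry is True if that request failed, False otherwise.
        self.outcomes = deque(maxlen=window)

    def record(self, failed):
        self.outcomes.append(failed)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


def pick_backend(backends, max_error_rate=0.05):
    """Return the backend with the lowest recent error rate.

    Backends above `max_error_rate` are skipped unless every backend is
    above it, in which case the least-bad one is still returned.
    """
    healthy = [b for b in backends if b.error_rate() <= max_error_rate]
    candidates = healthy or backends
    return min(candidates, key=lambda b: b.error_rate())
```

Falling back to the least-bad backend when none is healthy is a deliberate choice in this sketch: refusing to route at all would turn a partial degradation into a full outage.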

Thu, Oct 8, 2020, 10:53 AM (1 hour earlier)