Additional BigTable load was generated by the queue retry behaviour of a data replay that has been running today. Ravelin uses data replays to patch and upload data, and they are a part of normal operation. The data requests are batched per customer for ordering purposes. This evening we encountered a batch large enough that the queue handler timed out, because we do not ping the queue mid-batch. On timeout the entire batch is automatically put back on the queue to be processed again. We believe the BigTable CPU spike and subsequent API response errors are a result of this retry loop.
We will resume the replay once there is enough capacity for it to succeed. Tomorrow we will investigate pinging the message queue to avoid timing out mid-batch, and re-queuing only the parts of a batch yet to be processed.
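A minimal sketch of the two mitigations described above: periodically "ping" the queue while working through a long batch so the lease does not expire, and on failure re-queue only the unprocessed remainder rather than the whole batch. All names here are illustrative, not Ravelin's actual replay code.

```python
# Hypothetical sketch of the mitigation: keep the queue lease alive while
# working through a long batch, and on an error re-queue only the items
# not yet processed so completed items are not replayed.

def process_batch(batch, handle, extend_deadline, requeue, ping_every=100):
    """Process `batch` in order, pinging the queue every `ping_every` items.

    `handle` processes one item; `extend_deadline` is the queue "ping";
    `requeue` puts a list of items back on the queue.
    """
    for i, item in enumerate(batch):
        if i % ping_every == 0:
            extend_deadline()  # ping so the handler's lease doesn't time out mid-batch
        try:
            handle(item)
        except Exception:
            requeue(batch[i:])  # re-queue only the unprocessed tail, preserving order
            raise
```

The per-customer ordering requirement from the batching is preserved because the tail is re-queued as a single ordered slice.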
Posted Apr 11, 2021 - 20:12 BST
Following another small spike between 1941 and 1943 BST, we are investigating BigTable CPU spikes that correlate with the times these errors occurred. Scoring times remain elevated. We have paused background data-cleaning operations to reduce load.
Posted Apr 11, 2021 - 19:47 BST
There was an elevated error rate on the API, causing a spike in 500s between 1934 and 1938 BST, with another beginning now. This appears to correlate with a spike in BigTable CPU usage, which we are investigating.