Additional BigTable load was generated by the queue retry behaviour of a data replay that has been running today. Ravelin uses data replays to patch and upload data, and they are a part of normal operation. The data requests are batched per customer for ordering purposes. This evening we encountered a batch large enough that the queue handler timed out, because we do not ping the queue mid-batch. On timeout the entire batch is automatically put back on the queue to be processed again. We believe the BigTable CPU spike and subsequent API response errors are a result of this retry loop.
We will resume the replay once there is enough capacity for it to succeed. Tomorrow we will investigate pinging the message queue to avoid timing out mid-batch, and re-queuing only the parts of a batch yet to be processed.
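A minimal sketch of the two mitigations described above: periodically "ping" the queue while working through a long batch so the lease does not expire, and on failure re-queue only the unprocessed remainder rather than the whole batch. All names here are illustrative, not Ravelin's actual replay code.

```python
# Hypothetical sketch of the mitigation: keep the queue lease alive while
# working through a long batch, and on an error re-queue only the items
# not yet processed so completed items are not replayed.

def process_batch(batch, handle, extend_deadline, requeue, ping_every=100):
    """Process `batch` in order, pinging the queue every `ping_every` items.

    `handle` processes one item; `extend_deadline` is the queue "ping";
    `requeue` puts a list of items back on the queue.
    """
    for i, item in enumerate(batch):
        if i % ping_every == 0:
            extend_deadline()  # ping so the handler's lease doesn't time out mid-batch
        try:
            handle(item)
        except Exception:
            requeue(batch[i:])  # re-queue only the unprocessed tail, preserving order
            raise
```

The per-customer ordering requirement from the batching is preserved because the tail is re-queued as a single ordered slice.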
Posted Apr 11, 2021 - 20:12 BST
Following another small spike between 1941 and 1943 BST, we are investigating BigTable CPU spikes that correlate with the times these errors occurred. Scoring times remain elevated. We have paused background data-cleaning operations to reduce load.
Posted Apr 11, 2021 - 19:47 BST
There was an elevated error rate on the API, causing a spike in 500s between 1934 and 1938 BST, with another beginning now. This appears to correlate with a spike in BigTable CPU usage, which we are investigating.