Irregular traffic appears to have created a hot key in our database, which caused one of our database nodes to become unresponsive. We scaled up the number of nodes to limit the impact of that node. Our failover and retry logic was able to satisfy the majority of requests, significantly lowering the error rate by 19:34 BST, albeit with elevated response times. Error rates returned to zero and response times returned to normal by 20:30 BST. We are continuing to investigate the cause of this performance issue and mitigations for it.
Posted May 30, 2019 - 21:10 BST
We are investigating increased response times and error rates that began at 19:30 BST. Initial indications are that they stem from an increased error rate when communicating with our primary storage, which had a spike in response times but is beginning to return to normal as we process the backlog of requests queued for retry. We are investigating the initial cause of this spike and continuing to monitor.