Resolved
We're happy to report that everything is once again ticking along as it should.
Our first follow-up action from this incident will be to decrease the 400ms retry timeout between BigTable clusters, which was the primary source of latency during this incident. Our second follow-up action will be to consider raising the severity of our API response time alerts so that engineers are notified earlier.
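To illustrate why that timeout matters, here is a minimal sketch of the failover path described above, assuming a hypothetical query_cluster helper and placeholder cluster names (this is not our actual client code): while the primary is slow, every request waits out the full 400ms retry timeout before falling back to the secondary, so lowering that timeout directly reduces the added latency.

```python
PRIMARY = "bigtable-primary"        # placeholder name for the primary cluster
SECONDARY = "bigtable-secondary"    # placeholder name for the secondary cluster
RETRY_TIMEOUT_S = 0.4               # the 400ms cross-cluster retry timeout referenced above


def read_with_failover(request, query_cluster, timeout_s=RETRY_TIMEOUT_S):
    """Try the primary cluster first; on timeout, retry against the secondary.

    While the primary is responding slowly, every request pays the full
    timeout before failing over, which is how a 400ms timeout shows up as
    elevated API response times.
    """
    try:
        return query_cluster(PRIMARY, request, timeout=timeout_s)
    except TimeoutError:
        # The primary didn't answer within the timeout; fall back to the secondary.
        return query_cluster(SECONDARY, request, timeout=timeout_s)
```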
Monitoring
Five minutes ago we promoted our secondary BigTable cluster to be the primary, and it has been happily chewing through the write operations we had queued up. Those queues are now fully drained, and the write response time across API requests returned to its usual level as of 17:41:30 UTC.
Investigating
Our primary BigTable cluster is responding slowly, resulting in elevated API response times while we retry against the secondary cluster. The error rate remains low; requests only fail if your client times out waiting for a response.