Intermittent Calls Failing
Incident Report for Simwood
Postmortem

At 1809 tonight, a large object was routinely deleted from our master Redis instance. Whilst routine, this object was considerably larger than normal and caused the master to lock for a few seconds. This wasn't an issue in itself, as other writes are gracefully queued and any reads take place against many, many anycasted slave instances around the network.
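
By way of illustration only, the sketch below (using the Python redis-py client, with hypothetical hostnames and key names rather than anything from our production estate) shows the read/write split described above, and why synchronously deleting a very large key can stall the single-threaded master for a time, whereas UNLINK (available from Redis 4.0) frees the memory in the background.

    # Minimal sketch, assuming redis-py and hypothetical hostnames/keys.
    import redis

    # Writes go to the single master; reads go to the nearest anycasted replica.
    master = redis.Redis(host="redis-master.example.internal", port=6379)
    local_replica = redis.Redis(host="redis-read.anycast.example.internal", port=6379)

    master.set("route:441632960960", "endpoint-a")   # write path
    print(local_replica.get("route:441632960960"))   # read path, served locally

    # DEL of a very large object is processed synchronously and can block the
    # single-threaded master; UNLINK (Redis >= 4.0) reclaims the memory in a
    # background thread instead.
    master.unlink("some:very:large:object")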

This lock unusually triggered an automated failover of the master to another instance, which, again, is a well-tested process. All slaves were instructed to reconnect and, again, this is a well-tested automated process. Slaves are designed to continue serving requests even if the master is completely unavailable, and use partial replication for rapid, non-blocking synchronisation, so ordinarily this event would have been handled with no ill effects.
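
Again purely as a hedged sketch (hypothetical hostnames and key, redis-py assumed), the replica behaviour described above is visible in INFO replication: a replica configured to serve stale data (the Redis default) keeps answering reads even while its link to the master is briefly down.

    # Sketch only: inspect a replica's replication state with redis-py.
    import redis

    replica = redis.Redis(host="redis-read.anycast.example.internal", port=6379)

    repl = replica.info("replication")
    print(repl.get("role"))                     # 'slave' on a replica
    print(repl.get("master_link_status"))       # 'up' normally, 'down' while reconnecting
    print(repl.get("master_sync_in_progress"))  # 1 while a full resync is running

    # With replica-serve-stale-data enabled (the default), reads still succeed
    # while the link is down, which is why a brief master failover is
    # ordinarily invisible to queries.
    print(replica.get("route:441632960960"))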

However, in last weekend's maintenance, which involved a complete rebuild of our Slough site, new containers were deployed. Again, this is routine and happens continuously throughout the day. What was different, though, was that due to a configuration error a current rather than a fixed version of the particular Redis container was deployed, effectively replacing a late v4 instance with v5.
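
The remedy is simply to pin the image version. As an illustrative sketch only (the hostname and pinned value below are hypothetical, and this is not our deployment tooling), a deploy-time guard could compare the Redis version reported by a freshly deployed container against the pin:

    # Sketch only: verify a newly deployed Redis container matches the pinned version.
    import redis

    PINNED_MAJOR = 4  # hypothetical pin; the incident instance should have stayed on v4

    r = redis.Redis(host="redis-candidate.example.internal", port=6379)
    version = r.info("server")["redis_version"]   # e.g. '4.0.14' or '5.0.3'
    major = int(version.split(".")[0])

    if major != PINNED_MAJOR:
        raise SystemExit("refusing to deploy: redis %s, expected v%d.x"
                         % (version, PINNED_MAJOR))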

In the election of a new master, this v5 instance was selected. The result was that the vast majority of slave nodes around the world were unable to reconnect and went into a cycle of trying to reconnect, downloading the current database snapshot, and then failing to install it. Our well-tested recovery process involves recovering the database from off-net storage and reinstalling it, but this failed too owing to the version change.
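
For illustration, and again not our production tooling (hostnames are hypothetical), this kind of mismatch is detectable by comparing the major version reported by the elected master against each replica, alongside the replica's link state, since a replica cannot load a snapshot produced by a newer major version:

    # Sketch only: flag replicas that are up for queries but cannot resync,
    # e.g. because the elected master runs a newer major version.
    import redis

    master = redis.Redis(host="redis-master.example.internal", port=6379)
    master_major = int(master.info("server")["redis_version"].split(".")[0])

    for host in ("replica-lon1.example.internal", "replica-slo1.example.internal"):
        info = redis.Redis(host=host, port=6379).info()
        replica_major = int(info["redis_version"].split(".")[0])
        if info.get("master_link_status") != "up" or replica_major < master_major:
            print("%s: link=%s, v%d replica against v%d master"
                  % (host, info.get("master_link_status"), replica_major, master_major))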

Whilst the master remained available, as did three slave nodes, the vast majority of slaves were on-line and accepting queries but in an errored state. This caused any service that relies on querying a local Redis instance (e.g. call routing, registration proxy, portal, and API) to error in a large number of cases. Whilst all of these are designed to fail over, they were failing over to other Redis instances that were in many cases in the same state. Our Manchester AZ was unaffected, but the nature of the errors in other sites prevented it from being used other than where actively targeted by customer configurations.
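
As a hedged sketch of the client-side lesson here (hypothetical hostnames and key, redis-py assumed), failover needs to move past an instance that accepts connections but errors on queries, not merely past one that is unreachable:

    # Sketch only: fail over to the first Redis instance that actually answers,
    # not just the first that accepts a TCP connection.
    import redis

    CANDIDATES = [
        "redis-read.anycast.example.internal",  # nearest instance via anycast
        "redis-read.lon.example.internal",
        "redis-read.man.example.internal",      # Manchester was unaffected
    ]

    def routing_lookup(key):
        for host in CANDIDATES:
            try:
                r = redis.Redis(host=host, port=6379, socket_timeout=0.25)
                return r.get(key)   # an error reply or timeout moves us on
            except redis.exceptions.RedisError:
                continue
        raise RuntimeError("no healthy Redis instance available")

    print(routing_lookup("route:441632960960"))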

Stability and normal service were restored in the UK at 1840 by stopping the errored Redis instances. Due to anycast, this caused queries for most services to flow to the remaining working slaves. Having successfully run our v4 test suite against v5, new v5 instances were then intentionally deployed to restore numbers to their usual high level. One service, portal authentication, required a restart of some of its instances in order to force failover - a bug which has been logged for offline attention.
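
By way of illustration only (a hypothetical probe, not the actual mechanism used), this is the sort of local health check an anycast announcer can run so that an instance which is reachable but errored stops attracting queries:

    # Sketch only: a local health probe; a non-zero exit would withdraw the
    # anycast route so queries flow to the remaining working replicas.
    import sys
    import redis

    try:
        r = redis.Redis(host="127.0.0.1", port=6379, socket_timeout=0.25)
        r.ping()
        r.get("healthcheck:probe")   # an error reply here means "up but errored"
    except redis.exceptions.RedisError:
        sys.exit(1)                  # unhealthy: stop announcing
    sys.exit(0)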

The US was similarly affected but has a different restoration procedure, largely due to the added latency. It was initially resolved by way of a DNS update at 1915, whilst service from San Jose was restored to normal levels of redundancy by 2014, with New York following subsequently.

Whilst this was not a total outage mathematically, we are treating it as such and 100% SLA payments will be made for the 30 minutes between 1809 and 1840. We are confident in our well-practised processes for master failure and believe they worked well here - had it not been for the unintended version change, this would have caused no service issue and would have been handled in the ordinary course of automated operations. The configuration error, combined with the deployment of a new master candidate last weekend and the subsequent election of that instance as master, triggered the slave failures. We will take immediate action to ensure that container is version-locked so this cannot repeat. We will also undertake a review of the rest of our container estate to ensure that a similar problem due to an inadvertent version change cannot occur elsewhere.

Apologies to all affected and thank you for your understanding.

Posted Jan 30, 2019 - 22:44 UTC

Resolved
We are confident this issue is resolved and will post an RFO shortly.
Posted Jan 30, 2019 - 22:26 UTC
Monitoring
We are continuing to monitor and will provide an RFO in the morning
Posted Jan 30, 2019 - 20:21 UTC
Update
Traffic is flowing again in the US as of 19:15; we are continuing to investigate
Posted Jan 30, 2019 - 19:51 UTC
Update
We have deployed a fix which has resolved the issue in the UK as of 18:40, and are working to deploy it in our US sites
Posted Jan 30, 2019 - 19:22 UTC
Investigating
We are investigating sporadic reports of calls failing; more information will be provided as soon as it becomes available.
Posted Jan 30, 2019 - 18:38 UTC
This incident affected: Availability Zones (London, Slough, Manchester).