At 11.08 today we were alerted that call volumes were lower than normal and declining. We also began to receive reports of increased PDD and 503 call failures in our community Slack channel.
Our investigation later identified the root cause of these intermittent failures as excessive memory fragmentation on the master Redis node, which caused increased latency and connection failures. Those connection failures in turn caused some of the slave nodes, which are distributed throughout the network and handle all read activity, to resynchronise. Two call-routing nodes, one in Slough and one in Volta, intermittently alerted on increased PDD because one of the (many) local nodes they were querying was in this unstable state. Other nodes, and those at other sites, remained functional at this stage.
By 11.27, mid-investigation, our system automatically elected a new master node, and call volumes immediately began to climb, reaching normal levels quite quickly. In hindsight, this mitigated the primary issue.
However, by this time we were seeing more widespread reports of 403 ‘out of call credit’ and ‘account not enabled’ errors, both for call traffic and in the portal. These were network-wide, not restricted to a few nodes. We realised that numerous accounts had been marked as ‘credit blocked’, and by 11.54 we had manually reset them, restoring successful calls for the affected accounts.
As our investigations continued, during which service was working normally, there was a second instance of some ‘credit blocked’ accounts at 13.47. These were corrected immediately, and the repeat occurrence helped us identify the cause.
We subsequently discovered that one of the services responsible for monitoring calls in progress and disabling accounts had not failed over: it was still connected to the old master and was therefore continuing to experience connection difficulties.
A bug was identified whereby, if this (out-of-band) service failed to get any value back for a particular key in Redis (as opposed to receiving a negative result), the account was treated as disabled. ‘Disabled’ accounts with call attempts are blocked at a different level in our stack, to enable more efficient call rejection. This explains the progression of error messages some customers saw: their accounts were first identified as disabled, then marked as blocked for lack of credit.
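To illustrate the class of bug described above, here is a minimal, hypothetical sketch (not our actual code; the function and exception names are invented for illustration). The fault is collapsing “the lookup failed” into “the key returned no value”, so a connection error to the old master reads as an account being disabled:

```python
class RedisUnavailable(Exception):
    """Stands in for a connection/timeout error against the old master."""


def buggy_is_enabled(get, account_id):
    # Bug: any lookup that yields no value -- including a failed
    # connection -- is treated as "account disabled".
    try:
        value = get(account_id)
    except RedisUnavailable:
        value = None  # error silently collapsed into "no value"
    return value is not None


def fixed_is_enabled(get, account_id):
    # Fix: a genuine negative result (key absent) disables the account,
    # but a failed lookup does not -- fail open rather than block calls.
    try:
        value = get(account_id)
    except RedisUnavailable:
        return True  # fail open: don't disable an account on a lookup error
    return value is not None


def flaky_get(account_id):
    # Simulates the stuck connection to the old master.
    raise RedisUnavailable()
```

Under a connection failure, the buggy path disables the account while the fixed path leaves it enabled, which matches why patching the service stopped the ‘credit blocked’ recurrences.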
This calls-in-progress service was stopped as a precautionary measure, the bug was patched, and the service was restarted. The incident did not recur.
Technically speaking, service was degraded from 11.07 until 11.27, with a relatively small percentage of calls affected. However, because a number of accounts experienced complete network-wide call rejection until 11.54, starting at different times for each affected account, we are treating this as an SLA-eligible incident from 11.07 until 11.54, and again for one minute at 13.47. Our statistics show this grossly overstates the aggregate impact, but we appreciate that the accounts affected experienced a complete loss of service.
We’ve learned some useful lessons through this incident and will schedule remedial work to prevent a recurrence as soon as possible.
We’re sorry for the disruption caused here, and we are very grateful to our community Slack members, who provided helpful insight to complement our own telemetry and helped us identify a potentially elusive issue.