Recently we discovered that calls being sent to us by BT, particularly to ported numbers, increasingly included the destination number in a non-standard (i.e. invalid) format. As a result, these calls matched unexpected routing and caused issues for our customers. However, upon reporting this non-conformity to BT, they confirmed they were unable or unwilling to fix it, since the format matched their own routing plan and their routing engine lacked the flexibility to accommodate a change.
We therefore prepared a config change to accommodate the invalid numbers. Automated testing, by replaying historical live call scenarios, and continuous deployment are standard practice for us, and once that testing was complete we initiated the rollout to production during the afternoon of 14th November. The rollout went to each call routing instance in turn, of which there are many in each of our 5 availability zones, while we watched channel and call levels closely for any signs of issues.
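To illustrate the kind of change involved: a minimal sketch, assuming the fix amounted to coercing the non-standard presentations into a canonical form before the routing lookup. The function name and the specific formats handled are hypothetical; the actual config and formats are not public.

```python
import re

def normalise_destination(raw: str) -> str:
    """Hypothetical sketch: coerce assorted UK number presentations into a
    canonical +44... form so malformed presentations still match the
    intended route rather than falling through to unexpected routing."""
    digits = re.sub(r"[^0-9+]", "", raw)   # strip separators and whitespace
    if digits.startswith("+"):
        return digits                       # already in canonical form
    if digits.startswith("00"):
        return "+" + digits[2:]             # 00-prefixed international form
    if digits.startswith("0"):
        return "+44" + digits[1:]           # national form, assume UK
    return "+" + digits                     # bare country-code form
```

With this in front of the routing table, `"020 7946 0000"`, `"0044 20 7946 0000"` and `"+44 20 7946 0000"` all resolve to the same routing key.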
Part-way through the rollout, at 15.45, we were alerted that some customers were rejecting a number of inbound calls because the RURI and To header in the outgoing INVITEs for those calls were truncated. We halted the rollout and rolled everything back to the previous state, which remedied the situation; normal service was confirmed by 15.55.
Further investigation revealed that the config change, and the accompanying suite of tests, had failed to consider a particular routing scenario involving hosted ranges, which applies to a very small number of customers. In all, around 2% of inbound calls across the entire network were affected, which made the issue extremely difficult to identify from the channel metrics, particularly as we were now intentionally rejecting the improper calls that BT couldn't or wouldn't fix. Whilst the overall impact was very small, and our Community Slack was uncharacteristically silent on the issue, the 15 customers affected on their hosted ranges in some cases saw a much higher percentage of calls affected, depending on their individual traffic and configuration mix.
In terms of lessons learned from this incident, we do not believe that avoiding changes altogether, as some would advocate, is a competent approach. Equally, we do not believe it is acceptable to deploy out of hours, when some call scenarios are absent, only to see issues surface the following day once those scenarios return, the deployment is complete, and attention has turned elsewhere; that assures a bigger impact, later, and a slower response. Further, with a large distributed network, we defend our approach of progressive automated rollout over manual updates to monolithic instances. Thus, as is our standard practice when our test suite fails to accommodate a scenario, we have updated it to do so. This enables us to continue to iterate rapidly, with automated testing providing the assurance it has so far through thousands of deployments, and absolute consistency around the network.
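As a toy illustration of the kind of regression case added to the replay suite, assuming the bug was the matched prefix being forwarded in place of the full number: a longest-prefix routing lookup with a hosted-range entry, and an assertion that the full destination survives the match untruncated. All names, prefixes and trunk labels here are invented for the example, not our actual routing data.

```python
def route_lookup(routes: dict[str, str], number: str) -> tuple[str, str]:
    """Longest-prefix match over a toy routing table.
    Returns (trunk, number-as-forwarded); the forwarded number must always
    be the full destination, never the matched prefix."""
    for prefix in sorted(routes, key=len, reverse=True):
        if number.startswith(prefix):
            return routes[prefix], number
    raise LookupError(number)

# Toy table: a hosted-range prefix alongside a catch-all route.
routes = {"+44207946": "hosted-range-trunk", "+44": "default-trunk"}

trunk, forwarded = route_lookup(routes, "+442079460123")
assert trunk == "hosted-range-trunk"
assert forwarded == "+442079460123"   # a truncation here is exactly what the test guards against
```

Replaying a historical hosted-range call through a check like this fails the build if any future change truncates the outgoing number again.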
We’re sorry to those customers affected who, for the avoidance of doubt, had nothing “wrong” in their configuration at all. It was simply an edge case we’d missed, and one which will now be tested automatically with every committed change in future.