London - Intermittent call failures
Incident Report for Simwood
Resolved
This morning we experienced a brief issue affecting one of the nodes in our London site. Whilst it continued handling calls, we were unable to manage it and therefore withdrew it from service. As a precaution, we then redirected traffic from the London site to our site in Slough whilst we investigated the London issue.

Since introducing the new stack last year, this is the first time we've had cause to reroute traffic in this way outside of our own testing.

At the time, we were experiencing considerably higher volumes of traffic than normal. The Slough site can easily handle far more than the London traffic in addition to its own usual load. However, some software-defined limits had wrongly been set at a level far lower than the available capacity, and these were tripped unexpectedly.

As a result, some customers experienced unexpected call rejections. Most customer equipment then retried almost immediately, which in turn tripped some customer-set account and trunk rate limits, so subsequent attempts (which might otherwise have succeeded) were also blocked.
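To illustrate why immediate retries can compound a rejection, the sketch below spaces retry attempts using exponential backoff with full jitter. It is an assumption-laden example rather than a description of any particular customer platform or of our own systems: the function name `backoff_delays` and the `base` and `cap` values are hypothetical and would need tuning for real call traffic.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one retry delay (in seconds) per attempt.

    Hypothetical helper: each delay is drawn uniformly between 0 and an
    exponentially growing ceiling ("full jitter"), so retries from many
    devices spread out instead of arriving together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

Spreading retries in this way makes it less likely that a burst of repeat attempts will exceed an account or trunk rate limit on its own.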

The issue in London was corrected within a few minutes, although changes to DNS took longer to propagate in some cases. We are continuing to investigate the underlying cause.

We also noticed some customers continuing to send traffic directly to an IP address. We encourage all customers to use the FQDNs provided and, where possible, to ensure that their platforms support the SRV records on these domains.
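For readers unfamiliar with how SRV records steer traffic, the sketch below orders a set of already-resolved records loosely following the selection rules in RFC 2782: lowest priority value first, then a weighted random choice within each priority. The record values and `example.net` hostnames are hypothetical and are not our actual DNS data, and `order_srv` is an illustrative helper rather than part of any SIP stack.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records, rng=None):
    """Return records in the order a client should try them."""
    rng = rng or random.Random()
    ordered = []
    # Lower priority values are tried first.
    for priority in sorted({r.priority for r in records}):
        # Zero-weight records go first in the running sum, as RFC 2782 suggests.
        group = sorted(
            (r for r in records if r.priority == priority),
            key=lambda r: r.weight != 0,
        )
        while group:
            total = sum(r.weight for r in group)
            threshold = rng.randint(0, total)
            running = 0
            for record in group:
                running += record.weight
                if running >= threshold:
                    # Higher weights are proportionally more likely to be picked.
                    ordered.append(record)
                    group.remove(record)
                    break
    return ordered


# Hypothetical records: two equal-priority targets and one backup.
records = [
    SrvRecord(20, 0, 5060, "backup.example.net"),
    SrvRecord(10, 60, 5060, "a.example.net"),
    SrvRecord(10, 40, 5060, "b.example.net"),
]
for record in order_srv(records):
    print(record.priority, record.target)
```

A platform that resolves the FQDN in this way follows a DNS change automatically, whereas one sending to a fixed IP address does not.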

The configuration of the Slough site will be reviewed to ensure that the artificially low software-enforced limits are set more appropriately, which will prevent any recurrence of this situation.

Additionally, we are looking to review how our own rate limits (intended for your own control and fraud prevention) behave in a situation such as this, to reduce the risk of compounding the issue when calls are rejected for another reason.

Once again, we apologise for any inconvenience this caused and, as always, if you have any further questions please do not hesitate to get in touch.
Posted Feb 26, 2018 - 12:00 UTC