Intermittent outbound call failures via BT network
Incident Report for Simwood
Postmortem

Earlier this morning we noticed a substantial increase in traffic from a number of our larger customers. Call attempts were running at 7-8 times their normal level, causing these customers to hit their channel limits with us. Indications are that this was due to a failure at another provider.

Any call attempt generates several events within our system, which feed through queues into various statistics such as channel utilisation against limits and real-time calls in progress. Unfortunately, whilst we were rejecting excess calls and had ample actual capacity for the calls we accepted, a backlog of events developed. Within a short space of time we were over 1 million events behind, which resulted in channel counts and related statistics being inaccurate, i.e. lagging behind reality.
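
To illustrate (in simplified, hypothetical form; our actual event schema and queueing technology are not shown here), the dependency looks roughly like this: the switching layer publishes an event for every call start and end, and consumer daemons apply those events to per-customer channel counters. If the consumers fall behind, the counters lag reality:

```python
# Simplified, hypothetical sketch of the event-driven channel counters.
# Event names and structures are illustrative, not our actual schema.
from collections import defaultdict
from queue import Queue

event_queue = Queue()               # stands in for the real queueing layer
channels_in_use = defaultdict(int)  # per-customer channel count

def publish(event_type, customer_id):
    """Called by the switching layer for every call set-up and teardown."""
    event_queue.put({"type": event_type, "customer": customer_id})

def consume_one():
    """Consumer daemon: apply one event to the counters.
    If consumers fall behind, these counts lag reality."""
    event = event_queue.get()
    if event["type"] == "call_start":
        channels_in_use[event["customer"]] += 1
    elif event["type"] == "call_end":
        channels_in_use[event["customer"]] -= 1
    event_queue.task_done()
```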

Two consequences of this were: a) we were failing to route calls to certain BT exchanges, as capacity was being withheld to allow for incoming calls despite there being ample actual capacity; and b) some customers were seeing calls rejected for being over their channel limit when in fact they were not. UK geographic and mobile calls were less affected; NGN and some international calls more so.
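
Both symptoms stem from admission checks reading counters that were lagging. A rough, illustrative sketch (the reservation logic, limits and names below are invented for clarity, not our production code):

```python
# Hypothetical admission checks reading the (possibly stale) counters.
# Capacities, reserve fraction and names are invented for illustration.
EXCHANGE_CAPACITY = {"example_exchange": 480}  # channels per BT exchange
INBOUND_RESERVE = 0.2                          # fraction held back for incoming calls

def exchange_has_headroom(exchange, exchange_channels_in_use):
    usable = EXCHANGE_CAPACITY[exchange] * (1 - INBOUND_RESERVE)
    # A stale, inflated count makes this return False even when real
    # utilisation is low, so outbound calls to that exchange fail.
    return exchange_channels_in_use[exchange] < usable

def customer_within_limit(customer, channel_limit, channels_in_use):
    # Likewise, a lagging per-customer count rejects calls from customers
    # who are actually well under their limit.
    return channels_in_use[customer] < channel_limit
```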

Despite increasing the daemons that handle events to more than 10 times their usual (and normally ample) level, with 600 running across sites, the backlog continued to grow.
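
The reason extra daemons alone could not clear the backlog is simple arithmetic: the backlog only shrinks while the aggregate drain rate exceeds the arrival rate, and once the queue hosts themselves became the bottleneck (see the swapping described below), adding consumers added nothing. A back-of-envelope illustration with invented figures:

```python
# Invented figures, purely to illustrate why more consumers didn't help.
def backlog_after(seconds, arrival_rate, per_consumer_rate, consumers, host_ceiling):
    # Effective drain rate is capped by the queue host's own throughput,
    # so beyond that point extra daemons make no difference.
    drain_rate = min(consumers * per_consumer_rate, host_ceiling)
    return max(0, (arrival_rate - drain_rate) * seconds)

# 600 consumers could nominally drain 30,000 events/s, but a saturated
# host capped at 4,000 events/s still leaves a growing backlog:
print(backlog_after(600, arrival_rate=5000, per_consumer_rate=50,
                    consumers=600, host_ceiling=4000))  # -> 600000
```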

We took other measures to reduce the number of attempts from the key culprits and to lift channel limits that were being hit artificially, enabling traffic to flow. This resulted in intermittent behaviour for a short time but progressively restored service. However, the event backlogs continued to grow.

At this stage the queues had grown so large that they were causing their host operating systems to swap to disk, substantially slowing performance. These servers were moved to pure direct SSD storage (from grid-based SSD storage) to improve IO, but queues continued to grow to unprecedented levels.
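
As an aside, the tell-tale sign here was memory pressure on the queue hosts rather than CPU. A minimal monitoring sketch for catching this earlier (not our actual tooling; it uses the third-party psutil package and an arbitrary threshold):

```python
# Minimal sketch (not our actual tooling): warn when a queue host starts
# swapping, which is what degraded throughput here. Requires the
# third-party psutil package; the threshold is arbitrary.
import psutil

def check_memory_pressure(swap_percent_threshold=5.0):
    swap = psutil.swap_memory()
    mem = psutil.virtual_memory()
    if swap.percent > swap_percent_threshold:
        print(f"WARNING: {swap.percent:.1f}% swap in use, "
              f"{mem.percent:.1f}% RAM used - queue host under memory pressure")
    return swap.percent

check_memory_pressure()
```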

As a final measure, we created new queue hosts on substantially more powerful hardware and directed new events to these. This immediately remedied the situation.
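
In effect, new events were pointed at the fresh hosts while the original hosts drained their existing backlog. Conceptually (hostnames and mechanism below are invented; this is not our actual routing code):

```python
# Illustrative only: spread new events across the freshly built queue hosts
# while the original hosts work off their backlog separately.
NEW_QUEUE_HOSTS = ["queue-new-1.example.net", "queue-new-2.example.net"]

def pick_queue_host(event_id):
    # Simple hash-based spread of new events across the new hosts.
    return NEW_QUEUE_HOSTS[hash(event_id) % len(NEW_QUEUE_HOSTS)]
```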

This therefore appears to have been a cosmetic issue, in that all capacity was intact but we were rejecting calls because it appeared utilised, owing to our failure to process a substantially higher volume of events in real time during substantially elevated call attempts. Moving forward, events are crucial to our architecture. We were able to substantially increase processing capacity, both in-line and, when that wasn't sufficient, by adding new queue servers. We will work to permanently increase in-line queue capacity so the headroom is greater than the 10-fold we had, but more importantly, in a forthcoming refresh of some of our architecture we will look to handle rejected calls higher in the stack, thus reducing the underlying cause of this issue.

Our apologies for this incident and we're very grateful to those customers who did not telephone and kept an eye on this page.

Posted Aug 13, 2015 - 13:00 UTC

Resolved
This issue has now been resolved; engineers will continue to monitor the situation.

We will also publish a post-mortem shortly.
Posted Aug 13, 2015 - 11:59 UTC
Update
Please note that real-time stats on the portal and API will be inaccurate during this incident; these will start to return to normal shortly.
Posted Aug 13, 2015 - 11:26 UTC
Identified
Engineers have identified the underlying issue and are working on a resolution.

Changes have been made to ensure that calls flow as smoothly as possible, and you should no longer be experiencing call failures; however, the underlying issue remains and is being worked on.

Please continue to monitor this status page for updates rather than calling our Operations Desk.
Posted Aug 13, 2015 - 11:20 UTC
Investigating
We are noticing intermittent call failures on outbound calls sent via the BT network.

Our engineers are investigating at this time.
Posted Aug 13, 2015 - 09:44 UTC