BT circuits London
Incident Report for Simwood
Postmortem

At 10.34 this morning we noticed a high level of failures amongst calls to BT through our Telehouse London site. This corresponded with all circuits into all BT POIs from that site being down. This represented 50% of our capacity to/from BT. At 10.44 we received an email from BT's Traffic Management team reporting they were seeing the same. To be clear, this wasn't a single interconnect to a single BT POI but all of our interconnects to and from various BT POIs from that site.

A reset of the circuits on our side failed, but a visual inspection at 11.54 (performed by Telehouse engineers) confirmed all was normal in terms of power and connectivity to BT. The issue had meanwhile been reported to BT's IC Repair team.

We operate our interconnects in a very different way to most, and no action was required on our part for traffic to fail over to Slough in either direction. This was thus a reduction in capacity and redundancy, and non-service affecting in the basic sense.
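As a minimal sketch of why no manual intervention is needed (illustrative Python with invented names and data, not our actual routing logic): if outbound route selection simply skips interconnects that are out of service, calls are offered to the surviving site automatically.

```python
# Illustrative only: interconnect state as it stood this morning.
INTERCONNECTS = [
    {"site": "Telehouse London", "in_service": False},  # all London circuits down
    {"site": "Slough", "in_service": True},
]

def usable_interconnects():
    """Interconnects currently able to carry calls to/from BT."""
    return [ic for ic in INTERCONNECTS if ic["in_service"]]

# Every new call is offered only to what remains -- here, Slough.
print([ic["site"] for ic in usable_interconnects()])  # ['Slough']
```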

Overall traffic was 12% higher than at the same time yesterday although this masks an underlying trend of outbound traffic (very little of which goes to BT) being 21% higher, and inbound traffic (a large proportion of which comes from BT) being 6% lower. We had headroom on our Slough BT interconnects to contain all traffic, as planned.
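For context on how a 12% overall rise can mask those two opposing moves, a quick worked check (the actual outbound/inbound split is not stated here, so the share below is implied rather than measured):

```python
# Solve overall = f * 1.21 + (1 - f) * 0.94 = 1.12 for f, the outbound share
# of yesterday's traffic. Illustrative back-calculation only.
f = (1.12 - 0.94) / (1.21 - 0.94)
print(f"implied outbound share of yesterday's traffic: {f:.0%}")  # ~67%
```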

Some time later we became concerned that, whilst there remained headroom on the Slough interconnects, it was not as much as we would like. Our system was automatically regulating outbound traffic to maintain room for BT's incoming traffic to us, but the margin was unacceptably small.
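A minimal sketch of that kind of regulation (assumed numbers and names, not the actual implementation): new outbound call attempts are only admitted while the shared interconnect keeps a configured margin free for BT's inbound traffic.

```python
TOTAL_CHANNELS = 2000     # assumed interconnect capacity, for illustration
INBOUND_HEADROOM = 300    # channels to keep free for incoming BT traffic

def admit_outbound_call(channels_in_use: int) -> bool:
    """Admit a new outbound call only if the configured inbound headroom
    would still remain free afterwards."""
    return channels_in_use + 1 <= TOTAL_CHANNELS - INBOUND_HEADROOM

print(admit_outbound_call(1600))  # True: margin intact
print(admit_outbound_call(1700))  # False: outbound held back to protect inbound
```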

We took the decision at 11.48 to disable the ability for accounts to burst beyond their allotted channel limit, as described here: http://blog.simwood.com/2016/07/relaxed-channel-limits/. In other words, we enforced channel limits to ensure one customer could not adversely affect another by using more capacity than they'd been allocated. This meant that customers using beyond their allotted channel limit would have seen some of their traffic restricted. This is a capability that didn't exist prior to July 2016, so for the avoidance of doubt, all we did was reset that capability to how it was prior to that date. To be clear, all Primary and Backup capacity for Virtual Interconnect and Managed Interconnect customers was available, and all "best efforts" primary channel limits for Startup customers were available; merely the "below best efforts" traffic over and above channel limits was restricted. We strongly believe this was the fair thing to do to control the situation and infinitely better than a free-for-all. All traffic we'd committed to was serviced with 50% of our interconnects down.
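To make the tiering concrete, here is a minimal sketch (assumed field names, not our billing or switching code) of the enforcement step: calls within an account's allotted limit are always admitted, while calls beyond it, the "below best efforts" tier, are only carried when bursting is enabled, which is what we switched off at 11.48 and restored later.

```python
from dataclasses import dataclass

@dataclass
class Account:
    channel_limit: int      # allotted channels (reserved or "best efforts" primary)
    channels_in_use: int    # concurrent calls currently up

def admit_call(account: Account, bursting_enabled: bool) -> bool:
    """Decide whether a new call from this account may be set up."""
    if account.channels_in_use < account.channel_limit:
        return True              # within allocation: always admitted
    return bursting_enabled      # beyond allocation ("below best efforts"):
                                 # carried only when bursting is permitted

acct = Account(channel_limit=30, channels_in_use=30)
print(admit_call(acct, bursting_enabled=True))   # True: bursting permitted (normal)
print(admit_call(acct, bursting_enabled=False))  # False: limits enforced (11.48-15.33)
```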

At 12.40 we noticed that the interconnects were back in service and BT was passing traffic to us over them. On contacting BT, we were told the fault was still with their diagnostics team and hadn't been touched. An engineer was scheduled to look at it at 14.00 and was instead re-tasked to diagnose what had happened. At 15.23 we received an RFO from BT which, whilst only one sentence, suggested to us that an SDH card had been replaced at a point in the BT network common to all our interconnects from that site, possibly incidentally to another fault. We followed up at 17.00 for more information but none was available.

We had been phasing traffic back onto the London interconnects since the all clear, and this included permitting "below best efforts" traffic, i.e. that beyond customer channel limits. This process was concluded at 15.33.

We'd again like to thank customers for recognising that losing 50% of our capacity to BT was quite extraordinary but that our design and procedures worked exactly as intended. We were able to honour every guarantee we'd made and all "best efforts" channel allocations; the only traffic partially impaired was that beyond agreed channel limits. Naturally, we're sorry for any inconvenience caused.

Posted Feb 24, 2017 - 18:58 UTC

Resolved
This incident is confirmed resolved. The RFO from BT is not exactly comprehensive and requires further conversations. We will post more in our own RFO.

We'd like to thank all customers for their understanding and compliments on how this was handled. Losing 50% of our capacity into/from BT is quite a big deal that could have been a major outage. The way the network is designed ensured it wasn't, and our hierarchical priority of channel allocations (http://blog.simwood.com/2016/07/relaxed-channel-limits/) ensured that all Reserved and Best Efforts capacity was honoured. A proportion of Below Best Efforts traffic (i.e. that above any allocated channel limit which we'd normally let pass if we could) was constrained. We consider this a far better outcome than the alternative.
Posted Feb 24, 2017 - 15:43 UTC
Update
Circuits appear to have remained stable. BT were scheduled to work on this job at 2pm but the engineer has instead been tasked with establishing what happened. We will confirm that and mark this incident resolved when we hear the outcome.
Posted Feb 24, 2017 - 14:19 UTC
Monitoring
We have seen our London circuits come back up. We await feedback from BT and will be monitoring before putting them back into service.
Posted Feb 24, 2017 - 12:44 UTC
Identified
We are seeing all our circuits facing BT from our London Telehouse site as down. This has been reported to BT and we will update this incident with feedback.

Due to our unique architecture this is not service affecting, as calls from BT are seamlessly flowing via Slough, but it does reduce our redundancy to n (from n+1) in respect of BT, and our overall capacity to/from BT. Virtual Interconnect and Managed Interconnect customers have reserved capacity for this eventuality, and whilst there is presently adequate headroom, customers without a commitment may see capacity constrained at times.
Posted Feb 24, 2017 - 11:05 UTC