At 10.34 this morning we noticed a high failure rate on calls to BT through our Telehouse London site. This corresponded with all circuits into all BT POIs from that site being down, representing 50% of our capacity to/from BT. At 10.44 we received an email from BT's Traffic Management team reporting that they were seeing the same. To be clear, this wasn't a single interconnect to a single BT POI, but all of our interconnects to and from various BT POIs from that site.
A reset of the circuits at our side failed, but a visual inspection (performed by Telehouse engineers) at 11.54 confirmed all was normal in terms of power and connectivity to BT. The issue had meanwhile been reported to BT's IC Repair team.
We operate our interconnects very differently to most, and no action was required on our part for traffic to fail over to Slough in either direction. This was therefore a reduction in capacity and redundancy, and non-service-affecting in the basic sense.
Overall traffic was 12% higher than at the same time yesterday although this masks an underlying trend of outbound traffic (very little of which goes to BT) being 21% higher, and inbound traffic (a large proportion of which comes from BT) being 6% lower. We had headroom on our Slough BT interconnects to contain all traffic, as planned.
Some time later we became concerned that, whilst there remained headroom on the interconnects, it was not as much as we would like. Our system was automatically regulating outbound traffic to maintain room for BT's incoming traffic to us, but the margin was unacceptably small.
We took the decision at 11.48 to disable the ability for accounts to burst beyond their allotted channel limit, as described here. In other words, we enforced channel limits to ensure one customer could not adversely affect another by using more capacity than they'd been allocated. This meant that customers using beyond their allotted channel limit would have seen some of their traffic restricted. This capability didn't exist prior to July 2016, so for the avoidance of doubt, all we did was reset that capability to how it was before that date. To be clear, all Primary and Backup capacity for Virtual Interconnect and Managed Interconnect customers was available, and all "best efforts" primary channel limits for Startup customers were available; only the "below best efforts" traffic over and above channel limits was restricted. We strongly believe this was the fair thing to do to control the situation, and infinitely better than a free-for-all. All traffic we'd committed to was serviced with 50% of our interconnects down.
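For illustration only, the enforcement described above can be thought of as a simple admission-control check at call setup: within the channel limit a call is always admitted, and beyond it a call is admitted only while bursting is enabled. This is a minimal sketch under that assumption; the names (`Account`, `admit_call`, `bursting_allowed`) are hypothetical and not our actual system.

```python
from dataclasses import dataclass

@dataclass
class Account:
    name: str
    channel_limit: int      # allotted channels (guaranteed / "best efforts")
    active_calls: int = 0   # calls currently in progress

def admit_call(account: Account, bursting_allowed: bool) -> bool:
    """Admit a new call for the account, or restrict it.

    Calls within the channel limit are always admitted. Calls beyond the
    limit ("below best efforts" traffic) are admitted only while bursting
    is enabled; disabling bursting restricts only that excess traffic.
    """
    if account.active_calls < account.channel_limit:
        account.active_calls += 1
        return True
    if bursting_allowed:
        # Beyond the allotted limit: permitted only when capacity allows.
        account.active_calls += 1
        return True
    # Limit enforced: traffic beyond the allocation is restricted.
    return False
```

With bursting disabled, an account already at its limit has further calls restricted while every other account's allocation remains protected, which is the effect the change at 11.48 was intended to have.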
At 12.40 we noticed that the interconnects were back in service and BT was passing traffic to us over them. When we contacted them, they believed the fault was still with their diagnostics team and hadn't been touched. An engineer was scheduled to look at it at 14.00 and was instead re-tasked to diagnose what had happened. At 15.23 we received an RFO from BT which, whilst only one sentence, suggested to us that an SDH card had been replaced at a point in the BT network common to all our interconnects from that site, possibly incidentally to another fault. We followed up at 17.00 for more information but none was available.
We had been phasing traffic back onto the London interconnects since the all clear, and this included permitting "below best efforts" traffic, i.e. that beyond customer channel limits. This process was concluded at 15.33.
We'd again like to thank customers for recognising that losing 50% of our capacity to BT was quite extraordinary, and that our design and procedures worked exactly as intended. We were able to honour every guarantee we'd made and all "best efforts" channel allocations; the only traffic partially impaired was that beyond agreed channel limits. Naturally, we're sorry for any inconvenience caused.