Earlier this morning we became aware of degraded performance on VPLS services between Manchester and Slough. This was linked to packet loss on one leg of our Manchester metro ring, which forms part of the western route from Manchester to Slough. Traffic was migrated from the affected ports to the eastern route which, whilst travelling via London, was loss-free and has far greater capacity.
We were unable to reroute the affected LSPs, and thus some VPLS services continued to flow over the affected western route. In an effort to force removal of stale routes an interface was disabled, but the routes were not removed and thus the affected LSPs lost their paths.
We rely on OSPF between routers to advertise loopback addresses, and LSPs (and iBGP) are configured between those loopback addresses. It transpired that OSPF on one of our Manchester routers was failing to learn new routes and was advertising stale ones. OSPF was therefore restarted there, which removed many of the routes but did not help it learn new ones. This led to the severe network instability that any customer with traffic transiting Manchester would have experienced, and to the loss of VPLS service between Slough and Manchester. It further transpired that the second router in Manchester was exhibiting the same behaviour, and OSPF was also restarted there.
Given the instability we proceeded, wrongly with the benefit of hindsight, to restart OSPF on other routers on the network. This did not help stabilise the situation and caused further intermittent instability between other sites.
When we identified that Manchester routers were not learning OSPF routes we proceeded to add them manually, restoring stability progressively.
Manchester is now configured entirely without OSPF learning but the network is stable. Packet loss has been removed on the ring by cleaning fibres.
Fully resolving this will require scheduled maintenance and a reload of both routers in Manchester (sequentially). This will be performed out of hours and will be scheduled ASAP.
As to affected customers, an early advisory suggested that those using hostnames were unaffected; this was overly simplistic. Indeed, many customers were forcing traffic to the Manchester IP address and so did not benefit from either SRV failover on their own equipment or our change of the A record. However, whether or not other customers were affected depended on the extent to which they were transiting the network. Customers entering in Slough or London and having their call completed there were unaffected, whilst customers sending traffic to one of those sites but transiting Manchester in some way would have been affected. The rolling OSPF restarts would have affected transit between any two sites at the time the routers involved were restarted. We again encourage customers to configure using the appropriate hostname, ensure their equipment supports SRV failover and choose a primary site that is the first they hit on our network. There is no value in transiting one site to force traffic to another, although this kind of issue is highly extraordinary.
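For customers unsure what SRV failover amounts to in practice, the following is a minimal Python sketch of how a client might order SRV records and move on to the next site when the preferred one is unreachable. It follows the general RFC 2782 approach (lowest priority value first, weighted choice among equal priorities). The hostnames, priorities, weights and the `is_up` check are hypothetical placeholders for illustration only; they are not our real DNS records, and real equipment will implement this in its own way.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv(records: Iterable[SrvRecord],
              rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client would try them."""
    if rng is None:
        rng = random.Random()
    by_priority: Dict[int, List[SrvRecord]] = {}
    for rec in records:
        by_priority.setdefault(rec.priority, []).append(rec)

    ordered: List[SrvRecord] = []
    for priority in sorted(by_priority):
        pool = list(by_priority[priority])
        while pool:
            weights = [rec.weight for rec in pool]
            if sum(weights) == 0:
                # All weights are zero: pick uniformly.
                pick = rng.choice(pool)
            else:
                # Weighted pick without replacement.
                pick = rng.choices(pool, weights=weights, k=1)[0]
            pool.remove(pick)
            ordered.append(pick)
    return ordered


def first_reachable(records: Iterable[SrvRecord],
                    is_up: Callable[[SrvRecord], bool],
                    rng: Optional[random.Random] = None) -> Optional[SrvRecord]:
    """Walk the ordered list and return the first target that responds."""
    for rec in order_srv(records, rng):
        if is_up(rec):
            return rec
    return None


# Hypothetical records: Manchester preferred, then Slough, then London.
RECORDS = [
    SrvRecord(30, 10, 5060, "london.sip.example.net"),
    SrvRecord(10, 10, 5060, "manchester.sip.example.net"),
    SrvRecord(20, 10, 5060, "slough.sip.example.net"),
]

# Simulate Manchester being unreachable; the client falls back to Slough.
unreachable = {"manchester.sip.example.net"}
chosen = first_reachable(RECORDS, lambda rec: rec.target not in unreachable)
print(chosen.target if chosen else "no site reachable")
```

The point of the sketch is that failover happens on the customer's equipment: a device pointed at a fixed IP address never consults this list, so it cannot move to another site by itself.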
For our part, this has highlighted inadequacies in our internal monitoring that undoubtedly obscured the problem. We were consequently slower than we should have been in pinpointing the source of the trouble in order to mitigate it. We operate multiple edge sites in order to prevent this kind of issue, and indeed haven't had one since 2010, but the fact that an issue in one site can affect others by virtue of us operating a backbone in between does raise questions over the value of doing so and highlights the need for greater separation. The two routers in Manchester are running the same version of router firmware, which differs from that in other sites; we will investigate the possibility of a bug not present in other versions.
Our statistics suggest that completed call volumes during the affected period were approximately 40% lower than the same period yesterday. Some customers were unaffected, a few entirely affected and others saw intermittent behaviour.
We've prided ourselves on our stability and benefitted greatly from regular outages elsewhere. We therefore take this extremely seriously and will be ironing out the issues identified, probably involving a vendor change. Sorry again.