Network Issue
Incident Report for Simwood
Postmortem

Earlier this morning we became aware of degraded performance on VPLS services between Manchester and Slough. This was linked to packet loss on one leg of our Manchester metro ring, which forms part of the western route from Manchester to Slough. Traffic was migrated from the affected ports to the eastern route which, whilst travelling via London, was loss-free and has far greater capacity.

We were unable to reroute the affected LSPs, so some VPLS services continued to flow over the lossy western path. In an effort to force removal of the stale routes an interface was disabled, but the routes were not removed and the affected LSPs consequently lost their paths.

We rely on OSPF between routers to advertise loopback addresses; LSPs and iBGP sessions are configured between those loopback addresses. It transpired that OSPF on one of our Manchester routers was failing to learn new routes and was advertising stale ones. OSPF was therefore restarted there, which removed many of the stale routes but did not help it learn new ones. This led to the severe network instability any customer with traffic transiting Manchester would have experienced, and to the loss of VPLS service between Slough and Manchester. It further transpired that the second router in Manchester was exhibiting the same behaviour, and OSPF was restarted there too.
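
To illustrate the dependency described above, here is a minimal sketch, in Python rather than router configuration, of why a loopback that OSPF fails to learn takes the LSPs and iBGP sessions built on it down with it. All addresses and site mappings below are hypothetical and purely illustrative.

```python
# Toy model: LSPs and iBGP sessions are anchored to loopback addresses, so a
# loopback that OSPF fails to learn (or holds stale) leaves every session
# built on it without a usable path. All addresses below are made up.

ospf_learned_loopbacks = {
    "10.0.0.1/32",   # hypothetical Slough loopback
    "10.0.0.2/32",   # hypothetical London loopback
    # "10.0.0.3/32"  # hypothetical Manchester loopback - not learned by OSPF
}

# Loopback-to-loopback sessions (LSPs / iBGP), as described in the text
sessions = [
    ("10.0.0.1/32", "10.0.0.2/32"),   # Slough <-> London
    ("10.0.0.1/32", "10.0.0.3/32"),   # Slough <-> Manchester
]

for src, dst in sessions:
    up = src in ospf_learned_loopbacks and dst in ospf_learned_loopbacks
    status = "up" if up else "down - endpoint loopback not in OSPF"
    print(f"{src} <-> {dst}: {status}")
```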

Given the instability we proceeded, wrongly with the benefit of hindsight, to restart OSPF on other routers around the network. This did not stabilise the situation; it merely caused further intermittent instability between other sites.

When we identified that the Manchester routers were not learning OSPF routes we added the missing routes manually, progressively restoring stability.

Manchester is now configured entirely without OSPF route learning, but the network is stable. Packet loss on the ring has been eliminated by cleaning the fibres.

Fully resolving this will require scheduled maintenance and a sequential reload of both routers in Manchester. This will be performed out of hours and will be scheduled as soon as possible.

As to affected customers, earlier advice suggested that those using hostnames were unaffected, but this was overly simplistic. Indeed, many customers were forcing traffic to the Manchester IP address and so benefited from neither SRV failover on their own equipment nor our change of the A record. Whether other customers were affected depended on the extent to which their traffic transited the network. Customers entering in Slough or London and having their call completed there were unaffected, whilst customers sending traffic to one of those sites but transiting Manchester in some way would have been. The rolling OSPF restarts would have affected transit between any two sites at the time the affected routers were restarted. We again encourage customers to configure using the appropriate hostname, to ensure their equipment supports SRV failover, and to choose as their primary the first site their traffic reaches on our network; there is no value in transiting one site to force traffic to another, although this kind of issue is highly extraordinary.
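
As a purely illustrative aid, the sketch below shows how SRV failover ordering works, using the third-party dnspython library. The SRV name is a placeholder, not a published Simwood record; the real hostnames are in our Interoperability Information.

```python
# A sketch of SRV-based failover ordering using dnspython (pip install dnspython).
# Lower priority is tried first; among equal priorities, higher weight is
# preferred (a simplification of RFC 2782's weighted random selection).
# Replace the placeholder name with the SRV record from the interop docs.
import dns.resolver

srv_name = "_sip._udp.example.invalid"   # placeholder, not a real record

answers = dns.resolver.resolve(srv_name, "SRV")
ordered = sorted(answers, key=lambda r: (r.priority, -r.weight))

for record in ordered:
    # Equipment supporting SRV failover attempts each target in this order,
    # moving on to the next entry when the current one stops responding.
    print(f"try {record.target}:{record.port} "
          f"(priority {record.priority}, weight {record.weight})")
```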

For our part, this has highlighted inadequacies in our internal monitoring that undoubtedly obscured the problem. We were consequently slower than we should have been in pinpointing the source of the trouble in order to mitigate it. We operate multiple edge sites in order to prevent this kind of issue, and indeed haven't had one since 2010, but the fact that an issue in one site can affect others, by virtue of us operating a backbone in between, does raise questions over the value of doing so and highlights the need for greater separation. The two routers in Manchester are running the same version of router firmware, which is distinct from that in other sites; we will investigate the possibility of a bug not present in other versions.

Our statistics suggest that completed call volumes during the affected period were approximately 40% lower than in the same period yesterday. Some customers were unaffected, a few were entirely affected, and others saw intermittent behaviour.

We've prided ourselves on our stability and benefitted greatly from regular outages elsewhere. We therefore take this extremely seriously and will be ironing out the issues identified, probably involving a vendor change. Sorry again.

Posted Feb 17, 2016 - 19:34 UTC

Resolved
The network has been stable for a number of hours now and therefore the Manchester proxy has been re-enabled.

The comment below was overly simplistic in categorising affected customers; the imminent post-mortem will go into greater detail about what occurred and the further work required.

We apologise again to any customer affected.
Posted Feb 17, 2016 - 18:25 UTC
Update
It appears that customers configured as advised in our Interoperability Information (i.e. using hostnames, and sending traffic to hostnames) were unaffected.

We do note that a number of customers were not configured this way and were sending traffic directly to IP addresses within the network. These customers will have been more severely impacted, as these calls were failing.

We would strongly recommend that customers configure their equipment to use either the generic FQDN, or a location specific one, as outlined in our interoperability information.
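
As an illustration only, the following standard-library Python sketch shows the practical difference: equipment configured with an FQDN re-resolves it and follows any change to the record, whereas a hard-coded IP address does not. The hostname below is a placeholder, not a real Simwood FQDN.

```python
# Minimal illustration: resolve a configured FQDN rather than pinning an IP.
# The hostname is a placeholder; use the FQDN from the interoperability
# information. A device that re-resolves the name picks up record changes;
# a hard-coded IP address keeps pointing at the old destination.
import socket

fqdn = "sip.example.invalid"   # placeholder hostname

try:
    for *_, sockaddr in socket.getaddrinfo(fqdn, 5060, type=socket.SOCK_DGRAM):
        print(f"{fqdn} currently resolves to {sockaddr[0]}")
except socket.gaierror as exc:
    print(f"could not resolve {fqdn}: {exc}")
```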

More information will be published as it becomes available.

Once again, we apologise for any inconvenience this caused and thank you for your patience.
Posted Feb 17, 2016 - 15:08 UTC
Monitoring
Normal service has now been restored; however, Manchester remains running at reduced capacity whilst we investigate further.
Posted Feb 17, 2016 - 14:35 UTC
Identified
The fault appears to be confined to our Manchester site. Traffic is being re-routed and normal service should be restored shortly.
Posted Feb 17, 2016 - 14:17 UTC
Update
As a result of the network issue, please note that our Operations Desk number is unreachable. The Operations Desk can be contacted by raising a ticket via https://support.simwood.com/ or by email to team@simwood.com.

We apologise for any inconvenience.
Posted Feb 17, 2016 - 13:42 UTC
Investigating
We are aware of a network issue that may result in some service degradation.

Engineers are investigating as a priority, and we will provide more information as soon as possible.

Please accept our apologies for any inconvenience caused.
Posted Feb 17, 2016 - 13:23 UTC