This site has been stable for several hours, and has now been returned to full normal service.
The cause of this issue was two severely loaded voice processing nodes which, whilst responding to health checks normally, were slower than usual to respond to call set-ups. This increased post-dial delay (PDD) for calls hitting either node, and increased it greatly in the less frequent cases where a call failed over from one loaded node to the other. Once both nodes were removed from service, new calls were distributed successfully across the remaining nodes, which were performing normally. The true root cause of the excess load is unknown but, given that calls are load balanced across all nodes and the issue was not capacity related, we suspect a bug in the underlying code.
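The failure mode described above, where a node passes its health check yet answers call set-ups slowly, can be illustrated with a minimal sketch. It assumes hypothetical node names, made-up latency figures and an arbitrary 500 ms threshold, and it does not describe the production platform or its monitoring.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class NodeStats:
    name: str                  # hypothetical node label
    health_check_ok: bool      # result of the ordinary liveness check
    setup_latencies_ms: list   # recent call set-up response times (illustrative)


def eligible_nodes(nodes, max_median_setup_ms=500.0):
    """Return names of nodes that pass the health check AND answer set-ups promptly.

    The threshold is an assumption chosen for illustration only.
    """
    eligible = []
    for node in nodes:
        if not node.health_check_ok:
            continue  # ordinary health-check failure
        if node.setup_latencies_ms and median(node.setup_latencies_ms) > max_median_setup_ms:
            continue  # healthy on paper, but too slow to set up calls
        eligible.append(node.name)
    return eligible


# Every node passes its health check; two are slow to answer set-ups.
nodes = [
    NodeStats("node-a", True, [120, 140, 110]),
    NodeStats("node-b", True, [2900, 3100, 3300]),
    NodeStats("node-c", True, [3500, 2800, 4000]),
    NodeStats("node-d", True, [130, 125, 150]),
]
print(eligible_nodes(nodes))  # ['node-a', 'node-d']
```

A liveness check alone would keep all four nodes in rotation; adding a set-up latency condition removes the two loaded nodes, which also prevents failover from one slow node to the other.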
This only affected some calls to our London site, and DNS was rapidly failed over to other sites. However, we continued to see large flows of traffic to the London site after the change, suggesting either that traffic was being forced to a specific IP address or that customer equipment was not reflecting the DNS change. Customers are again reminded of our interop guidance, which ensures that an issue at a single site should not affect call completion.
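As a rough illustration of equipment that does reflect a DNS change, the sketch below re-resolves a hostname once its cached answer is older than a fixed age instead of pinning an IP address. The class name, the hostname `sip.example.com`, the 60-second maximum age and the documentation-range addresses are all assumptions for the example. Python's `socket.getaddrinfo` does not expose record TTLs, so the fixed age stands in for one, and real SIP equipment would normally locate servers through NAPTR and SRV lookups rather than a plain address lookup.

```python
import socket
import time


class ReResolvingTarget:
    """Resolve a hostname afresh once the cached answer is older than max_age_s."""

    def __init__(self, hostname, port, max_age_s=60.0, resolver=None, clock=time.monotonic):
        self.hostname = hostname
        self.port = port
        self.max_age_s = max_age_s  # stand-in for a DNS TTL (assumption)
        self._resolver = resolver or self._system_resolve
        self._clock = clock
        self._addresses = []
        self._resolved_at = None

    @staticmethod
    def _system_resolve(hostname, port):
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_UDP)
        # Keep the resolver's ordering while dropping duplicate addresses.
        return list(dict.fromkeys(info[4][0] for info in infos))

    def addresses(self):
        now = self._clock()
        if self._resolved_at is None or now - self._resolved_at >= self.max_age_s:
            self._addresses = self._resolver(self.hostname, self.port)
            self._resolved_at = now
        return self._addresses


# Demonstration with a fake resolver and fake clock, so no network is needed.
answers = iter([["192.0.2.10"], ["198.51.100.20"]])
fake_now = [0.0]
target = ReResolvingTarget(
    "sip.example.com", 5060, max_age_s=60.0,
    resolver=lambda host, port: next(answers),
    clock=lambda: fake_now[0],
)
first = target.addresses()      # first lookup
fake_now[0] = 30.0
cached = target.addresses()     # within max age: cached answer reused
fake_now[0] = 61.0
refreshed = target.addresses()  # past max age: the changed record is picked up
```

Equipment configured with a literal IP address, or with a cache that never expires, would keep sending traffic to the first address regardless of any DNS change.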
Posted 5 months ago. Feb 05, 2019 - 16:36 UTC
Normal service in the London site was restored at 10:05.
We are monitoring the situation and will restore normal DNS after testing is completed.
Posted 5 months ago. Feb 05, 2019 - 10:40 UTC
As this is limited to the London site, we have amended DNS to route outbound SIP traffic to alternative sites whilst investigation continues.
Posted 5 months ago. Feb 05, 2019 - 09:57 UTC
We are investigating reports of increased PDD, in some cases resulting in call timeouts, on outbound calls over our London site. We will update this status page as soon as further information becomes available.
Posted 5 months ago. Feb 05, 2019 - 09:47 UTC
This incident affected: Availability Zones (London) and Voice.