Whilst we have very few credible examples here, and all of them demonstrate non-compliance with our interop, we have been able to investigate this issue and have pushed some code changes.
We use anycast at every level of our stack: the micro-services consumed by voice nodes are anycasted, and the Redis nodes consumed by those micro-services are similarly anycasted. It is, therefore, incumbent upon call-routing nodes to monitor the health of the services they're consuming, and to fail over should the IP address respond but the service not be available. We've found that at certain times, usually after the Redis master has performed a backup, the Redis slaves which are consumed by call-routing show a slight increase in latency. This increase was slight (sub-second), but the tolerances before failover were too tightly set. This caused call-routing to return a failure response, forcing a lookup against a backup [unicast] instance. However, this response was malformed but valid, causing the voice routing node to actually fail the call and our edge proxy to try another. Further, that voice routing node would be taken out of service for a few seconds, causing something of a cascade which manifested in increased PDD.
In the first instance, we have pushed a change which prevents the false-positive trigger here, i.e. more tolerance of latency increases and a properly formed failure response. We have, however, tasked further improvements to prevent the increase in latency in the first place.
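As an illustration only, the idea of tolerating a brief latency increase before failing over can be sketched as below. This is not our production code: the class name, the latency budget and the consecutive-probe threshold are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthTracker:
    """Decides when a consumed service should be failed away from.

    All names and thresholds here are hypothetical, for illustration.
    """
    latency_budget_s: float = 1.0   # assumed tolerance per probe
    max_consecutive_slow: int = 3   # assumed number of bad probes tolerated
    consecutive_slow: int = 0

    def record(self, latency_s: Optional[float]) -> bool:
        """Record one probe result; None means the probe got no answer.

        Returns True when the caller should fail over.
        """
        if latency_s is None or latency_s > self.latency_budget_s:
            self.consecutive_slow += 1
        else:
            # A single healthy probe resets the count, so a short
            # latency blip does not accumulate towards failover.
            self.consecutive_slow = 0
        return self.consecutive_slow >= self.max_consecutive_slow
```

With a threshold of one slow probe, a sub-second blip would trigger failover immediately; requiring several consecutive slow probes is one way to avoid that kind of false positive, at the cost of reacting slightly later to a genuine outage.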
Lastly, we do need to highlight that this was only present in one particular site, which anyone conforming to our interop would not have been sending traffic to. The root cause of traffic ending up here in many cases appears to be very old versions of Asterisk which do not respect DNS TTL and will continue to cache a hostname until restarted. Others are hardcoding IP addresses. Whilst there was limited scope for some inbound calls to have been affected, customers who were sending outbound traffic according to our interop, using equipment which respects DNS record expiry, were unaffected.
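For anyone unsure what respecting DNS record expiry means in practice, here is a minimal sketch of a cache that re-resolves a hostname once its TTL has passed. It is illustrative only: the `TtlCache` class, the injected `resolve` function and the example hostname and addresses are assumptions, not part of any real resolver library or of our platform.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Caches hostname lookups only for as long as their TTL allows."""

    def __init__(
        self,
        resolve: Callable[[str], Tuple[str, float]],  # returns (address, ttl_seconds)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)

    def lookup(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now < entry[1]:
            return entry[0]  # record still fresh, reuse it
        # Record missing or expired: resolve again rather than keeping
        # the stale address until the process is restarted.
        address, ttl = self._resolve(host)
        self._entries[host] = (address, now + ttl)
        return address


# Usage with a fake resolver and clock (documentation-range addresses).
answers = iter([("192.0.2.1", 60.0), ("192.0.2.2", 60.0)])
fake_now = [0.0]
cache = TtlCache(lambda host: next(answers), clock=lambda: fake_now[0])
first = cache.lookup("sip.example.com")
fake_now[0] = 30.0
cached = cache.lookup("sip.example.com")
fake_now[0] = 61.0
refreshed = cache.lookup("sip.example.com")
```

A client that skips the expiry check, or hardcodes an IP address, keeps sending traffic to the old address after the record has changed.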
Posted Sep 17, 2019 - 16:19 UTC
We are still investigating this but are pleased to say it was short-lived, and we have no examples since 12:35 BST. We have the grand sum of 7 example calls with PCAPs, after stripping out other issues such as invalid numbers or unrelated interop issues. We are working through those, along with our own telemetry and monitoring, but so far we have not found the cause. As an aside, all remaining example calls were forced to our London site, owing either to stale DNS or to non-use of FQDNs.
Posted Sep 17, 2019 - 13:26 UTC
Whilst aggregate volumes look normal, some customers have reported high PDD or timeouts on certain calls. We're investigating.