Elevated PDD

Incident Report for Simwood

Resolved

At 14h44 we were notified of the failure of a primary database node in London (Volta). This is a failure scenario we plan for and, as designed, service failed over cleanly to a candidate replacement in Slough (LD4). At 14h50 our call monitoring reported increased PDD (Post Dial Delay) from some parts of the network. This was because several call-routing nodes, which had previously been slaves to the failed master, were resyncing and were therefore unavailable for service. In this scenario, call-routing fails over to other back-up instances, which it did. Depending on the precise local state at the time of the call, this can increase PDD. The first node had resynced by 14h58, and by 15h03 the last node had fully resynced and PDD had returned to normal levels everywhere.

Our monitoring shows that fewer than 15% of calls network-wide were impacted by increased PDD, but customer experiences may vary according to their own timers and failover protocols. We are, however, investigating the utilisation of the backup routing instances, which, whilst it did not affect customer experience, was not as evenly distributed as designed.

Posted Feb 22, 2023 - 15:00 UTC