Database cluster issues

Incident Report for Simwood

Resolved

Billing has fully caught up. Thanks for your patience.

Posted Oct 09, 2019 - 00:18 UTC

Monitoring

Failover is largely complete and CDRs are now being processed.

Posted Oct 08, 2019 - 16:46 UTC

Update

We are about to commence failover to the standby cluster as this query rollback is showing no signs of concluding.

Once failover is complete we'll mark this incident as 'monitoring'. There are several million CDRs to catch up on, so we will leave it unresolved until they have been processed.

Posted Oct 08, 2019 - 15:55 UTC

Update

This remains ongoing but we are making progress.

The offending query remains on one node and is still rolling back. Unfortunately, the rollback is less efficient than the operation that caused the problem in the first place. Note this is not an issue with the query per se (a single-row delete) but an internal Galera issue triggered by it. Until this rollback completes, the cluster remains effectively write-locked but serviceable for reads.

We know why this happened and how to prevent it in future, and we have backup nodes with current data ready to take over should we decide to fail over from the existing cluster. As we have no idea how long the trigger query will take to roll back on the final node, we have held off failing over in anger in the hope that it may finish soon, but we cannot delay indefinitely.

Call traffic remains unaffected and our ops team have been handling the most urgent customer issues, such as locked balances. We will therefore continue monitoring and update here should anything change.

Posted Oct 08, 2019 - 12:05 UTC

Identified

Whilst call traffic is not affected, we are presently unable to write to our primary database cluster. This is due to an overnight job triggering a bug. The query will eventually complete, but we currently have no way of determining how long that will take. In the meantime, we are investigating more invasive options.

In the interim, this means portal, API and administration options which would normally update the database (e.g. billing, number allocation and pre-pay top-ups) are delayed or non-functional.

We're sorry for any impact this will have but, to repeat, call traffic is not affected.

Posted Oct 08, 2019 - 07:17 UTC

This incident affected: API, Portal, and Operations Desk.