Database cluster issues
Incident Report for Simwood
Resolved
Billing has fully caught up. Thanks for your patience.
Posted 7 days ago. Oct 09, 2019 - 00:18 UTC
Monitoring
Failover is largely complete and CDRs are now being processed.
Posted 7 days ago. Oct 08, 2019 - 16:46 UTC
Update
We are about to commence failover to the standby cluster as this query rollback is showing no signs of concluding.

Once this is concluded, we'll mark this incident as 'monitoring'. There are several million CDRs to catch up on, so we will leave it unresolved until they are processed.
Posted 7 days ago. Oct 08, 2019 - 15:55 UTC
Update
This remains ongoing but we are making progress.

The offending query remains on one node and is still rolling back. Unfortunately, rolling back is less efficient than the problem it caused in the first place. Note this is not an issue with the query per se (a single-row delete) but an internal Galera issue triggered by it. Until this rollback completes, the cluster remains effectively write-locked but serviceable for reads.

We know why this happened and how to prevent it going forwards, and we have backup nodes with current data ready to take over should we decide to fail over from the existing cluster. As we have no idea whatsoever how long the trigger query will take to roll back on the final node, we have held off failing over in anger in the hope that it may be soon, but we cannot delay indefinitely.

Call traffic remains unaffected and our ops team have been handling most urgent customer issues such as locked balances. We will therefore continue monitoring and update here should anything change.
Posted 7 days ago. Oct 08, 2019 - 12:05 UTC
Identified
Whilst not affecting call traffic, we are presently unable to write to our primary database cluster. This is due to an overnight job triggering a bug. The query will eventually work through but we have no way presently of determining how long that will take. We are meanwhile investigating more invasive options.

In the interim, this means portal, API and administration options which would normally update the database (e.g. billing, number allocation and pre-pay top-ups) are delayed or non-functional.

We're sorry for any impact this will have but, to repeat, call traffic is not affected.
Posted 8 days ago. Oct 08, 2019 - 07:17 UTC
This incident affected: API, Portal, and Operations Desk.