Galera cluster failure
Incident Report for Simwood
Resolved
The cluster has been stable for 24 hours.
Posted Mar 17, 2019 - 13:39 UTC
Update
The cluster is fully operational now and being monitored.
Posted Mar 16, 2019 - 12:47 UTC
Update
Things should now be returning to normal.

The London node managed to sync before a repeat hardware failure. It has now been retired from service, which would have happened last night had it not been the most recently written database node in the cluster. A new node is syncing now and will join the cluster shortly.

We will continue to monitor this, as the root cause was of course the Slough node failing (in software) and restarting, which it could do again.

To repeat, call routing does not depend on the Galera database and was unaffected.
Posted Mar 16, 2019 - 11:20 UTC
Monitoring
The main Galera cluster is now partly operational, recovered with zero data loss on one node (Manchester). Slough has also recovered to the same point but is now acting as a donor for the recovered hardware in London, which will be syncing for some time.
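
For illustration, a minimal sketch of how those node states can be read from Galera's standard wsrep status variables, written in Python with the pymysql client; the hostnames and credentials below are placeholders rather than our production configuration:

```python
# Minimal sketch: poll each Galera node's wsrep status to see which node is
# Synced, which is acting as a Donor, and which is still Joining.
# Hostnames and credentials are placeholders, not production values.
import pymysql

NODES = ["db-manchester.example", "db-slough.example", "db-london.example"]
WSREP_VARS = ("wsrep_local_state_comment", "wsrep_cluster_size", "wsrep_cluster_status")

def node_state(host):
    """Return selected wsrep status variables for one node."""
    conn = pymysql.connect(host=host, user="monitor", password="secret", connect_timeout=5)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep%'")
            status = dict(cur.fetchall())  # {variable_name: value}
        return {var: status.get(var, "unknown") for var in WSREP_VARS}
    finally:
        conn.close()

if __name__ == "__main__":
    for host in NODES:
        try:
            print(host, node_state(host))
        except pymysql.MySQLError as exc:
            # An offline node, or one locked mid state-transfer, may not answer at all.
            print(host, "unreachable:", exc)
```

A node reporting Donor/Desynced or Joining here is still a cluster member but is not suitable for normal traffic, which is why only one node is fully in service at the moment.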

We therefore resumed processing against the cluster in this state 15 minutes ago, so customers will see API and portal features that require the database coming back to life. Please note, though, that we have a few million unwritten CDRs from overnight to process, which is going to take some time.

The cluster remains at risk, as a further failure would cause the operational node to lock again. Our restore to a new server continues but will hopefully prove unnecessary.

Call routing remains unaffected as it does not depend on the Galera cluster.
Posted Mar 16, 2019 - 10:51 UTC
Update
Database recovery is going well. We have two streams of work:

1) recovering from backups to a new database server
2) recovering the Galera cluster.

In respect of the second stream, the affected hardware is back online and that node is waiting to join the cluster at the appropriate time.

Looking at progress, we anticipate positive news in the next 2-3 hours.
Posted Mar 16, 2019 - 09:51 UTC
Update
Earlier this evening our Galera instance in Slough crashed and restarted, causing a full resync with the Manchester node, which was therefore also locked and unavailable for use. The third instance, in London, remained in service and would have been the source from which the other two pulled interim updates once they had finished syncing with each other.
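
To illustrate why a node that is mid state transfer counts as unavailable: it no longer reports its local state as Synced, and a joining node will typically also report wsrep_ready as OFF, so applications should route around it. Below is a minimal readiness-check sketch in Python using pymysql; the hostname and credentials are placeholders, not our production setup:

```python
# Minimal sketch: readiness check before routing API/portal queries to a node.
# A node receiving a state transfer, or donating one, no longer reports its
# local state as Synced, so it is skipped by the application.
# Hostname and credentials are placeholders, not production values.
import pymysql

def is_serving(host):
    """True only if the node is Synced and ready to accept queries."""
    try:
        conn = pymysql.connect(host=host, user="monitor", password="secret", connect_timeout=3)
    except pymysql.MySQLError:
        return False  # node offline or refusing connections
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('wsrep_ready', 'wsrep_local_state_comment')"
            )
            status = dict(cur.fetchall())
        return (status.get("wsrep_ready") == "ON"
                and status.get("wsrep_local_state_comment") == "Synced")
    finally:
        conn.close()
```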

Unfortunately, at around 11pm, the London node suffered a terminal hardware failure and went offline. The other two nodes remain in a state of syncing, which is going to take at least overnight. We have therefore instigated the recovery of a database backup to a new machine, the backup being more recent than the time at which the Manchester node was locked.

We will leave this parallel process running for now and take a view on which option represents the favoured recovery point when we're in a position to do so.

Meanwhile, call routing remains unaffected, but anything relying on the primary database, such as CDRs or number reconfiguration, is unfortunately unavailable.

This is an unprecedented and major issue and we appreciate your understanding.
Posted Mar 16, 2019 - 02:15 UTC
Identified
We have experienced a total failure of our Galera cluster. Call routing is unaffected but API and portal functions will be impaired.
Posted Mar 15, 2019 - 23:48 UTC
This incident affected: API and Portal.