Simwood Status
All Systems Operational
Component uptime over the past 90 days:

Voice: Operational, 99.99 % uptime
API and Portal: Operational, 99.47 % uptime
Availability Zones: Operational, 99.98 % uptime
  London: Operational, 99.97 % uptime
  Slough: Operational, 99.97 % uptime
  Manchester: Operational, 100.0 % uptime
Availability Zones (US): Operational, 99.97 % uptime
  San Jose (US West): Operational, 99.97 % uptime
  New York (US East): Operational, 99.97 % uptime
Operations Desk: Operational, 100.0 % uptime
Past Incidents
Mar 20, 2019

No incidents reported today.

Mar 19, 2019

No incidents reported.

Mar 18, 2019

No incidents reported.

Mar 17, 2019
Resolved - This has been stable for 24 hours.
Mar 17, 13:39 UTC
Update - The cluster is fully operational now and being monitored.
Mar 16, 12:47 UTC
Update - Things should be starting to look more normal now.

The London node managed to complete its sync before suffering a repeat hardware failure. It has now been retired from service, which would have happened last night had it not been the most recently written database node in the cluster. A new node is syncing now and will join the cluster shortly.

We will continue to monitor this as the root cause was of course the Slough node failing (in software) and restarting, which it could do again.

To repeat, call routing does not depend on the Galera database and was unaffected.
Mar 16, 11:20 UTC
Monitoring - The main Galera cluster is now partly operational: one node (Manchester) has recovered with zero data loss. Slough has also recovered to the same point but is now acting as a donor for the recovered hardware in London, which will be syncing for some time.

We therefore resumed processing against the cluster in this state 15 minutes ago, so customers will see API and portal features that require the database coming back to life. Please note, though, that we have a few million unwritten CDRs from overnight to process, which is going to take some time.

The cluster remains at risk as a further failure would cause the operational node to lock again. Our restore to a new server continues but will hopefully be unnecessary.

Call routing remains unaffected as it does not depend on the Galera cluster.
Mar 16, 10:51 UTC
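For context on the node states mentioned above (donor, syncing, synced), Galera exposes them through the database server's wsrep status variables. The sketch below is a generic illustration of checking those variables with PyMySQL; the hostnames and credentials are placeholders, and it is not Simwood's own tooling.

    # Illustrative check of a Galera node's sync state via the server's
    # wsrep status variables. Hostnames and credentials are placeholders.
    import pymysql  # pip install pymysql

    NODES = ["db-london.example", "db-slough.example", "db-manchester.example"]

    def node_state(host):
        """Fetch the wsrep status variables that describe cluster membership."""
        conn = pymysql.connect(host=host, user="monitor", password="secret",
                               connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_%'")
                status = dict(cur.fetchall())
        finally:
            conn.close()
        return {
            # Synced, Donor/Desynced, Joining, Joined, ...
            "state": status.get("wsrep_local_state_comment"),
            # How many nodes this node currently sees in the cluster.
            "cluster_size": status.get("wsrep_cluster_size"),
            # "Primary" means this component of the cluster can take writes.
            "cluster_status": status.get("wsrep_cluster_status"),
        }

    for host in NODES:
        try:
            print(host, node_state(host))
        except Exception as exc:  # node offline or still recovering
            print(host, "unreachable:", exc)

A node reporting Donor/Desynced, as Slough was here, is healthy but temporarily dedicated to feeding state to a joining node.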
Update - Database recovery is going well. We have two streams of work:

1) recovering from backups to a new database server
2) recovering the Galera cluster.

In respect of the second, the affected hardware is back online and that node is waiting to join the cluster at the appropriate time.

Looking at progress, we anticipate positive news in the next 2-3 hours.
Mar 16, 09:51 UTC
Update - Earlier this evening our Galera instance in Slough crashed and restarted, causing a full resync with the Manchester node, which was therefore also locked and unavailable for use. The third instance, in London, remained in service and would have been where the other two pulled interim updates from once they had finished syncing with each other.

Unfortunately, at around 11pm, the London node suffered a terminal hardware failure and went offline. The other two nodes remain in a state of syncing, which is going to take at least overnight. We have therefore initiated the recovery of a database backup to a new machine, the backup being more recent than the time at which the Manchester node was locked.

We will leave this parallel process running for now and take a view on which option represents the favoured recovery point when we are in a position to do so.

Meanwhile, call routing remains unaffected, but anything relying on the primary database, such as CDRs or number reconfiguration, is unfortunately unavailable.

This is an unprecedented and major issue and we appreciate your understanding.
Mar 16, 02:15 UTC
Identified - We have experienced a total failure of our Galera cluster. Call routing is unaffected but API and portal functions will be impaired.
Mar 15, 23:48 UTC
Mar 14, 2019
Resolved - This incident has been resolved.
Mar 14, 07:20 UTC
Update - We are continuing to monitor for any further issues.
Mar 14, 07:16 UTC
Monitoring - A fix has been implemented and we are monitoring the results.
Mar 14, 07:13 UTC
Identified - The issue has been identified and a fix is being implemented.
Mar 14, 07:08 UTC
Investigating - We are seeing disruption in New York, which we are investigating. All DNS has automatically been swung to San Jose, so customers configured in accordance with our interop should see no service impact.
Mar 14, 07:01 UTC
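The automatic swing described above works because clients are expected to re-resolve DNS rather than cache an address. As a rough illustration only (the service name is a placeholder, and the use of SRV records is an assumption rather than a statement of Simwood's interop requirements), the sketch below uses dnspython to resolve a SIP SRV record and order the returned targets the way a compliant client would:

    # Illustrative DNS failover check: resolve SRV records and order the
    # targets by priority, as a SIP client re-resolving DNS would.
    import dns.resolver  # pip install dnspython

    def sip_targets(service="_sip._udp.example.invalid"):
        """Return (priority, weight, host, port) tuples, preferred first."""
        answers = dns.resolver.resolve(service, "SRV")
        records = [(r.priority, r.weight, str(r.target).rstrip("."), r.port)
                   for r in answers]
        # Lower priority value wins; weight breaks ties between equals.
        return sorted(records, key=lambda r: (r[0], -r[1]))

    for prio, weight, host, port in sip_targets():
        print(f"priority={prio} weight={weight} -> {host}:{port}")

A client that re-resolves like this simply stops seeing the withdrawn site once the DNS change propagates, which is why correctly configured customers saw no impact.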
Mar 13, 2019
Completed - The scheduled maintenance has been completed.
Mar 13, 00:00 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Mar 12, 21:00 UTC
Scheduled - We will be migrating our primary queue server and primary database node to new instances. This does not affect call routing and thus should not affect service, but it will briefly delay CDRs and new number allocations.
Mar 11, 17:23 UTC
Mar 11, 2019

No incidents reported.

Mar 10, 2019

No incidents reported.

Mar 9, 2019

No incidents reported.

Mar 8, 2019

No incidents reported.

Mar 7, 2019

No incidents reported.

Mar 6, 2019

No incidents reported.