Network Issue THN
Incident Report for TelcoSwitch Group
Postmortem

This outage was caused by a component failure in the SAN for this THN stack, which should have been capable of surviving the failure with only a minimal interruption to service.

In this case the live servers came up on warm-standby hardware, but due to the size of some of the images this took longer than we would have liked. We have identified the bottleneck and taken action across the entire platform to ensure that this automated failover process completes more quickly in future, for the benefit of the handful of customers affected by the slow recovery.

The faulty hardware was immediately replaced the same day to ensure ongoing resiliency.

Apologies also for the delay in publishing this postmortem - I was away last week and wanted to make sure I fully understood the incident and had evaluated the engineering team's fix before publishing.

Posted Aug 20, 2019 - 11:41 BST

Resolved
This incident has been resolved.
Posted Aug 15, 2019 - 10:45 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 15, 2019 - 10:10 BST
Update
We have identified a hardware failure at THN. Automatic failover has taken place and all servers are coming back online.

All servers on the THN stack should be live shortly.

We are pleased the failover worked as it should; however, we apologise for any inconvenience. Our remote hands engineers will be on-site within the hour to swap out the affected hardware.
Posted Aug 15, 2019 - 10:08 BST
Identified
The issue has been identified. Failover to London VTA is taking place and should take no more than 8 minutes.
Posted Aug 15, 2019 - 09:51 BST
Investigating
We are currently investigating an outage at THN. Our engineers are looking into this urgently.
Posted Aug 15, 2019 - 09:45 BST
This incident affected: CallSwitch (Hosted Telephony Platform).