Network Issue THN
Incident Report for TelcoSwitch Group
Postmortem

This outage was caused by a component failure in the SAN for this THN stack, which should have been capable of surviving the failure with only a minimal interruption to service.

In this case the live servers came up on warm-standby hardware, but due to the size of some of the images this took longer than we would have liked. We have identified the bottleneck and taken action across the entire platform to ensure that this automated failover process completes more quickly in future, for the benefit of the handful of customers affected by the slow recovery.

The faulty hardware was immediately replaced the same day to ensure ongoing resiliency.

Apologies also for the delay in publishing this postmortem - I was away last week and wanted to make sure I fully understood the incident and had evaluated the engineering team's fix before publishing.

Posted Aug 20, 2019 - 11:41 BST

Resolved
This incident has been resolved.
Posted Aug 15, 2019 - 10:45 BST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 15, 2019 - 10:10 BST
Update
We have identified a hardware failure at THN. Automatic failover has taken place and all servers are coming back online.

All servers on the THN stack should be live shortly.

We are pleased the failover worked as it should; however, we apologise for any inconvenience. Our remote hands engineers will be on-site within the hour to swap out the affected hardware.
Posted Aug 15, 2019 - 10:08 BST
Identified
The issue has been identified. Failover to London VTA is taking place and should take no more than 8 minutes.
Posted Aug 15, 2019 - 09:51 BST
Investigating
We are currently investigating an outage at THN. Our engineers are looking into this urgently.
Posted Aug 15, 2019 - 09:45 BST
This incident affected: CallSwitch (Hosted Telephony Platform).