Some Customers experiencing Connectivity issues.

Incident Report for TelcoSwitch Group

Postmortem

Message from our CTO, Dan Lane.

As CTO, responsibility for network reliability and for providing a continuous service falls to me, and this past week has obviously not been up to the standards I'd expect. I'm going to give you an honest account of what's happened, with no excuses, shallow promises or marketing speak. If you would like clarification on anything, please don't hesitate to ask me questions directly.

Unfortunately, this morning we had another outage on our main connection. There is a legacy managed firewall on the local end that was due to be replaced as part of the larger work of replacing the entire connection. The supplier of that firewall issued a command to update the password for ISO compliance, and the firewall lost its entire configuration. I'm awaiting further reports on how this is even possible and why that maintenance was taking place during working hours on a weekday without any warning. Once again an engineer was dispatched to the datacentre, where there is a warm redundant standby for that firewall. It appears that the configuration on the warm standby hadn't been kept up to date with the primary, so it took considerably longer to restore service from backup. Although this firewall had a resilient warm standby, it had already been identified as a piece of legacy equipment not under our direct control, and replacement equipment had been ordered to introduce further resiliency and control.

Until the replacement equipment arrives and is installed, I have the following assurances from the managing director of the supplier:

I have given instructions that there should be no routine maintenance within working hours in the future, ever.
We will ensure that the 2nd firewall is powered up at all times, always with the latest config, so that the cables can be connected in an emergency.
We will also keep a USB key onsite with the latest config.
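
As an illustration of the second assurance (keeping the standby "always with the latest config"), here is a minimal sketch of how a standby's configuration could be compared against the primary's. It is not the supplier's tooling; the function names and the sample configuration lines are hypothetical, and it assumes both configurations are available as plain text.

```python
import hashlib


def config_fingerprint(config_text: str) -> str:
    """Return a SHA-256 digest of a config, ignoring blank lines and trailing whitespace."""
    lines = [line.rstrip() for line in config_text.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def standby_in_sync(primary_config: str, standby_config: str) -> bool:
    """True when the standby's config matches the primary's after normalisation."""
    return config_fingerprint(primary_config) == config_fingerprint(standby_config)


if __name__ == "__main__":
    # Made-up placeholder lines, not real firewall syntax.
    primary = "rule 1 allow voice\nrule 2 allow signalling\n"
    standby = "rule 1 allow voice\n"
    if not standby_in_sync(primary, standby):
        print("Standby config has drifted from the primary")
```

Run on a schedule, a check of this kind would flag drift before the standby is needed, rather than during an outage.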

So to summarise: we had a critical resilient link that was dependent on a supplier who performed routine maintenance that failed during business hours without notifying us in advance. While the link is resilient, the supplier was not. We had already identified this as an area of concern and were in the process of replacing it with a more independent solution that would prevent this issue from happening.

Ultimately the blame lies with us for blindly trusting that this large, well-known supplier would competently perform the tasks we paid them to perform. The project to reduce dependence on a single vendor and supplier is well underway, as are some temporary measures to ensure extra resiliency while that work takes place. We already have strict maintenance policies internally, and as a matter of caution I will be reviewing them to ensure they are adhered to.

So while we've had a very rough week, and I understand if your confidence in us may have dropped, I hope this explanation goes some way to restoring your faith in us as a company that is dedicated to providing a reliable service as we move forward.

If you would like to discuss anything about these outages or the impending upgrades to the network, I'm very happy to schedule a phone call or come out and see you.

Posted Jul 13, 2017 - 15:10 BST

Resolved

The issue has been resolved. We are currently writing a full postmortem which will be posted here within 2 hours.

Posted Jul 13, 2017 - 13:18 BST

Monitoring

Services have been up and running for the past 20 minutes.
We are monitoring the network. A full postmortem will follow.

Posted Jul 13, 2017 - 12:36 BST

Identified

We have identified a hardware failure on one of our Juniper firewalls. This is being replaced with the cold standby.

Our engineers are at the London Volta now.

Posted Jul 13, 2017 - 12:09 BST

Investigating

Our network engineers are looking into this now.

Posted Jul 13, 2017 - 11:24 BST