This outage was caused by a component failure in the SAN for this THN stack. The stack should have been capable of surviving such a failure with only a very minimal interruption to service.
In this case the live servers came up on warm standby hardware, but due to the size of some of the images this took longer than we would have liked. We have identified the bottleneck and taken action across the entire platform so that this automated failover process completes more quickly in future for the handful of customers who were affected by the slow recovery.
The faulty hardware was replaced the same day to ensure ongoing resiliency.
Apologies also for the delay in publishing this postmortem. I was away last week and wanted to make sure I fully understood the incident and had evaluated the engineering team’s fix before publishing.