Microsoft has published details on the cause of the outage of its cloud services. Due to a failure in Microsoft’s Azure cloud, a large number of users were unable to access applications and services hosted on the platform on Wednesday morning, January 25th. Affected services included the widely used collaboration tool Teams as well as other Microsoft 365 applications such as Outlook, Word, and Excel in their cloud-based variants.
Between 08:05 and 13:43 GMT on Wednesday, customers experienced connectivity issues resulting in high latency, packet loss, and timeouts when accessing Azure cloud resources. On the day of the outage, Microsoft initially named only a network change as the cause; rolling it back resolved the problem. A preliminary post-incident report from Microsoft now provides further details.
The cause was a planned change to a WAN router. According to the company from Redmond, an IP address on the router was to be changed. The command sent to the router for this change caused messages to be sent to all routers in the WAN, which in turn triggered a recalculation of forwarding information (adjacency and forwarding tables) on the control plane. Microsoft does not say whether these were regular BGP updates. While this recalculation was in progress, the routers could not correctly forward the packets passing through them. The preliminary report does not yet reveal whether this was merely a load problem or actual misrouting.
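The mechanism described above can be illustrated with a toy model. This sketch is purely illustrative (it is not Microsoft's system and makes no claim about the actual protocol involved): a single command floods an update to every router in the WAN, and each router drops packets while its forwarding tables are being rebuilt.

```python
# Toy model: one router's update is flooded to all WAN routers, and each
# router cannot reliably forward packets while its adjacency/forwarding
# tables are being recomputed on the control plane.

class Router:
    def __init__(self, name):
        self.name = name
        self.recomputing = False  # True while forwarding tables rebuild

    def receive_update(self):
        # Any topology message forces a control-plane recomputation.
        self.recomputing = True

    def finish_recompute(self):
        self.recomputing = False

    def forward(self, packet):
        # During recomputation the data plane cannot be trusted.
        if self.recomputing:
            return None  # packet lost or delayed
        return packet

wan = [Router(f"r{i}") for i in range(4)]

# The problematic command on one router triggers an update on every router.
for r in wan:
    r.receive_update()

lost = sum(1 for r in wan if r.forward("probe") is None)
print(f"{lost}/{len(wan)} routers dropping packets during recomputation")
```

Once each router's recomputation finishes (`finish_recompute`), forwarding resumes, which matches the report's description of connectivity returning gradually rather than all at once.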
Error due to lack of quality control, exemplary response
The command that caused the problem behaves differently on different routers. It had not gone through the full qualification process on the router platform where it was executed, a classic failure of network automation quality control. Moreover, not only north/south traffic between clients and Azure was affected, but also connectivity between Azure regions and connections via ExpressRoute.
The company’s response, however, was exemplary. Microsoft noticed the DNS and WAN errors just seven minutes after the failure began and reviewed the changes made shortly beforehand. An automated recovery process started in the network about an hour after the outage began, and the last network component resumed its function at 10:35 a.m. However, the WAN failure also took out the automation systems that monitor the network and automatically decommission malfunctioning components. As a result, packets continued to be lost until 1:43 p.m., and many routers still required a manual restart, true to the motto “a reboot works wonders”.
Conclusion: follow-up action
Mistakes can happen, but you have to learn from them. Microsoft has, as a first step, blocked commands with a large impact and subjected all command executions to its “safe change guidelines”. The final review of the incident is to be published within fourteen days of the outage.
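What such a gate might look like can be sketched in a few lines. This is a hypothetical illustration, not Microsoft's actual tooling: the platform list, threshold, and function names are invented, but the idea mirrors the two stated safeguards, blocking commands that are unqualified for a platform or that would touch too much of the network.

```python
# Hypothetical "safe change guideline" gate (illustrative only): a change
# is blocked if its command was never qualified on the target platform,
# or if its estimated blast radius exceeds a review threshold.

QUALIFIED_PLATFORMS = {"router-os-a"}  # platforms the command was tested on
MAX_BLAST_RADIUS = 10                  # max routers a change may touch unreviewed

def allow_change(platform, affected_routers):
    if platform not in QUALIFIED_PLATFORMS:
        return False, "command not qualified on this router platform"
    if affected_routers > MAX_BLAST_RADIUS:
        return False, "blast radius too large, manual review required"
    return True, "change permitted"

# A change like the incident's: unqualified platform, WAN-wide effect.
ok, reason = allow_change("router-os-b", affected_routers=1800)
print(ok, reason)
```

Either check alone would have stopped the incident's change: the command was unqualified on its platform, and its effect spanned the entire WAN.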