On November 18, Microsoft Azure customers experienced a service interruption that affected Azure Storage and a few other services, including Virtual Machines. Microsoft today posted its final root cause analysis, along with the steps it is taking to prevent similar incidents in the future. The outage was caused by errors made while deploying a configuration change to Azure Storage. Microsoft's account follows.
During this deployment, there were two operational errors:
1. The standard flighting deployment policy of incrementally deploying changes across small slices was not followed.
The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk. Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure.
2. Although validation in test and pre-production had been done against Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends.
Enabling this change on the Azure Blob storage Front-Ends exposed a bug which resulted in some Azure Blob storage Front-Ends entering an infinite loop and unable to service requests.
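The two errors above point at checks that deployment tooling could make before any change goes out. The following is only a minimal sketch of that idea, not Microsoft's tooling: the stage ceilings, the `ChangeRequest` fields, and the function names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a rollout request breaks the staged-deployment policy."""


@dataclass(frozen=True)
class ChangeRequest:
    name: str
    validated_for: frozenset  # service types the change was tested against
    target_service: str       # service type the operator wants to change
    target_fraction: float    # share of the fleet to enable, 0.0 to 1.0


# Hypothetical stage ceilings; a real policy would define its own slices.
STAGES = (0.01, 0.05, 0.25, 1.0)


def next_allowed_fraction(current_fraction: float) -> float:
    """Return the largest fleet share the next rollout step may reach."""
    for ceiling in STAGES:
        if ceiling > current_fraction:
            return ceiling
    return 1.0


def check_rollout(change: ChangeRequest, current_fraction: float) -> None:
    """Reject a step that targets an unvalidated service or skips a stage."""
    if change.target_service not in change.validated_for:
        raise PolicyViolation(
            f"{change.name}: not validated for {change.target_service}"
        )
    allowed = next_allowed_fraction(current_fraction)
    if change.target_fraction > allowed:
        raise PolicyViolation(
            f"{change.name}: requested {change.target_fraction:.0%}, "
            f"next stage allows at most {allowed:.0%}"
        )
```

In this sketch, a change validated only against Table storage Front-Ends is rejected when aimed at Blob storage Front-Ends, and a request to jump from a small slice to the whole fleet is rejected however long the change has already been running on that slice.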
Automated monitoring alerts notified our engineering team within minutes of the incident. We reverted the change globally within 30 minutes of the start of the issue which protected many Azure Blob storage Front-Ends from experiencing the issue. The Azure Blob storage Front-Ends which already entered the infinite loop were unable to accept any configuration changes due to the infinite loop. These required a restart after reverting the configuration change, extending the time to recover.
Microsoft listed the following areas for improvement:
We are committed to improving your experience with the Azure Platform and are making the following improvements:
Storage Service Interruption:
- Ensure that the deployment tools enforce the deployment protocol of applying standard production changes in incremental batches.
Virtual Machine Service Interruption:
- Improve resiliency and recovery to “slow-boot” scenarios for Windows and Linux VMs.
- Improve detection and recovery from Windows Setup provisioning failures due to storage incidents.
- Fix the Networking Service issue that caused networking programming errors for a subset of customers.
- Fix the Service Health Dashboard misconfiguration that led to incorrect header status.
- Implement new social media communication procedures to effectively communicate status using multiple mechanisms.
- Improve resiliency of the Service Health Dashboard and authoring tools.
- Improve resiliency of Microsoft Support automation tooling and infrastructure.
Azure Chief Technology Officer Mark Russinovich sits down with Channel 9 to provide customers with a look into what happened, along with how the team is actively working to improve customer experiences on the Azure platform as a result.
Read more about it in detail here.