Microsoft Details Root Cause Of The Recent Azure Cloud Service Outage

November 20, 2014

Azure Outage

Recently, Azure Storage services experienced a service interruption across the United States, Europe, and parts of Asia, which impacted multiple cloud services in these regions. Microsoft today apologized for the disruption and provided the root cause of the problem. The main cause was a performance update to Azure Storage that reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search, and other Microsoft services.

Summary

On November 19, 2014, Azure Storage Services were intermittently unavailable across the regions listed in “Affected Regions” below. Microsoft Azure services and customer services with a dependency on the affected Azure Storage Services were impacted as well, including the Service Health Dashboard and the Management Portal. The interruption was due to a bug triggered by a configuration change to the Azure Storage Front End component, which left the Blob Front-Ends unable to take traffic.

The configuration change had been introduced as part of an Azure Storage update intended to improve performance and reduce the CPU footprint of the Azure Table Front-Ends. The change had been deployed to some production clusters over the past few weeks and was performing as expected for the Table Front-Ends.

As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service.

Applying the configuration change to the Blob Front-Ends exposed a bug in the Blob Front-Ends that had not surfaced on the Table Front-Ends, where the change had been performing as expected. The bug caused the Blob Front-Ends to enter an infinite loop, preventing them from taking traffic.
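
The report does not include code, but the failure pattern is a familiar one: a setting that one code path handles correctly can drive another into a loop that never makes progress. Below is a minimal, hypothetical sketch in Python; the flag name, queue, and front-end types are invented for illustration and are not Azure's actual implementation.

```python
from collections import deque

def serve(frontend_type: str, use_new_cache_path: bool, requests: deque) -> int:
    """Drain the request queue and return the number of requests served."""
    served = 0
    while requests:
        if frontend_type == "blob" and use_new_cache_path:
            # Bug: the new code path peeks at the head of the queue but never
            # dequeues it, so the loop spins on one request forever and the
            # front-end stops taking traffic.
            _ = requests[0]
            continue
        requests.popleft()
        served += 1
    return served

# The flag had been running on Table Front-Ends for weeks without issue:
print(serve("table", True, deque(range(5))))    # -> 5
# Blob Front-Ends hit the infinite loop as soon as the flag was enabled:
# serve("blob", True, deque(range(5)))          # never returns
```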

Unfortunately, the issue was widespread: due to an operational error, the update was applied across most regions in a short period of time instead of following the standard protocol of rolling out production changes in incremental batches, a practice sketched below.
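
The standard protocol referred to here is often called flighting: a change lands on a small slice first, is allowed to bake, and is only widened once health checks pass. The sketch below illustrates that idea under invented names; it is not Azure's deployment tooling.

```python
import time

# Placeholder flight plan: internal cluster first, then one region, then
# progressively larger batches. Region names are not real Azure regions.
FLIGHTS = [
    ["test-cluster"],
    ["region-a"],
    ["region-b", "region-c"],
    ["region-d", "region-e", "region-f"],
]

def healthy(region: str) -> bool:
    # Placeholder health check; in practice this would inspect availability
    # and error-rate metrics for the region after the change lands.
    return True

def rollout(apply_change, bake_time_s: float = 0.1) -> None:
    for flight in FLIGHTS:
        for region in flight:
            apply_change(region)
        time.sleep(bake_time_s)  # let the change bake before widening scope
        if not all(healthy(r) for r in flight):
            raise RuntimeError(f"rollout halted: flight {flight} unhealthy")

rollout(lambda region: print(f"applied configuration change to {region}"))
```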

Once the issue was detected, the configuration change was reverted promptly. However, the Blob Front-Ends had already entered the infinite loop triggered by the update and could not refresh their configuration without a restart, which caused the recovery to take longer. The Azure team investigated and validated the mitigation steps. Once the mitigation steps were deployed, most customers began to see availability improve across regions at 11:45 AM UTC on November 19, 2014. A subset of customers using IaaS Virtual Machines (VMs) reported being unable to connect to their VMs, including through Remote Desktop Protocol (RDP) and SSH.
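
One way to picture why the recovery required restarts: a process stuck in the loop never re-reads its configuration, so reverting the change only takes effect once the process is restarted. The sketch below is purely illustrative, with an invented FrontEnd class and flag name.

```python
class FrontEnd:
    """Invented stand-in for a storage front-end process."""
    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = config
        # A process that picked up the bad flag is stuck in the loop.
        self.stuck = config.get("use_new_cache_path", False)

    def restart(self, config: dict) -> None:
        # A restart reloads configuration from scratch and clears the loop.
        self.config = config
        self.stuck = False

def mitigate(frontends: list, reverted_config: dict) -> None:
    for fe in frontends:
        if fe.stuck:
            # The revert alone is invisible to a stuck process; only a
            # restart makes it pick up the reverted configuration.
            fe.restart(reverted_config)
            print(f"{fe.name}: restarted with reverted configuration")

mitigate([FrontEnd("blob-fe-01", {"use_new_cache_path": True})],
         {"use_new_cache_path": False})
```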

Read more about it here.

{"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}