Microsoft Azure is a reliable cloud service with 99.995 percent average uptime over the past 12 months. But some Azure services experienced significant incidents that impacted millions of customers around the world. The three major incidents were a datacenter outage in the South Central US region in September 2018, Azure Active Directory (Azure AD) Multi-Factor Authentication (MFA) challenges in November 2018, and DNS maintenance issues in May 2019. Azure CTO Mark Russinovich yesterday published a blog post discussing the ways Microsoft is trying to further improve the reliability of Azure services.
First of all, Microsoft has formed a new Quality Engineering team within the Azure CTO office. This team will work with the Site Reliability Engineering (SRE) team to find new approaches to deliver an even more reliable cloud platform. Russinovich described some of the initiatives Microsoft has started to improve Azure reliability:
- Safe deployment practices – Azure approaches change automation through a safe deployment practice framework which aims to ensure that all code and configuration changes go through a cycle of specific stages. These stages include dev/test, staging, private previews, a hardware diversity pilot, and longer validation periods before a broader rollout to region pairs. This has dramatically reduced the risk that software changes will have negative impacts, and we are extending this mechanism to include software-defined infrastructure changes, such as networking and DNS.
- Storage-account level failover – During the September 2018 datacenter outage, several storage stamps were physically damaged, requiring their immediate shutdown. Because it is our policy to prioritize data retention over time-to-restore, we chose to endure a longer outage to ensure that we could restore all customer data successfully. A number of you have told us that you want more flexibility to make this decision for your own organizations, so we are empowering customers by previewing the ability to initiate your own failover at the storage-account level.
- Expanding availability zones – Today, we have availability zones live in the 10 largest Azure regions, providing an additional reliability option for the majority of our customers. We also have work underway to bring availability zones to the next 10 largest Azure regions between now and 2021.
- Project Tardigrade – At Build last month, I discussed Project Tardigrade, a new Azure service named after the nearly indestructible microscopic animals also known as water bears. This effort will detect hardware failures or memory leaks that can lead to operating system crashes just before they occur, so that Azure can then freeze virtual machines for a few seconds so the workloads can be moved to a healthy host.
- Low- to zero-impact maintenance – We’re investing in improving zero-impact and low-impact update technologies including hot patching, live migration, and in-place migration. We’ve deployed dozens of security and reliability patches to host infrastructure in the past year, many of which were implemented with no customer impact or downtime. We continue to invest in these technologies to bring their benefits to even more Azure services.
- Fault injection and stress testing – Validating that systems will perform as designed in the face of failures is possible only by subjecting them to those failures. We’re increasingly fault injecting our services before they go to production, both at a small scale with service-specific load stress and failures, and at regional and availability zone (AZ) scale with full region and AZ failure drills in our private canary regions. Our plan is to eventually make these fault injection services available to customers so that they can perform the same validation on their own applications and services.
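To make the safe deployment idea in the list above concrete, here is a minimal Python sketch of a staged rollout that halts at the first stage failing a health check. It is an illustration only, not Azure's actual tooling: the stage names are taken from the description above, while `staged_rollout` and the `is_healthy` callback are hypothetical names introduced for this example.

```python
from typing import Callable, List

# Stage names follow the stages listed in the article; the order is assumed
# to match the order in which they are mentioned.
STAGES: List[str] = [
    "dev/test",
    "staging",
    "private preview",
    "hardware diversity pilot",
    "region pairs",
]


def staged_rollout(stages: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Advance a change through the stages in order.

    Stops at the first stage whose health check fails and returns the
    stages that passed, so a bad change never reaches later stages.
    `is_healthy` is a hypothetical callback standing in for real validation.
    """
    completed: List[str] = []
    for stage in stages:
        if not is_healthy(stage):
            break  # halt the rollout; later stages are never touched
        completed.append(stage)
    return completed


if __name__ == "__main__":
    # Simulate a change that misbehaves on the hardware diversity pilot.
    passed = staged_rollout(STAGES, lambda stage: stage != "hardware diversity pilot")
    print(passed)  # ['dev/test', 'staging', 'private preview']
```

The point of the design is that each stage acts as a gate: a failure limits the impact to the stages already reached instead of every region at once.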
“Reliability is and continues to be a core tenet of our trusted cloud commitments, alongside compliance, security, privacy, and transparency. Across all these areas, we know that customer trust is earned and must be maintained, not just by saying the right thing but by doing the right thing,” wrote Mark Russinovich.