Microsoft currently offers a 99.9% SLA for Azure AD user authentication. Yesterday, the company announced that it will raise this to 99.99% uptime. The public SLA will be updated on April 1, 2021 to reflect the change, and Microsoft will share its roadmap for the next generation of resilience investments for Azure AD and Azure AD B2C in early 2021.
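To put the two figures in perspective, the arithmetic can be sketched as below. This is a rough illustration that assumes a flat 30-day month and treats availability as simple wall-clock uptime. Microsoft's actual SLA terms define how availability is measured, and the function name `allowed_downtime_minutes` is made up for this example.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Maximum downtime, in minutes, that still meets the given availability.

    Assumption: availability is plain wall-clock uptime over `period_days`.
    The real SLA may measure it differently.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% SLA allows {minutes:.2f} minutes of downtime per 30 days")
```

Under those assumptions, moving from 99.9% to 99.99% cuts the permitted downtime from roughly 43 minutes to roughly 4 minutes per month.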
Azure AD is a huge service with more than 400 million Monthly Active Users (MAU), and it processes tens of billions of authentications per day. Increasing the reliability of an online service at this scale is a challenging task. Microsoft describes the improvements it has made over the past few months as follows:
- We’ve made strong progress on moving the authentication services to a fine-grained fault domain isolation model, also called “cellularized architecture”. This architecture is designed to scope and isolate the impact of many classes of failures to a small percentage of total users in the system. In the last year, we’ve increased the number of fault domains by over 5x and will continue to evolve this further over the next year.
- We have begun rollout of an Azure AD Backup Authentication service that runs with decorrelated failure modes from the primary Azure AD system. This backup service transparently and automatically handles authentications for participating workloads as an additional layer of resilience on top of the multiple levels of redundancy in Azure AD. You can think of this as a backup generator or uninterruptible power supply (UPS) designed to provide additional fault tolerance while staying completely transparent and automatic to you. At present, Outlook Web Access and SharePoint Online are integrated with this system. We will roll out the protections across critical Microsoft apps and services over the next few quarters.
- For Azure infrastructure authentication, our managed identities for Azure resources are now transparently integrated with regional authentication endpoints. These regional endpoints provide significant additional layers of resilience and protection, even in the event of an outage in the primary Azure AD authentication system.
- We’ve continued to make investments in the scalability and elasticity of the service. These investments were proven out during the early days of the COVID crisis, when we saw surging growth in demand. We were able to seamlessly scale what is already the world’s largest enterprise authentication system without impact. This included not just aggregate growth but very rapid onboarding, including entire nations moving their school systems (millions of users) online overnight.
- We are rolling out innovations to the authentication system such as the Continuous Access Evaluation Protocol (CAE) for critical Microsoft 365 services. CAE both improves security by providing instant enforcement of policy changes and improves resilience by securely providing longer token lifetimes.
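The cellularized architecture mentioned in the first item above can be illustrated with a small sketch. Microsoft does not say how users are assigned to fault domains, so the hash-based assignment here is purely an assumption, as are the function names `cell_for` and `blast_radius`, the synthetic user IDs, and the cell counts of 10 and 50 (chosen only to mirror the 5x increase). The point is that when each user is pinned to one cell, the loss of a single cell affects roughly 1/N of users.

```python
import hashlib


def cell_for(user_id: str, num_cells: int) -> int:
    """Deterministically map a user to one of `num_cells` fault domains.

    Hypothetical scheme: hash the user ID and take it modulo the cell count.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def blast_radius(users: list[str], num_cells: int, failed_cell: int = 0) -> float:
    """Fraction of users whose cell is the one that failed."""
    affected = sum(1 for u in users if cell_for(u, num_cells) == failed_cell)
    return affected / len(users)


if __name__ == "__main__":
    # Synthetic population for illustration only.
    users = [f"user{i}@example.com" for i in range(10_000)]
    for cells in (10, 50):
        share = blast_radius(users, cells)
        print(f"{cells} cells: one failed cell affects {share:.1%} of users")
```

With five times as many cells, a single-cell failure touches about one fifth as many users, which is the effect the isolation model is meant to achieve.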