Microsoft’s Multi-Factor Authentication (MFA) is down again for some customers. Microsoft confirmed the issue on its status page and said that impacted customers may encounter timeout errors. Azure engineers are aware of the issue and are actively investigating mitigation options.
Just last week, Microsoft’s Multi-Factor Authentication service went down for several hours, blocking millions of users from accessing Office 365, Azure, Dynamics and other services that use Azure Active Directory for authentication. Microsoft recently posted the following root cause analysis for that outage.
There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes, which caused an extended mitigation time. The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded, which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.
1. The first root cause manifested as a latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger the second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes, which can trigger additional latency and the third root cause (below) on the MFA backend.
3. The third identified root cause was a previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes an accumulation of processes on the MFA backend, leading to resource exhaustion on the backend, at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.
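To illustrate the third root cause, here is a minimal Python sketch of how a server can exhaust its capacity while a liveness-only check still reports it as healthy. This is not Microsoft's implementation: the `Backend` class, the `max_workers` limit, the `headroom` parameter and both health checks are hypothetical names, chosen only to show the difference between checking that a process is running and checking that it can still accept work.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for a backend server (an assumption, not Microsoft's design)."""
    max_workers: int            # assumed capacity limit
    in_flight: int = 0          # requests accepted but never completed
    process_alive: bool = True  # what a liveness-only monitor would look at

    def accept(self) -> bool:
        """Accept a request if capacity remains; return False once exhausted."""
        if self.in_flight >= self.max_workers:
            return False
        self.in_flight += 1
        return True

    def shallow_health(self) -> bool:
        """Liveness only: stays True even when no capacity is left."""
        return self.process_alive

    def deep_health(self, headroom: float = 0.1) -> bool:
        """Healthy only if the process is alive and some capacity remains."""
        free = self.max_workers - self.in_flight
        return self.process_alive and free >= self.max_workers * headroom


def simulate(requests: int, max_workers: int) -> Backend:
    """Send requests that never complete, mimicking work piling up on the backend."""
    backend = Backend(max_workers=max_workers)
    for _ in range(requests):
        backend.accept()
    return backend


if __name__ == "__main__":
    b = simulate(requests=100, max_workers=50)
    print("shallow:", b.shallow_health())  # liveness check still passes
    print("deep:", b.deep_health())        # capacity-aware check fails
    print("accepts new work:", b.accept())
```

In this toy model the exhausted backend rejects every new request, yet only the capacity-aware check reports a problem. That is the kind of gap the analysis describes as "otherwise appearing healthy in our monitoring".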
Microsoft also said it will take the following steps to avoid such issues in the future:
- Review our update deployment procedures to better identify similar issues during our development and testing cycles (completion by Dec 2018)
- Review the monitoring services to identify ways to reduce detection time and quickly restore service (completion by Dec 2018)
- Review our containment process to avoid propagating an issue to other datacenters (completion by Jan 2019)
- Update communications process to the Service Health Dashboard and monitoring tools to detect publishing issues immediately during incidents (completion by Dec 2018)
We will update this post with the latest information on today’s outage once Microsoft provides it.
Update from Microsoft:
CURRENT MITIGATION: Engineers are currently in the process of cycling backend services responsible for processing MFA requests. This mitigation step is being rolled out region by region with a number of regions already completed. Engineers are reassessing impact after each region completes. Engineers have also determined a Domain Name System (DNS) issue caused sign-in requests to fail, but this issue is mitigated and engineers are restarting the authentication infrastructure.
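The region-by-region approach described in the update, applying a step and reassessing impact before moving on, can be sketched in Python as follows. This is only an illustration under stated assumptions: `staged_rollout`, `apply_step`, `is_healthy` and the region names are hypothetical and do not reflect Microsoft's tooling.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    regions: Iterable[str],
    apply_step: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply a mitigation to one region at a time and reassess after each.

    Stops at the first region that is unhealthy after the step, so a bad
    change is not propagated to the remaining regions. Returns the regions
    that completed successfully.
    """
    completed: List[str] = []
    for region in regions:
        apply_step(region)
        if not is_healthy(region):
            break  # halt and leave the remaining regions untouched
        completed.append(region)
    return completed


if __name__ == "__main__":
    touched: List[str] = []
    done = staged_rollout(
        regions=["region-a", "region-b", "region-c"],  # hypothetical names
        apply_step=touched.append,                     # stand-in for a restart
        is_healthy=lambda r: r != "region-b",          # pretend region-b regresses
    )
    print("completed:", done)
    print("touched:", touched)
```

In this example the rollout stops after the second region fails its check, so the third region is never touched.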