Yesterday, Microsoft’s cloud services suffered a major outage, which Microsoft attributed to an inability to perform authentication operations for Microsoft and third-party applications that depend on Azure Active Directory (Azure AD) for authentication.
The issue lasted an unprecedented 14 hours in total and spoiled the workday for many companies.
Today Microsoft posted a preliminary Root Cause Analysis, which blamed an error in the rotation of the keys that Azure AD uses for cryptographic signing operations in support of OpenID and other identity standard protocols.
Microsoft says that, as part of standard security hygiene, an automated system removes keys that are no longer in use on a time-based schedule. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug in which the automation incorrectly ignored that “retain” state and removed that particular key.
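To make the failure mode concrete, here is a minimal sketch of a time-based key cleanup that honors a “retain” flag. It is not Microsoft's implementation: the `SigningKey` type, the `keys_to_remove` function, the 30-day idle threshold, and the `honor_retain` switch are all hypothetical names and values chosen for illustration. Setting `honor_retain=False` mimics the kind of bug described above, where the retained key is swept up with the stale ones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SigningKey:
    key_id: str
    last_used: datetime
    retain: bool = False  # hypothetical flag: keep this key even if it looks idle


def keys_to_remove(keys, now, max_idle=timedelta(days=30), honor_retain=True):
    """Return the ids of keys idle for longer than max_idle.

    With honor_retain=False the retain flag is ignored, which is the
    class of bug the article describes (an assumption-based model only).
    """
    stale = []
    for key in keys:
        if honor_retain and key.retain:
            continue  # retained keys are never scheduled for removal
        if now - key.last_used > max_idle:
            stale.append(key.key_id)
    return stale


# Arbitrary illustrative timestamp, not the date of the incident.
now = datetime(2000, 1, 1, tzinfo=timezone.utc)
keys = [
    SigningKey("old", now - timedelta(days=90)),
    SigningKey("migration", now - timedelta(days=90), retain=True),
    SigningKey("current", now - timedelta(days=1)),
]

correct = keys_to_remove(keys, now)
buggy = keys_to_remove(keys, now, honor_retain=False)
print(correct)
print(buggy)
```

In the correct run only the genuinely stale key is selected; in the buggy run the retained migration key is selected as well.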
Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
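The trust decision described here can be sketched as a lookup of the token's key id in the published key set. This is a simplified model under stated assumptions: it uses the `kid` field that JSON Web Key Sets and JWT headers define, skips the actual cryptographic signature check, and the `is_trusted` helper and the key ids are hypothetical.

```python
def is_trusted(token_header, published_metadata):
    """Return True if the token's key id appears in the published key set.

    A real relying party would also verify the signature with the matching
    public key; that step is omitted in this sketch.
    """
    published_ids = {key["kid"] for key in published_metadata["keys"]}
    return token_header.get("kid") in published_ids


# Hypothetical metadata before and after the key was removed.
before_removal = {"keys": [{"kid": "current"}, {"kid": "migration"}]}
after_removal = {"keys": [{"kid": "current"}]}

# A token signed with the key that was later removed.
token_header = {"alg": "RS256", "kid": "migration"}

trusted_before = is_trusted(token_header, before_removal)
trusted_after = is_trusted(token_header, after_removal)
print(trusted_before, trusted_after)
```

The token itself does not change between the two checks; only the published metadata does, which is why applications began rejecting tokens as soon as they picked up the new metadata.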
The initial fix was simple. The key removal operation was identified as the cause, and the key metadata was rolled back to its prior state at 21:05 UTC. Unfortunately, a subset of Storage resources experienced residual impact due to cached metadata, and Microsoft needed to push an update to invalidate these entries and force a refresh. This process completed, and mitigation for the residually impacted customers was declared, at 09:25 UTC the following day.
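Why a rollback alone did not help every customer can be illustrated with a small cache model. This is a sketch, not a description of how Azure Storage caches metadata: the `MetadataCache` class, its 24-hour lifetime, and the key ids are assumptions made for the example.

```python
import time


class MetadataCache:
    """Caches the result of fetch() until the TTL expires or invalidate() is called."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means "nothing cached"

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self):
        # Force the next get() to fetch again, regardless of the TTL.
        self._fetched_at = None


# Hypothetical published state right after the erroneous key removal.
published = {"key_ids": ["current"]}
cache = MetadataCache(lambda: list(published["key_ids"]), ttl_seconds=24 * 3600)

cached_bad = cache.get()                          # caches the bad state
published["key_ids"] = ["current", "migration"]   # metadata rolled back at the source
still_bad = cache.get()                           # cache has not expired, so still stale
cache.invalidate()                                # pushed update forces a refresh
refreshed = cache.get()
print(cached_bad, still_bad, refreshed)
```

Until the entry is invalidated or expires, the consumer keeps serving the metadata it fetched while the key was missing, even though the source has already been corrected.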
Microsoft says it has processes in place to prevent this class of risk. However, while protections for adding a new key are already in place, the remove-key component is only scheduled to be finished by mid-year.
Microsoft apologized for the issue and says it is continuously taking steps to improve the Microsoft Azure Platform and its processes to help ensure such incidents do not occur in the future.
A full Root Cause Analysis investigation relating to this incident is still ongoing, and will be published when it is completed, or if any other substantive details emerge in the interim.
Read the full details at Microsoft here.