Microsoft Details Recent Outage, Failure In A Caching Service Caused The Issue

After restoring its web and mobile services, Microsoft today detailed the cause of the outage and how it plans to avoid a repeat in the future. Here is the full response from the service status website:

Update and Resolution of Recent Outage

We want to apologize to our customers who were affected by the outage this week. We have restored access to all accounts and have made changes so that the service will be more resilient in the future.

We realize that we have a responsibility to the customers who use our services to communicate and share with the people they care most about, and we apologize for letting those customers down this week. Our first priority is the health of the services, and we will learn from this incident and work to improve the experience of all our customers. As part of that, we would also like to provide more detail about what happened.

This incident was a result of a failure in a caching service that interfaces with devices using Exchange ActiveSync, including most smart phones. The failure caused these devices to receive an error and continuously try to connect to our service. This resulted in a flood of traffic that our services did not handle properly, with the effect that some customers were unable to access their email and unable to share their SkyDrive files via email.

In order to stabilize the overall email service, we temporarily blocked access via Exchange ActiveSync. This allowed us to restore access via the web and restore the sharing features of SkyDrive. These parts of the service were fully stabilized within a few hours of the initial incident.

A significant backlog of Exchange ActiveSync requests accumulated as we worked to stabilize access. To avoid another flood of traffic, we needed to restore access to Exchange ActiveSync slowly, which meant that some customers remained impacted for a longer period of time.

We have learned from this incident, and have made two key changes to harden our systems against future failure – one that involved increasing network bandwidth in the affected part of the system, and one that involved changing the way error handling is done for devices using Exchange ActiveSync.

We will continue to monitor the system and make additional changes as needed to keep the service healthy. We are now fully through the backlog and have restored service so all customers should have normal access from all of their devices.

We want to apologize to everyone who was affected by the outage, and we appreciate the patience you have shown us as we worked through the issues.
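
The statement does not say how error handling was changed, so the following is only an illustration of the failure mode it describes, not Microsoft's actual fix. When many clients retry immediately after an error, they can create exactly the kind of traffic flood mentioned above. A common general-purpose mitigation is capped exponential backoff with random jitter. The sketch below is in Python; the names `backoff_delays`, `sync_with_retry` and `sync_once` are hypothetical and do not come from Exchange ActiveSync or any Microsoft API.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=300.0, rng=None):
    """Yield a wait time in seconds before each retry.

    The upper bound doubles on every retry up to `cap`, and the actual
    wait is drawn uniformly from [0, bound] so that clients which failed
    at the same moment do not all reconnect at the same moment.
    """
    if rng is None:
        rng = random.Random()
    for attempt in range(max_retries):
        bound = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, bound)


def sync_with_retry(sync_once, max_attempts=6, sleep=time.sleep):
    """Call the hypothetical `sync_once` until it succeeds or attempts run out."""
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            return sync_once()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error instead of looping forever
            sleep(next(delays))
```

Compared with retrying in a tight loop, each failing client here waits progressively longer and at a randomized moment, which spreads reconnection attempts over time instead of concentrating them on a recovering service.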
