Summary

On Tuesday, June 28, 2022, a scheduled upgrade for Umbraco ID, the authentication service on Umbraco Cloud, was rolled out. Unfortunately, this resulted in sporadic periods where customers were unable to access the backoffice on Umbraco Cloud.

The affected component was a central internal API that orchestrates procedures between Umbraco ID and Azure B2C. In the two-week period between the start of the incident and its resolution, we counted 12 outages, each lasting between 4 and 20 minutes.

We apologize for the inconvenience that this has caused our customers, and assure you that we have taken steps to address this specific issue and are working on initiatives to ensure that this type of issue will not happen again.

Root Cause Analysis

In preparation for the upcoming regional hosting options for Umbraco Cloud, an update was rolled out to Umbraco ID including updates to the underlying infrastructure and a series of DNS changes.

After the release, a series of post-deployment tests were completed successfully.

The Umbraco Cloud support team reported an increase in tickets raised, specifically concerning backoffice authentication (login).

At the same time, we saw the execution count (requests processed) on Umbraco ID drop to 0, recover a couple of minutes later, and then continue processing requests as normal. The Umbraco Cloud status page was updated to notify our customers that there was an issue regarding authentication.
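The pattern described above, where the execution count drops to 0 and then recovers, can be located in per-minute telemetry with a simple scan. The following is a minimal sketch only: the function name and the sample counts are hypothetical and are not taken from our monitoring setup.

```python
from typing import List, Tuple


def find_zero_windows(counts_per_minute: List[int]) -> List[Tuple[int, int]]:
    """Return (start_index, length_in_minutes) for each run of zero-count minutes."""
    windows = []
    start = None
    for i, count in enumerate(counts_per_minute):
        if count == 0:
            # Open a new window on the first zero after a non-zero minute.
            if start is None:
                start = i
        elif start is not None:
            # Traffic recovered: close the current window.
            windows.append((start, i - start))
            start = None
    # Close a window that runs to the end of the series.
    if start is not None:
        windows.append((start, len(counts_per_minute) - start))
    return windows


# Hypothetical per-minute execution counts, for illustration only.
sample = [250, 240, 0, 0, 0, 0, 260, 300, 0, 0, 280]
print(find_zero_windows(sample))  # [(2, 4), (8, 2)]
```

Each tuple gives the minute at which processing stopped and how long it stayed at zero, which makes it easier to line the drops up against support tickets.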

Although we have a lot of telemetry on the Cloud services, we were unable to identify the cause of the issue, so we contacted our partners at Microsoft to help mitigate the problem. Together with support professionals at Microsoft, we investigated the following areas to help identify the problem:

CPU/Memory utilization
SNAT Port Exhaustion

While the investigation was ongoing, we replaced the infrastructure on which Umbraco ID was running and immediately started to see positive ingestion of traffic. Together with support professionals at Microsoft, we decided to lower the severity of the support case to reflect that there were no longer any business-critical components affected.

Late on Thursday, July 7, the issue reappeared, and this time we responded immediately by reaching out to Microsoft. The intermittent nature of the issue still made it hard to pinpoint the exact cause, but we started by investigating further:

Throttling by external third-party dependencies (Azure AD) and slow dependencies on internal services.
CPU/Memory utilization: Though there were no signs of resource starvation, we increased the size of the infrastructure on which the component was running. We also added fallback resources to cover scenarios where our primary services would fail.

The underlying component that handles all communication between Umbraco ID and Azure AD is an Azure Function. We see an average of 200 to 300 requests per minute, and up to 2000 requests per minute at peak times. When the resource detects that functions are executing slowly, it will auto-heal and recycle the application, which is what we see in the figure above. The timing of these recycles corresponded well with the timing of the incidents.

So why did the response time slowly deteriorate?

From here, we did a memory allocation analysis and discovered that memory cleanup (garbage collection) was running very frequently and causing slow response times. That in turn triggered the auto-heal functionality and recycled the underlying App Service Plan. Additionally, we found that the component was running with the default setting of 32-bit processing, which would explain why the application was struggling. A 32-bit process can only address a limited amount of memory, and given the volume of traffic, the garbage collector simply could not keep up within those memory restrictions.
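To put the 32-bit limitation in numbers, the arithmetic below shows the theoretical address space for each pointer width. This is an illustrative sketch, not a measurement from our platform: the memory a process can actually use is lower than the theoretical figure and depends on the operating system and the hosting plan.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def address_space_bytes(pointer_bits: int) -> int:
    """Number of distinct byte addresses a pointer of the given width can express."""
    return 2 ** pointer_bits


# Theoretical upper bounds only; practical per-process limits are lower.
print(address_space_bytes(32) // GIB)  # 4
print(address_space_bytes(64) // GIB)  # 17179869184
```

With at most 4 GiB of addressable memory, a busy 32-bit process has little headroom, so the garbage collector has to run more often to stay within it.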

Actions based on the root cause analysis

First and foremost, we identified that the default processing architecture on Azure Functions was a limiting factor and have changed it to 64-bit. This immediately allows for more efficient memory management. Additionally, we have applied the same change across our services to ensure we do not encounter the same challenges elsewhere.

To avoid similar issues in the future, we are implementing policies for Azure Functions that ensure 64-bit processing is the default for any new infrastructure added to the platform.
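A policy like this can be complemented by a simple audit over exported app configurations. The sketch below is an assumption about how such a check could look, not a description of our actual tooling: the function name, the app names, and the input data are hypothetical, and the `use32BitWorkerProcess` key is assumed to mirror the site configuration property of the same name in Azure App Service.

```python
from typing import Dict, List


def find_32bit_apps(site_configs: Dict[str, dict]) -> List[str]:
    """Return the names of apps whose config still requests a 32-bit worker process."""
    return sorted(
        name
        for name, cfg in site_configs.items()
        # Assumption: a missing flag is treated as 32-bit, since that was the default we hit.
        if cfg.get("use32BitWorkerProcess", True)
    )


# Hypothetical app names and settings, for illustration only.
sample_configs = {
    "app-a": {"use32BitWorkerProcess": True},
    "app-b": {"use32BitWorkerProcess": False},
    "app-c": {},
}
print(find_32bit_apps(sample_configs))  # ['app-a', 'app-c']
```

Treating a missing flag as non-compliant is a deliberately strict choice, so newly added infrastructure is flagged until 64-bit processing is set explicitly.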

To ensure that we always have resources available, we have increased the replication and fallback resources as a contingency should our primary compute unit fail.

If you have any questions related to the above, please feel free to contact your partner manager, or reach out through our support channels or the Umbraco Cloud issue tracker on GitHub: https://github.com/umbraco/Umbraco.Cloud.Issues/issues

Posted Jul 22, 2022 - 14:04 CEST

Resolved

This incident has been resolved.

Posted Jul 01, 2022 - 08:02 CEST

Update

For the last 15 hours, we have seen stable operations. We are continuing to actively monitor for any issues, and we have partnered with the Microsoft Support Engineering team to figure out the root cause of these issues.

Posted Jun 30, 2022 - 11:46 CEST

Update

We have not seen any critical issues for the last 12 hours, but we are getting reports from a subset of customers that they need to clear their browser cache and cookies. We are continually monitoring the ongoing issue.

Posted Jun 29, 2022 - 11:23 CEST

Monitoring

A critical bug in one of our main components was discovered. We have successfully released a fix and are seeing positive ingestion of traffic. We'll keep monitoring for regression.

Posted Jun 28, 2022 - 21:46 CEST

Identified

After the aforementioned fix, we are still experiencing a regression of services. We are in contact with our partners at Microsoft to help resolve the issue. Sorry for the inconvenience.

Posted Jun 28, 2022 - 14:10 CEST

Monitoring

A fix has been implemented and we are actively monitoring the result.

Posted Jun 28, 2022 - 12:04 CEST

Identified

The issue has been identified and our engineering teams are working on a fix.

Posted Jun 28, 2022 - 11:54 CEST

Investigating

We are currently investigating an issue affecting a subset of customers who are unable to log in to the backoffice using Umbraco ID.

Posted Jun 28, 2022 - 11:23 CEST

This incident affected: Umbraco Id.