Umbraco Cloud sites unavailable

Incident Report for Umbraco Cloud

Postmortem

From Wednesday, February 27th at 23:00 CET to Friday, March 1st at 17:00 CET, Umbraco Cloud suffered a series of critical events that caused downtime and general instability for all sites hosted on the platform.

This postmortem outlines the events as they occurred, gives details on the root cause and explains what steps we are taking, based on these learnings, to improve and to prevent similar incidents.

Summary

The incident was first identified at 23:00 CET on the 27th and the general issue was partly resolved at midnight. However, the nature of the incident meant that recovery time for sites stretched into the morning of the 28th, with ~98% recovered at 06:00. At this point in time we began a root cause analysis, expecting that the incident was an isolated event.

At 06:45 CET on the 28th the issue repeated itself. At this point it was clear that this was not an isolated incident, and we immediately initiated an all-hands 24/7 schedule, began work on an escalated root cause analysis, called in all customer support staff and increased our information efforts to ensure that everyone affected was aware of the situation. We also called in additional external support and consultancy from our partners at Microsoft.

The second issue was partly resolved at 08:00 but recovery again stretched for several hours with ~90% recovered at 10:00.

At this point, the root cause was not clearly identified, but several work streams pointed in the direction of performance issues with a central component in our infrastructure orchestration services (see root cause below). From this point on, the services were continuously affected by the root cause in many different ways, causing general recovery to alternate between 75% and 98% until we finally reached our maintenance window on March 1st at 16:30 CET (see timeline below).

During the full period we suffered no fewer than 3 full outages and 9 partial outages of varying nature, causing a range of performance issues, error messages and timeouts for customer sites as they came back online or stopped responding correctly (see timeline below).

All these issues were resolved in the planned emergency maintenance window starting 16:30 CET on March 1st. At 17:30 CET ~98% of sites were fully recovered as we continued to apply additional improvements to the platform.

The maintenance closed as planned at 22:30 CET on the 1st of March. We maintained the 24/7 alert status during the whole weekend to ensure that we were available for anyone who reached out, and continued work to ensure all affected sites were properly recovered.

The incident officially closed on March 2nd at 14:18. At this point all sites, with very few exceptions handled on a 1:1 basis, were concluded to be working as expected.

Root Cause

The root cause of all these challenges was a severe performance issue with several tables in a database central to the orchestration and health service that generally controls the state of Umbraco Cloud. These tables all contain data collected from our systems and used for health checks, load balancing decisions and other automatic infrastructure decisions.

Once we had identified the root cause, we needed additional time to plan and test the improvements that we expected to carry out during maintenance, both to ensure that they would have a positive impact and to rule out any unwanted side effects.

During the maintenance window the database performance was corrected and a series of additional performance improvements were also carried out. This had the expected effect and finally resolved the issue.

Follow Up

Several steps are currently being carried out to improve the situation and prevent this type of incident in the future, including but not limited to:

Improved monitoring on central components of the orchestration service, including the database in question.

Full review of the database setup for additional performance gains, long term stability and better tolerance for failure.

A full review for performance and stability throughout Umbraco Cloud to limit the risk of another incident and limit the impact of any incidents that occur in the future.

A full review of our ability to provide our customers with insight into the status of our system during an incident.

Timeline

We apologize for the inconvenience this incident has caused you and your customers. We want to underline that our work to ensure stable operations on Umbraco Cloud does not end with this update but is ongoing and of the highest priority to us.

CTO - Jacob Midtgaard-Olesen

Umbraco HQ

Posted Mar 11, 2019 - 11:20 CET

Resolved

Update March 2nd. 14:18 CET

From Wednesday, February 27th, 23:24 CET until March 1st, 19:00 CET we had multiple incidents that took customers' Umbraco Cloud sites offline several times.

What did we do?
In short, we stopped all other activities and everyone at Umbraco HQ joined in, either contributing directly to solving the problem or getting information out to the affected customers and helping them where possible.

Additionally, we brought in suppliers and external consultants to help ensure that no stone was left unturned in the process of solving the problems.

A maintenance window was announced at short notice to allow us to do what was necessary to bring stability to the system.

Current status
Overall, since March 1st, 19:00 CET the system has been performing well and as it did prior to the incident, except for a few individual sites that have not come back online as expected. We have been working on those and they are now back in normal operation.

We have seen individual sites being restarted due to balancing of the system. This is part of a normal situation.

Monitoring is heightened and we continue our extended monitoring and support throughout the weekend.

What’s next
Now the analysis will begin and we will be presenting a post-mortem in due course.

We will follow the status of the system closely, which is why we have increased our monitoring. This includes staff on duty around the clock.

Should you experience problems with your site, please reach out to our support at contact@umbraco.com. It is manned around the clock.

We apologize for the problems this incident has caused for you and your customers.

Best regards - The Umbraco HQ team

Posted Mar 02, 2019 - 14:18 CET

Monitoring

Update March 2nd. 11:55 CET

Operations have been stable since our last update and most sites are running as they should.

We are continuously working on any sites facing issues that could potentially be related to this incident.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com. Then we will make sure to look into your case as soon as possible to have all sites running as they should.

The incident remains open until we are certain operations are back to normal and sites are running as expected.

Next update no later than 15:00 CET.

Posted Mar 02, 2019 - 11:55 CET

Update

Update March 2nd. 08:55 CET

Operations are continuing to look normal and most sites are running as expected. It is still possible that some sites will briefly become unavailable when load is being balanced, which is part of normal operations.

We will keep this incident open as long as we are actively working on any sites that are facing issues potentially related to this incident.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 12:00 CET.

Posted Mar 02, 2019 - 08:55 CET

Update

Update March 2nd. 06:55 CET

Operations are still looking normal and most sites are running as they should.

It will still be possible to see sites briefly becoming unavailable when load is being balanced, which is part of normal operations.

This issue will remain active until we are certain operations are back to normal.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 09:00 CET.

Posted Mar 02, 2019 - 06:55 CET

Update

Update March 2nd. 04:55 CET

We have seen normal operations since the last update and most sites are running as they should.

The situation is unchanged regarding sites that may be down briefly while load is being balanced. Even though this is normal, we also realize that in the current situation anything that looks like a performance issue can cause alarm.

We will keep this issue active until we are certain operations are back to normal.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 07:00 CET.

Posted Mar 02, 2019 - 04:55 CET

Update

Update March 2nd. 01:55 CET

Operations are still looking normal and most sites are running as expected.

We will leave this incident open as long as we are working on any sites facing issues that could potentially be related to this incident. There is still a chance that you will experience your site being down for a short period as we complete our balancing for performance.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 05:00 CET.

Posted Mar 02, 2019 - 01:55 CET

Update

Update 23:55 CET

Operations continue to look normal after the maintenance and most sites are running as they should.

We will keep this incident open as we work on any sites facing issues. You might experience your site being down for a short period as we complete our balancing for performance.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 02:00 CET.

Posted Mar 01, 2019 - 23:55 CET

Identified

Update 22:35 CET

The maintenance has been successfully completed and we are seeing positive impacts as a result.

Incident remains open as we work on any sites facing issues. You might experience your site being down for a short period as we complete our balancing for performance.

Should you have any problems with your site, please reach out to our support at contact@umbraco.com.

Next update no later than 00:00 CET.

Posted Mar 01, 2019 - 22:35 CET

Monitoring

Update 22:00 CET

The maintenance was completed as expected. We are currently monitoring and validating the impact of the changes we applied.

Please reach out to our support at contact@umbraco.com should you experience any issues with your site at this point.

We will give you the next update no later than 23:00 CET.

Posted Mar 01, 2019 - 22:01 CET

Update

Update 21:00 CET

Nothing has changed since our last update at 20:00 CET.

The progress with the Umbraco Cloud maintenance continues as expected. If you are experiencing any problems, please reach out to us on contact@umbraco.com

Next update will be here no later than 22:00 CET.

Posted Mar 01, 2019 - 21:01 CET

Update

Update 20:00 CET

The maintenance is still progressing as expected. We are only seeing a limited number of sites currently affected. If you are experiencing anything different, please reach out to us at contact@umbraco.com

Next update will be here no later than 21:00 CET.

Posted Mar 01, 2019 - 19:55 CET

Update

Update 19:00 CET

The maintenance is still progressing as expected. Currently, only a limited number of sites are affected. If you are experiencing anything different, please reach out to us at contact@umbraco.com

Next update will be here no later than 20:00 CET.

Posted Mar 01, 2019 - 18:54 CET

Update

Update 18:00 CET

The maintenance is progressing as expected. Currently, we are experiencing that only a limited number of sites are affected by service interruptions.

Next update will be here no later than 19:00 CET.

Posted Mar 01, 2019 - 17:57 CET

Update

Update 17:00 CET

Scheduled maintenance is now in progress and will continue until 22:30 CET.

As announced, during the maintenance you’ll see a number of service interruptions (websites offline). Rest assured that we only do this when necessary.

Next update will be here no later than 18:00 CET.

Posted Mar 01, 2019 - 16:58 CET

Update

Update 16:00 CET

We are still seeing unavailable Umbraco Cloud sites. The announced scheduled maintenance will start in 30 minutes at 16:30 CET. This maintenance may not be the only fix needed but it is a necessary step in order to make progress in solving the issue.

We are working hard to mitigate the effects and will continue to keep you updated here on an hourly basis.

Next update will be here no later than 17:00 CET.

Posted Mar 01, 2019 - 15:59 CET

Update

Update 14:38 CET

We have now identified what we believe is the root cause of the ongoing Umbraco Cloud issue and have to call an extraordinary maintenance window in order to be able to continue our work on a solution.

This might not be the only fix needed but it is a necessary step in order to make progress in solving the issue.

This maintenance will take place on March 1st at 16:30 - 22:30 CET. We will continue to update you here during the maintenance. You can read the full summary of the maintenance below this incident.

We will give you the next update no later than 16:00 CET.

Posted Mar 01, 2019 - 14:39 CET

Update

Update 14:00 CET

We are still seeing unavailable Umbraco Cloud sites. We are working hard to mitigate the effects and will continue to keep you updated here on an hourly basis.

Next update is no later than 15:00 CET.

Posted Mar 01, 2019 - 13:55 CET

Update

Update 13:00 CET

Since our last update at 12:00 CET the situation remains the same.

Some customers might still experience partial downtime during this period.

We continue working closely with the Microsoft Azure team and have called in external resources to help us solve the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Please read the statement on the matter by Umbraco management: https://umbraco.com/blog/management-statement-umbraco-cloud-incident/

We will update you as the situation changes, no later than 14:00 CET.

Posted Mar 01, 2019 - 12:55 CET

Update

Update 12:00 CET

The situation hasn’t changed since our last update at 11:00 CET.

Some customers might experience partial downtime during this period.

We are still working closely with the Microsoft Azure team and have called in external resources to help us solve the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Please read the statement on the matter by Umbraco management: https://umbraco.com/blog/management-statement-umbraco-cloud-incident/

We will give you the next update no later than 13:00 CET.

Posted Mar 01, 2019 - 11:55 CET

Update

Update 11:00 CET

Since we last updated at 10:00 CET, the situation remains the same.

Some customers might experience partial downtime during this period.

We are working closely with the Microsoft Azure team and have called in external resources to help us solve the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Please read the statement on the matter by Umbraco management: https://umbraco.com/blog/management-statement-umbraco-cloud-incident/

We will update you as the situation changes, no later than 12:00 CET.

Posted Mar 01, 2019 - 10:55 CET

Update

Update 10:00 CET

The situation is the same since we last updated at 09:00 CET.

Some customers will experience partial downtime during this period.

We continue to collaborate with the Microsoft Azure team and have called in external resources to help us solve the issues.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Please read the statement on the matter by Umbraco management: https://umbraco.com/blog/management-statement-umbraco-cloud-incident/

We will update you as the situation changes, no later than 11:00 CET.

Posted Mar 01, 2019 - 10:00 CET

Update

Update 09:00 CET

Since our last update at 08:00 CET the situation remains the same.

Some customers will experience partial downtime during this period.

We continue to collaborate with the Microsoft Azure team and have called in external resources to help us solve the issues.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 10:00 CET.

Posted Mar 01, 2019 - 08:56 CET

Update

The situation is still the same as when we last updated at 06:00 CET.

Some customers will experience partial downtime during this period.

We are continuing our collaboration with the Microsoft Azure team and have called in external resources to help us solve the issues.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 09:00 CET.

Posted Mar 01, 2019 - 07:59 CET

Update

Update 06:00 CET

Since our update at 05:00 CET the situation hasn't changed.
Some customers will experience partial downtime during this period.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 08:00 CET.

Posted Mar 01, 2019 - 06:00 CET

Update

Update 05:00 CET

Since our update at 04:00 CET the situation hasn't changed.
Some customers will experience partial downtime during this period.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 06:00 CET.

Posted Mar 01, 2019 - 05:02 CET

Update

Update 04:00 CET:

Since our update at 02:50 CET sites have been unavailable multiple times.
Most customers will experience their sites becoming available at this point in time, but this can unfortunately still change as we continue our efforts to resolve the issue.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 05:00 CET.

Posted Mar 01, 2019 - 04:00 CET

Update

We are currently experiencing additional site unavailability challenges and are working to resolve them.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 04:00 CET.

Posted Mar 01, 2019 - 02:50 CET

Identified

01:00 CET:

Since our update at 23:00 CET some customers did experience sites being unavailable.
Most customers will experience their sites being available at this point in time, but this can unfortunately still change as we continue our efforts to resolve the issue.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

We will update you as the situation changes, no later than 03:00 CET.

Posted Mar 01, 2019 - 01:01 CET

Update

23:00 CET: Since our update at 20:00 CET the situation hasn't changed. During this time, some customers may again experience their sites being unavailable as we continue to work on reducing the general impact of the incident.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Unless the situation changes, we will update you again no later than 01:00 CET.

Posted Feb 28, 2019 - 23:00 CET

Update

20:00 CET: Since our last update at 18:00 CET the situation hasn’t changed.

We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore the stability of Umbraco Cloud.

Unless the situation changes, we will update you again no later than 23:00 CET.

Posted Feb 28, 2019 - 19:50 CET

Update

18:00 CET: Since our last update at 17:00 CET the situation hasn’t changed. We are still working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore stability of Umbraco Cloud.

Unless the situation changes, we will update you again no later than 20:00 CET.

Posted Feb 28, 2019 - 17:57 CET

Update

17:00 CET: There are still Cloud sites that are unavailable and work continues to fix the issues.

We have identified a possible cause of this issue. We are working closely together with the Microsoft Azure team who are actively assisting us in reducing impact and solving the issue.

Perhaps needless to say, but we have stopped all other activities a long time ago in order to exclusively focus on this issue. We won’t stop this until it’s over; hence, we have taken the necessary steps for us to work as long as needed, even throughout the night.

We apologize for any inconvenience you and your customers might have experienced and rest assured we will continue to work hard in order to restore stability of the Umbraco Cloud platform.

Next update will be no later than 18:00 CET

Posted Feb 28, 2019 - 16:58 CET

Update

There are still Cloud sites that are unavailable and work continues to fix the issues. We will update you here and by email notifications, which are now turned back on. Sorry about not having those enabled for the last updates. Next update is no later than 17:00 CET

Posted Feb 28, 2019 - 15:57 CET

Update

We are still seeing unavailable Cloud sites. We are working to mitigate the effects and will keep you updated here at least once per hour. Next update is no later than 16:00 CET

Posted Feb 28, 2019 - 14:55 CET

Update

We are still seeing unavailable Cloud sites. We are working to mitigate the effects and will keep you updated here at least once per hour. Next update is no later than 15:00 CET

Posted Feb 28, 2019 - 14:40 CET

Update

Operations continue to look normal and most sites are running as they should. We are monitoring the status closely and are in touch with anyone still affected. Should you experience that your site is not running as expected, and that we are not already in touch with you, please reach out at contact@umbraco.com

We will provide an updated status here no later than at 15:00 CET

Posted Feb 28, 2019 - 13:55 CET

Update

Most sites are now back to normal operations and we are working on getting the last sites up and running again, as well as on the stability of the sites going forward. We will provide an updated status here no later than 14:00 CET

Posted Feb 28, 2019 - 12:55 CET

Update

We are seeing more sites becoming unavailable again and are working hard to mitigate the problems. We will update here with more info no later than at 13:00 CET.

Posted Feb 28, 2019 - 11:59 CET

Monitoring

Even more sites are now back in normal operations. A few sites are not yet online and we are handling these sites directly by reaching out and working with those affected to get their sites back online.

If you are experiencing that your site is not working as intended and we are not already in touch with you, please reach out to us on contact@umbraco.com and then we will have someone on the case immediately.

Next update will be no later than at: 12:00 CET

Posted Feb 28, 2019 - 10:59 CET

Update

Most sites are now back in normal operations. We are continuing work on making the remaining sites available again and will update here with an updated status no later than at 11:00 CET.

Should you experience something different, please reach out to our support on contact@umbraco.com

Posted Feb 28, 2019 - 09:54 CET

Update

We are getting sites back online, but some are still impacted. We are continuing our efforts to restore all services and will keep you updated here no later than at 10:00 CET.

Posted Feb 28, 2019 - 08:55 CET

Identified

We are still seeing unavailable Cloud sites. Everyone who is able to help with this is working to resolve the issue.

Next update will be no later than at 09:00 CET.

Posted Feb 28, 2019 - 07:59 CET

Investigating

We are seeing Cloud sites becoming unavailable and are working to solve the issue. We will update here with more info no later than at 08:00 CET.

Posted Feb 28, 2019 - 07:11 CET