Welcome to the Umbraco status page. On this page, you can see the current operational status as well as plans for scheduled maintenance and automatic upgrades for all our cloud offerings: Umbraco Cloud and Heartcore.
Subscribe to updates above to get the latest status sent straight to your inbox.
If you’re experiencing issues with your cloud project that do not seem to relate to the current operational status, please go to Our Umbraco and search for the issue, or reach out to Umbraco Support in the portal chat.
During the period of Wednesday 27th of February 23:00 CET to Friday 1st of March 17:00 CET Umbraco Cloud suffered from a series of critical events that caused downtime and general instability for all sites hosted with the platform.
This post mortem outlines the events as they occurred, gives details on the root cause and explains what steps we are taking, based on these learnings, to improve and to prevent similar incidents.
The incident was first identified at 23:00 CET on the 27th and the general issue was partly resolved at midnight. However, the nature of the incident meant that recovery time for sites stretched into the morning of the 28th, with ~98% recovered at 06:00. At this point in time we began a root cause analysis, expecting that the incident was an isolated event.
At 06:45 CET on the 28th the issue repeated itself. At this point it was clear that this was not an isolated incident, and we immediately initiated an all-hands 24/7 schedule, began work on an escalated root cause analysis, called in all customer support staff and increased our information efforts to ensure that everyone affected was aware of the situation. We also called in additional external support and consultancy from our partners at Microsoft.
The second issue was partly resolved at 08:00 but recovery again stretched for several hours with ~90% recovered at 10:00.
At this point, the root cause was not clearly identified, but several work streams pointed in the direction of performance issues with a central component in our infrastructure orchestration services (see root cause below). From this point on, the services were continuously affected by the root cause in many different ways, causing general recovery to alternate between 75% and 98% until we finally reached our maintenance window on March 1st at 16:30 CET (see timeline below).
During the full period we suffered no fewer than 3 full outages and 9 partial outages of varying nature, causing a range of performance issues, error messages and timeouts for customers as their sites came back online or stopped responding correctly (see timeline below).
All these issues were resolved in the planned emergency maintenance window starting 16:30 CET on March 1st. At 17:30 CET ~98% of sites were fully recovered as we continued to apply additional improvements to the platform.
The maintenance closed as planned at 22:30 CET on the 1st of March. We maintained the 24/7 alert status during the whole weekend to ensure that we were available for anyone who reached out, and continued work to ensure all affected sites were properly recovered.
The incident officially closed on March 2nd at 14:18. At this point all sites, with very few exceptions handled on a 1:1 basis, were concluded to be working as expected.
The root cause of all these challenges was a severe performance issue with several tables in a database central to the orchestration and health service that generally controls the state of Umbraco Cloud. These tables all contain data collected from our systems used for health checks, load balance decisions and other automatic infrastructure decisions.
Once we had identified the root cause, we needed additional time to plan and test the improvements that we expected to carry out during maintenance, both to ensure that they would have a positive impact and to rule out any unwanted side effects.
During the maintenance window the database performance was corrected and a series of additional performance improvements were also carried out. This had the expected effect and finally resolved the issue.
Several steps are currently being carried out to improve the situation and prevent this type of incident in the future, including but not limited to:
Improved monitoring on central components of the orchestration service, including the database in question.
Full review of the database setup for additional performance gains, long term stability and better tolerance for failure.
A full review of performance and stability throughout Umbraco Cloud to limit the risk of another incident and limit the impact of any incidents that occur in the future.
A full review of our ability to provide insight into the status of our system to our customers during an incident.
We apologize for the inconvenience this incident has caused you and your customers. We want to underline that our work to ensure stable operations on Umbraco Cloud does not end with this update but is ongoing and of the highest priority to us.
CTO - Jacob Midtgaard-Olesen
Umbraco HQ