Project Tardigrade safeguards your VMs against host faults

After several Azure changes and security improvements over the last couple of months, Microsoft is introducing Project Tardigrade as its newest attempt at making Azure more reliable.

Project Tardigrade protects VMs against platform failures

Project Tardigrade is a new service that aims to improve Azure resiliency. It includes mitigation strategies that protect Azure VMs against platform failures.

Here’s how Mark Russinovich, Chief Technology Officer at Microsoft Azure, describes the current work on Azure:

“Our goal is to empower organizations to run their workloads reliably on Azure. With this as our guiding principle, we are continuously investing in evolving the Azure platform to become fault resilient, not only to boost business productivity but also to provide a seamless customer experience.”

To prevent impact on your workloads, the service enables host components to self-heal and recover quickly from potential failures, even from critical host faults.

How does Project Tardigrade work?

Here’s an example of how the Tardigrade recovery workflow works:

  • Phase 1: This step has no impact to running customer VMs. It simply recycles all services running on the host. In the rare case that the faulted service does not successfully restart, we proceed to Phase 2.
  • Phase 2: Our diagnostics service runs on the host to collect all relevant logs/dumps systematically, to ensure that we can thoroughly diagnose the reason for failure in Phase 1. This comprehensive analysis allows us to ‘root cause’ the issue and thereby prevent reoccurrences in the future.
  • Phase 3: At a high level, we reset the OS into a healthy state with minimal customer impact to mitigate the host issue. During this phase we preserve the states of each VM to RAM, after which we begin to reset the OS into a healthy state. While the OS swiftly resets underneath, running applications on all VMs hosted on the server briefly ‘freeze’ as the CPU is temporarily suspended. This experience is similar to a network connection temporarily lost but quickly resumed due to retry logic. After the OS is successfully reset, VMs consume their stored state and resume normal activity, thereby circumventing any potential VM reboots.
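The escalation described in the three phases above can be sketched in a few lines of Python. This is only an illustrative model, not Microsoft's implementation or any Azure API: the `Host` class, the `recover` function, and the `restart` callback are hypothetical names invented for this example, and "preserving VM state to RAM" is stood in for by copying a dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Host:
    """Hypothetical model of a host server; not an Azure API."""
    services: Dict[str, bool]   # service name -> is it healthy?
    vm_states: Dict[str, str]   # VM name -> its in-memory state
    log: List[str] = field(default_factory=list)


def recover(host: Host, restart: Callable[[str], bool]) -> str:
    """Run the three-phase escalation and return the last phase reached."""
    # Phase 1: recycle every service on the host; VMs are not touched.
    for name in host.services:
        host.services[name] = restart(name)
    if all(host.services.values()):
        return "phase1"

    # Phase 2: collect diagnostics for the services that failed to restart.
    failed = [name for name, ok in host.services.items() if not ok]
    host.log.append(f"collected logs/dumps for: {', '.join(failed)}")

    # Phase 3: preserve VM state, reset the host OS, then restore the VMs.
    saved = dict(host.vm_states)   # stand-in for "preserve state to RAM"
    host.vm_states = {}            # VMs are briefly frozen during the reset
    host.services = {name: True for name in host.services}  # OS reset to healthy
    host.vm_states = saved         # VMs resume from stored state, no reboot
    return "phase3"


# Example: the (hypothetical) "agent" service refuses to restart,
# so the workflow escalates all the way to Phase 3.
host = Host(services={"agent": False, "network": True},
            vm_states={"vm1": "running"})
print(recover(host, restart=lambda name: name != "agent"))  # prints "phase3"
```

The point of the sketch is the ordering: the cheapest, zero-impact action is tried first, diagnostics are captured before anything destructive happens, and the OS reset only runs once the VM state has been set aside.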

With this in mind, Project Tardigrade is designed to ensure that the failure of any single component in the host does not impact the entire system. As a result, customer VMs should see at most a brief pause during a host fault rather than a reboot.

Microsoft is working to expand the range of host failure scenarios that Tardigrade covers, with the aim of making its cloud computing platform more reliable than ever.

Expect new developments and other reliability implementations in the near future.
