Project Tardigrade prevents platform failures
Project Tardigrade is a new service that aims to improve Azure resiliency. It includes mitigation strategies that protect Azure VMs against platform failures.
Here’s how Mark Russinovich, Chief Technology Officer at Microsoft Azure, describes the current work on Azure:
“Our goal is to empower organizations to run their workloads reliably on Azure. With this as our guiding principle, we are continuously investing in evolving the Azure platform to become fault resilient, not only to boost business productivity but also to provide a seamless customer experience.”
To prevent impact on your workloads, the service enables host components to self-heal and recover quickly from potential failures, even during critical host faults.
How does Project Tardigrade work?
Here’s an example of how the Tardigrade recovery workflow works:
- Phase 1: This step has no impact on running customer VMs. It simply recycles all services running on the host. In the rare case that the faulted service does not restart successfully, we proceed to Phase 2.
- Phase 2: Our diagnostics service runs on the host to systematically collect all relevant logs and dumps, so that we can thoroughly diagnose the reason for the failure in Phase 1. This comprehensive analysis allows us to ‘root cause’ the issue and thereby prevent recurrences in the future.
- Phase 3: At a high level, we reset the OS into a healthy state with minimal customer impact to mitigate the host issue. During this phase we first preserve the state of each VM in RAM, and then begin resetting the OS. While the OS swiftly resets underneath, running applications on all VMs hosted on the server briefly ‘freeze’ as the CPU is temporarily suspended. The experience is similar to a network connection that is temporarily lost but quickly resumes thanks to retry logic. After the OS is successfully reset, the VMs consume their stored state and resume normal activity, thereby avoiding any VM reboots.
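The escalation across the three phases can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Microsoft's implementation: the `Host` class, its fields, and the `recover` function are hypothetical names invented for illustration, and the "OS reset" is modelled by clearing and restoring a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    """Toy model of a host (hypothetical; not an Azure API)."""
    vm_states: Dict[str, str]      # VM name -> in-memory state
    restart_fixes_fault: bool      # does restarting services clear the fault?
    log: List[str] = field(default_factory=list)


def recover(host: Host) -> str:
    """Walk through the three phases, escalating only when needed."""
    # Phase 1: recycle host services; running VMs are not touched.
    host.log.append("phase1: restart services")
    if host.restart_fixes_fault:
        return "recovered_in_phase1"

    # Phase 2: collect logs and dumps so the fault can be root-caused later.
    host.log.append("phase2: collect diagnostics")

    # Phase 3: preserve VM state in RAM, reset the host OS, then resume VMs.
    preserved = dict(host.vm_states)
    host.log.append("phase3: preserve VM state")
    host.vm_states = {}            # stands in for the OS reset
    host.log.append("phase3: reset host OS")
    host.vm_states = preserved     # VMs consume their stored state
    host.log.append("phase3: resume VMs")
    return "recovered_in_phase3"
```

Calling `recover(Host({"vm-a": "s1"}, restart_fixes_fault=False))` walks through all three phases and leaves `vm_states` unchanged, which mirrors the point that the VMs resume from their preserved state instead of rebooting.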
With this in mind, Project Tardigrade aims to ensure that the failure of any single component in the host does not impact the entire system. As such, customer VMs should see at most a brief pause, not a reboot, when a host fault occurs.
Microsoft is working to expand the range of host failure scenarios that the service can handle, with the goal of making its cloud computing platform more reliable.
Expect new developments and other reliability implementations in the near future.