Machine learning: the AIOps system used by Azure to make the cloud more reliable


Cloud services change all the time, from adding new features to fixing bugs and security vulnerabilities; this is one of their big advantages over on-premises software. But every change is also an opportunity to introduce the bugs and regressions that are the main causes of cloud reliability and availability issues. To avoid such issues, Azure uses a secure deployment process that rolls updates out in phases, running them on increasingly larger rings of infrastructure and using continuous AI-powered monitoring to catch any issues that were missed during development and testing.

When Microsoft launched its Chaos Studio service for testing how workloads handle unexpected outages last year, Azure CTO Mark Russinovich explained the secure deployment process. “We go through a canary cluster as part of our secure deployment, which is an internal Azure region where we have synthetic testing and we have internal workloads that actually test services before they go live. This is the first production environment hit by the new service update code, so we want to make sure we can validate it and get a good idea of how good it is before moving it along and having it hit customers.”


After the canary region, the code rolls out to a pilot region, then a low-usage region, then a more heavily used region, and then gradually to all Azure regions (which are paired geographically, with updates going to one region in each pair first, then to its partner). Throughout this deployment process, he explained, “We have AIOps monitoring everything to look for regressions.”
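To make the ring idea concrete, here is a minimal sketch of a phased rollout gate in Python. It is not Microsoft's deployment tooling: the ring names, cluster lists, and the placeholder health check (standing in for Gandalf-style monitoring) are all assumptions for illustration.

```python
# Minimal sketch of a ring-based (phased) rollout gate; ring names, clusters,
# and the health check are illustrative only, not Azure's actual system.
from dataclasses import dataclass

@dataclass
class Ring:
    name: str
    clusters: list[str]

# Rings ordered from smallest to largest blast radius, as the article describes.
RINGS = [
    Ring("canary", ["canary-01"]),
    Ring("pilot", ["pilot-01", "pilot-02"]),
    Ring("low-usage-region", ["region-a"]),
    Ring("high-usage-region", ["region-b"]),
    Ring("broad", ["region-c", "region-d"]),
]

def deploy(payload: str, cluster: str) -> None:
    print(f"deploying {payload} to {cluster}")  # placeholder for a real deployment step

def ring_is_healthy(ring: Ring) -> bool:
    # Placeholder for AIOps-style monitoring: compare failure rates before and
    # after the deployment and return False if a regression is detected.
    return True

def rollout(payload: str) -> bool:
    for ring in RINGS:
        for cluster in ring.clusters:
            deploy(payload, cluster)
        if not ring_is_healthy(ring):
            print(f"regression detected in {ring.name}; halting rollout")
            return False
    return True

rollout("hostagent-2024.05.1")  # hypothetical payload name
```

The important property is simply that each ring only receives the payload after the previous, smaller ring has been judged healthy.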

AIOps (techniques that use big data, machine learning, and visualization to automate IT operations) can detect issues that developers cannot find by debugging their code, because they may be caused by dependencies or interactions that only come into play when the code is live, running, and used in conjunction with other Azure services.

A bad deployment can crash virtual machines, slow them down, make them slower to provision, or stop them from communicating; it can also affect monitoring agents, storage, telemetry, or control plane operations. But the same symptoms can be caused by hardware failure, transient network issues, or timeouts in service APIs, which rolling back the latest deployment would not resolve. There are hundreds of deployments per day on Azure, most of them targeting hundreds or thousands of clusters that all need to be monitored, and a single deployment can take anywhere from ten minutes to 18 hours. With thousands of components running in over 200 data centers across more than 60 regions, and with issues like a memory leak that might not show up for days, or might appear as very subtle problems in many clusters that add up to a significant problem across an entire region, it is difficult for human operators to determine exactly which change is causing a specific problem, especially when it is caused by an interaction with another component or service.

The AIOps system Microsoft uses, called Gandalf, “monitors deployment and health signals in the new release and over the long term, and finds correlations, even if [they’re] not obvious,” Microsoft said. Gandalf examines performance data (including CPU and memory usage) and failure signals (such as operating system crashes, node faults, and virtual machine restarts, as well as API call failures in the control plane), and takes in information from other Azure services that track failures, to detect issues and trace them back to specific deployments.

It knows when a deployment occurs and looks at the number of nodes, clusters, and customers a failure would affect, to recommend whether new code is safe to roll out across Azure or should be blocked, because problems in the canary region translate into significant problems in production.
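A rough sketch of that kind of decision is shown below. The thresholds, field names, and the rule for attributing failures to the new payload are invented for illustration; the real Gandalf models are far more sophisticated.

```python
# Illustrative sketch of correlating failures with a deployment and estimating
# blast radius; the decision rule and thresholds are assumptions, not Microsoft's.
from collections import Counter

def decide(failures: list[dict], deployed_nodes: set[str],
           max_failed_nodes: int = 5) -> str:
    """failures: records like {"node": "n1", "cluster": "c1", "customer": "contoso"}."""
    on_new = [f for f in failures if f["node"] in deployed_nodes]
    off_new = [f for f in failures if f["node"] not in deployed_nodes]

    # If nodes running the new payload fail noticeably more often than the rest,
    # attribute the failures to the deployment.
    correlated = len(on_new) > 2 * max(len(off_new), 1)

    impacted_nodes = {f["node"] for f in on_new}
    impacted_clusters = Counter(f["cluster"] for f in on_new)
    impacted_customers = {f["customer"] for f in on_new}

    if correlated and len(impacted_nodes) > max_failed_nodes:
        return (f"BLOCK: {len(impacted_nodes)} nodes, "
                f"{len(impacted_clusters)} clusters, "
                f"{len(impacted_customers)} customers affected")
    return "PASS"

failures = [
    {"node": "n1", "cluster": "c1", "customer": "contoso"},
    {"node": "n2", "cluster": "c1", "customer": "fabrikam"},
    {"node": "n3", "cluster": "c2", "customer": "contoso"},
]
print(decide(failures, deployed_nodes={"n1", "n2", "n3"}, max_failed_nodes=2))
```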

Gandalf captures failure information from one hour before and one hour after each deployment as streaming data in Azure Data Explorer (also known as Kusto), which is designed for fast analysis: it typically takes Gandalf about five minutes to make a deployment decision. It also tracks system behavior for 30 days after deployment to spot longer-term issues (those decisions take about three hours).
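As an illustration of pulling those one-hour windows out of Azure Data Explorer, the sketch below uses the azure-kusto-data Python client; the cluster URI, database, table, and column names are hypothetical, not Gandalf's actual schema.

```python
# Sketch of querying failure signals for the hour before and after a deployment
# from Azure Data Explorer (Kusto). The cluster URI, database, table, and
# columns below are made up for illustration.
from datetime import datetime, timedelta
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://example-aiops.kusto.windows.net")        # hypothetical cluster
client = KustoClient(kcsb)

deploy_time = datetime(2024, 5, 1, 12, 0)
start, end = deploy_time - timedelta(hours=1), deploy_time + timedelta(hours=1)

# NodeFaults is a made-up table; the window mirrors the one-hour-before/after
# capture the article describes.
query = f"""
NodeFaults
| where Timestamp between (datetime({start.isoformat()}) .. datetime({end.isoformat()}))
| extend Phase = iff(Timestamp < datetime({deploy_time.isoformat()}), "before", "after")
| summarize Faults = count() by Phase, ClusterId
"""

response = client.execute("aiops_db", query)          # hypothetical database name
for row in response.primary_results[0]:
    print(row["Phase"], row["ClusterId"], row["Faults"])
```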


This isn’t the only technique Microsoft uses to make Azure more resilient. “A memory leak caused by a regression in a new payload would be stopped by Gandalf. Meanwhile, we have a resiliency mechanism to automatically mitigate already deployed nodes with leaking issues, such as rebooting the node if there are no customer workloads on it, or live-migrating the running VMs if the node is not empty.”
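A minimal sketch of that mitigation choice, with placeholder function names rather than any real Azure API:

```python
# Sketch of the mitigation described above: reboot a leaking node if it is
# empty, otherwise live-migrate its VMs first. Names are placeholders.
def mitigate_leaking_node(node_id: str, running_vms: list[str]) -> str:
    if not running_vms:
        # No customer workloads: a reboot reclaims the leaked resources immediately.
        return f"reboot {node_id}"
    # Customer workloads present: move them without downtime, then reboot.
    migrations = ", ".join(f"live-migrate {vm}" for vm in running_vms)
    return f"{migrations}; then reboot {node_id}"

print(mitigate_leaking_node("node-17", []))
print(mitigate_leaking_node("node-42", ["vm-a", "vm-b"]))
```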

“AIOps is good at detecting naturally occurring patterns and making correlations based on historical data as well as training,” Microsoft said. The issues it finds relate to the new deployment payload, but there are other issues, like zero-day bugs, that Azure uses other techniques, such as chaos testing, to find. “Zero-day bugs can be triggered by rare workloads, manifest in both previous and new releases, and occur randomly, or have no strong correlation with the new deployment. Chaos testing can catch such bugs by randomly introducing failures and testing that the system holds up as expected.”
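The sketch below shows the general shape of chaos testing: inject random faults and check that an invariant still holds. Real tools such as Azure Chaos Studio work against live infrastructure; the fault list and the invariant here are toy assumptions.

```python
# Toy chaos-testing loop: randomly inject failures and verify the system still
# satisfies an invariant. Only a shape sketch, not a real chaos framework.
import random

FAULTS = ["kill_process", "drop_network", "exhaust_disk", "delay_api"]

def inject(fault: str, system: dict) -> None:
    system.setdefault("injected", []).append(fault)

def system_still_healthy(system: dict) -> bool:
    # Placeholder invariant: the system tolerates up to two concurrent faults.
    return len(system.get("injected", [])) <= 2

def chaos_run(rounds: int = 3, seed: int = 0) -> bool:
    random.seed(seed)
    system: dict = {}
    for _ in range(rounds):
        inject(random.choice(FAULTS), system)
        if not system_still_healthy(system):
            print("invariant violated:", system["injected"])
            return False
    return True

print(chaos_run())
```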

Gandalf has been running for nearly four years, initially for some key Azure infrastructure components, halting deployments that would otherwise have caused critical failures. It now covers more Azure components, including the Azure hosts themselves; “We create holistic monitoring solutions for Azure host resource health and block deployments that leak host resources like memory and disk space,” Microsoft said.

“We develop insights to ensure the quality of new releases of Azure infrastructure components using AIOps before a component is deployed to production. The key idea is to create a pre-production environment capable of running A/B tests for representative customer workloads. This pre-production environment also has a good representation of the settings in a production environment (hardware, VM SKUs, and so on). This system takes feedback from Gandalf, so similar issues caught in the production environment will be avoided before release.”
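At its simplest, that A/B idea reduces to comparing the same metric between the current and candidate builds under representative workloads; a minimal sketch, with an invented metric and threshold, might look like this:

```python
# Sketch of an A/B regression check between a control (current) and treatment
# (candidate) release; the metric, sample values, and threshold are invented.
from statistics import mean

def compare(control: list[float], treatment: list[float],
            max_regression: float = 0.05) -> str:
    """Compare a 'higher is worse' metric, e.g. P99 request latency in ms."""
    delta = (mean(treatment) - mean(control)) / mean(control)
    return "REGRESSION" if delta > max_regression else "OK"

control_p99 = [101.0, 99.5, 100.2, 98.8]      # current release
treatment_p99 = [109.3, 111.0, 108.4, 110.1]  # candidate release
print(compare(control_p99, treatment_p99))     # -> REGRESSION (about 10% worse)
```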

Gandalf now watches more signals. “We started exploring ideas for correlating signals across the Azure stack, from the data centers ([like] temperature and humidity) to the hardware, the host environment, and the customer experience,” a spokesperson said, and it is getting smarter at correlating failures: “We strive to give more weight to failures that affect critical customers or high-cost services.”
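Weighting failures by who they affect can be as simple as a weighted sum; the categories and weights in this sketch are invented for illustration.

```python
# Sketch of weighting failures by impact category; weights are made up.
WEIGHTS = {"critical_customer": 10.0, "high_cost_service": 5.0, "standard": 1.0}

def impact_score(failure_categories: list[str]) -> float:
    # Each observed failure is labeled with the category of what it affected.
    return sum(WEIGHTS.get(category, 1.0) for category in failure_categories)

print(impact_score(["standard", "critical_customer", "standard"]))  # 12.0
```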

It is also applied to changes to settings in Azure as well as to the components that make up the service. “In addition to payload deployment safety, we are developing intelligence to make all changes (configuration parameters) in production safe.”

AIOps for business

Gandalf is part of how Microsoft protects Azure, and as with other internal tools like its chaos engineering framework, the company is considering packaging some of these AIOps deployment techniques as a service to protect customer workloads.

Microsoft Defender for Cloud (formerly Azure Security Center) and the Sentinel cloud SIEM already use similar machine learning techniques for security, Russinovich noted. “AIOps operates effectively there to examine the data and determine where there is an incident. [In the same way] we’ve been using AIOps to look at telemetry inside Azure to understand if there’s a hardware or software regression or failure somewhere that will show up in the services underpinning the monitoring data we have, like Azure Monitor,” he suggested.

Microsoft already has Azure customers that operate at the same scale as its own internal teams, and large organizations are already using AIOps tools to manage their own infrastructure, so it makes sense to give them tools like these to work reliably at cloud scale.

Sherry J. Basler