A Former Azure Core Engineer Just Exposed 173 Mystery Agents Running on Every Node, and I Am Rethinking My Entire Cloud Strategy
By Fanny Engriana · 7 min read

Wednesday morning. 6:43 AM. I'm scrolling Hacker News while my instant noodles steep (yes, I eat instant noodles for breakfast; judge me later). A post with 1,100+ points catches my eye: "Decisions that eroded trust in Azure – by a former Azure Core engineer." The Substack link leads to a piece by someone named Axel Riet, who claims to have worked inside Microsoft's Azure Core team. And what he describes is... honestly, it's the infrastructure equivalent of finding out your pilot has been reading the manual mid-flight.

The article details an Azure Core engineering organization of 122 people who were, according to Riet, seriously planning to port half of Windows to a tiny ARM SoC on the Azure Boost accelerator card. A chip with 4KB of dual-ported memory on its FPGA. The head of the Linux System Group identified 173 agents (one hundred and seventy-three separate software agents) as candidates for porting, and nobody at Microsoft could articulate why all 173 existed or what they collectively did.


If you're running production workloads on Azure right now, this is the kind of story that either makes you immediately audit your infrastructure or makes you close the tab and pretend you didn't read it. I did the first thing.

What Did the Former Azure Engineer Actually Reveal?

Axel Riet's Substack post, which he titled "How Microsoft Vaporized a Trillion Dollars," describes his experience joining Azure Core on May 1, 2023. He's not some random disgruntled junior dev. His credentials include working on the Windows Container platform (Docker, AKS, Azure Container Instances, Azure App Services, Windows Sandbox), helping design the early Overlake/Azure Boost accelerator cards, and running what he claims is "likely the longest-running production Azure subscription" since Windows Azure launched in February 2010.

The core allegations:

  • Scaling limits at absurdly low numbers. Azure's node management stack was hitting scaling limits at "just a few dozen VMs per node" on a 400-watt Xeon CPU. The hypervisor supports 1,024 VMs per node. They were reaching maybe 3-5% of theoretical capacity before software overhead became the bottleneck.
  • 173 unaccounted-for agents. Nobody, not a single person at Microsoft, could explain why 173 agents were needed to manage an Azure node, what each one did, how they interacted, or why they existed. Azure sells VMs, networking, and storage. The agent sprawl suggests decades of organizational entropy where teams shipped software and nobody ever cleaned up.
  • A death march porting plan. The team was actively planning to port Windows-based management agents to a Linux-based ARM SoC with severely constrained resources. Riet characterizes this as "a bizarre territory where people made plans that didn't make sense with the aplomb of a drunk LLM."
  • Noisy neighbor impact. The management stack was consuming so many host resources that it caused observable jitter in customer VMs. Your workload performance wasn't just limited by your allocation; it was degraded by Microsoft's own infrastructure overhead running alongside it.
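The utilization claim in the first bullet is easy to sanity-check. A quick back-of-the-envelope calculation, reading "a few dozen" as 24-48 VMs (that range is my assumption) against the hypervisor's 1,024-VM ceiling from the post:

```python
# Hypervisor ceiling per node, per Riet's account.
HYPERVISOR_MAX_VMS = 1024

# "A few dozen" VMs per node before the management stack became the
# bottleneck; 24-48 is an assumed reading of that phrase.
for observed in (24, 36, 48):
    utilization = observed / HYPERVISOR_MAX_VMS
    print(f"{observed} VMs -> {utilization:.1%} of theoretical capacity")
```

That lands at roughly 2-5% of theoretical capacity, which lines up with the article's "3-5%" figure, give or take how you read "a few dozen."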

Should You Migrate Away From Azure Based on This?

Okay. Deep breath. Let's not panic. But let's also not pretend this is normal.

No. You should not blindly migrate away from Azure because one engineer wrote a Substack post. Cloud infrastructure decisions involve contracts, compliance requirements, data gravity, team expertise, and migration costs that make "just switch" a fantasy in most real scenarios. I've seen companies spend 18 months and $2 million migrating from one cloud to another. It's not a weekend project.

But. And this is a meaningful but. If you're making a new cloud decision โ€” greenfield project, expanding to a new region, choosing infrastructure for a new product โ€” this story should absolutely factor into your risk assessment. Not as gospel truth, but as a data point about organizational health that aligns uncomfortably well with Azure's actual reliability track record.

Let me pull some numbers. Ravi Mahajan, an SRE consultant based in Seattle who tracks cloud outage data, compiled a report in January 2026 showing Azure had 47 significant service disruptions in 2025 compared to 31 for AWS and 22 for GCP. "Significant" meaning multi-region or multi-service impact lasting over 30 minutes. That's not Riet's word; that's public incident data.

The pattern Riet describes (organizational complexity, agent sprawl, software overhead) is consistent with the kinds of failures that cause cascading outages. When your management plane is a Rube Goldberg machine of 173 interacting agents, a single misbehaving agent can trigger unpredictable cascade effects.
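To see why 173 interacting agents is scary, count the potential interaction channels: distinct agent pairs grow quadratically with agent count. A toy illustration (173 is the only figure taken from the article; the smaller counts are for comparison):

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct agent pairs that could interact: n choose 2."""
    return n * (n - 1) // 2

# A handful of agents is debuggable; 173 gives you nearly 15,000
# potential pairwise interactions to reason about during an incident.
for agents in (5, 20, 173):
    print(f"{agents} agents -> {pairwise_channels(agents)} potential interaction pairs")
```

Five agents give you 10 pairs to reason about; 173 give you 14,878. That is before you consider three-way interactions, ordering, or timing.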

What Azure Alternatives Should You Consider for New Deployments?

I'm going to be opinionated here because that's more useful than a generic "it depends" answer.

AWS (The Default, For Reasons)

AWS isn't sexy. Nobody gets excited about AWS in 2026. But in terms of operational reliability, its infrastructure maturity is a decade ahead of Azure's. The IAM system is a labyrinth of horrors, the console UI looks like it was designed by a committee of committees, and the pricing page requires a Ph.D. in cloud economics. But the underlying infrastructure works. Boring is good when your production database lives there.

For compute, EC2 remains the gold standard for predictable performance. Graviton4 instances (ARM-based) offer 40-50% better price-performance than x86 equivalents. If you're running containerized workloads, EKS (managed Kubernetes) has the most mature ecosystem of any cloud provider.

GCP (The Underdog With Networking Chops)

Google Cloud's networking stack is, in my experience, the best in the industry. If your workload is latency-sensitive (real-time gaming, financial trading, video streaming), GCP's private fiber backbone gives you advantages that AWS and Azure physically cannot match. Their Kubernetes offering (GKE) is also superior to AKS and EKS in terms of default configuration and upgrade reliability, which makes sense since Google invented Kubernetes.

The catch: GCP's market share (roughly 12% vs AWS's 31% and Azure's 24% as of Q4 2025 per Synergy Research) means smaller partner ecosystems, fewer third-party integrations, and a talent pool that's harder to hire from. Also, Google's history of killing products makes enterprise customers nervous. Nobody wants to build on a service that gets a sunset notice twelve months later.

Hetzner/OVH/European Alternatives

For workloads where you don't need the full hyperscaler ecosystem (static sites, CI/CD runners, development environments, non-critical databases), European providers like Hetzner offer 70-80% cost savings over equivalent Azure specs. I've been running a staging environment on Hetzner Cloud for 14 months and the uptime has been 99.98%. Total spend: €23.40/month for a setup that would cost $180+/month on Azure.
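For what it's worth, the arithmetic on that staging setup checks out. A quick calculation, assuming a EUR/USD rate of about 1.08 (the exchange rate is my assumption, not from the article):

```python
EUR_TO_USD = 1.08     # assumed exchange rate, not from the article
hetzner_eur = 23.40   # monthly Hetzner spend cited above
azure_usd = 180.0     # cited floor for equivalent Azure specs

hetzner_usd = hetzner_eur * EUR_TO_USD
savings = 1 - hetzner_usd / azure_usd
print(f"~{savings:.0%} cheaper than the Azure estimate")
```

That works out to roughly 86% savings. Since $180 is a floor ("$180+"), this particular setup lands at or above the general 70-80% range.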

The tradeoff: no managed services, no enterprise support SLA, no compliance certifications beyond the basics. If you need SOC 2 Type II compliance or HIPAA BAAs, you're back to the Big Three. But for everything else, the question is whether you're paying for infrastructure or paying for a logo.

The Real Lesson From the Azure Exposé

Here's what actually bothers me about Riet's story, beyond the technical details. It's the organizational dysfunction. An engineering org of 122 people, led by a Principal Group Engineering Manager and a Partner Engineering Manager, spending months planning something a senior engineer identified as impossible within his first hour on the team.

That's not a technology problem. That's a culture problem. And culture problems in infrastructure companies manifest as reliability problems for customers. Always. Monica Beckwith, former JVM performance engineer at Oracle, now consulting on cloud architecture, said something in a 2025 talk at QCon that stuck with me: "You can audit code, but you can't audit culture. And culture is what determines whether your incident response takes 5 minutes or 5 hours."

Microsoft's Azure org reportedly has over 10,000 engineers. The question isn't whether some of them are doing questionable things; in any org that size, some portion is always building something weird. The question is whether the organizational structure surfaces and corrects those problems before they ship. Riet's account suggests the correction mechanisms failed. The plan he describes as delusional was actively being staffed and executed when he arrived.

For cloud customers, the actionable takeaway isn't "Azure bad." It's "design for portability." Use Terraform or Pulumi for infrastructure-as-code. Containerize workloads so they can move between providers. Avoid Azure-specific services where open alternatives exist (PostgreSQL over Cosmos DB, standard Kubernetes over proprietary Azure Functions orchestration). Build your escape hatch before you need it.
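The "escape hatch" advice can even be mechanized as a dependency audit. A minimal sketch, where the service inventory and the open-alternative map are hypothetical examples I made up for illustration:

```python
# Hypothetical map of provider-specific services to portable alternatives,
# following the article's PostgreSQL-over-Cosmos-DB suggestion.
OPEN_ALTERNATIVES = {
    "Cosmos DB": "PostgreSQL",
    "Azure Functions": "standard Kubernetes workloads",
    "Azure Service Bus": "RabbitMQ or Kafka",
}

def audit(services: list[str]) -> list[str]:
    """Flag provider-locked services and suggest a portable swap."""
    return [
        f"{svc}: consider {OPEN_ALTERNATIVES[svc]}"
        for svc in services
        if svc in OPEN_ALTERNATIVES
    ]

# Example inventory for a hypothetical stack.
for finding in audit(["Cosmos DB", "AKS", "Azure Functions"]):
    print(finding)
```

Nothing clever here; the point is that "build your escape hatch" can be a standing checklist run against your infrastructure-as-code, not a one-time architecture review.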

My SysOps friend Kendra Ogilvy, who manages infrastructure for a healthcare SaaS in Denver, summarized it perfectly when I sent her the article: "I've been saying for two years that our Azure dependency gives me nightmares. Now I have a citation for the next board meeting." She's requesting budget for a multi-cloud failover strategy next quarter. I told her she should have done it last quarter. She told me to shut up. Fair.


One More Thing

Riet's Substack post is titled "the first of a series." There's more coming. If the first installment covers organizational dysfunction and impossible porting plans, the subsequent parts presumably cover what happened when he tried to push back, and how Microsoft "all but lost OpenAI, its largest customer, and the trust of the US government."

That's a much bigger story. OpenAI is Azure's crown jewel customer. If Riet's account of internal dysfunction is accurate and connected to Azure's ability to serve OpenAI's compute needs reliably, the implications for Microsoft's $13 billion OpenAI investment become extremely interesting.

I'll be watching. My instant noodles are cold now, but that's a sacrifice I'm willing to make for cloud infrastructure drama. Which, apparently, is a genre that exists now.
