DEV Community

Daria Kovachevich for Solda


How We Cut Our Azure Cloud Costs by 3x — Solda.Ai’s Experience

At Solda.Ai, we build voice AI agents that handle high-volume outbound sales calls. Our platform operates entirely in the cloud, with Kubernetes at the core of our production stack — from call processing to API handling and analytics. As our call volume grows, it’s essential for us to keep infrastructure costs as low as possible to maintain a sustainable and scalable business.

I’m Igor, CTO at Solda.Ai, and together with Dmitrii, our Head of Development, we’ve spent the last few months optimizing our infrastructure for cost-efficiency and scalability. What follows is a breakdown of how we cut our Azure bill by more than 3x — while our traffic was actually growing.

Our infrastructure handles hundreds of thousands of outbound calls each month. At one point, our monthly Azure bill was around €25,000, which seriously strained our budget. In just a few months, we brought it down to €8,000 — here’s exactly how we did it.


Step 0: Fixing requests and limits (~€850)

The very first thing we did — and it may sound boring — was to align resources.requests and resources.limits in our Deployments. Without this, autoscaling kept extra capacity around "just in case." Once we tuned the values for our API, call services, admin panel, and analytics, we immediately saw around 3.4% in savings: Kubernetes finally knew how much each pod actually needed.

Tip: Don’t even think about scaling until your requests and limits make sense.
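As a quick illustration, requests and limits can be set imperatively with kubectl (the deployment name and values below are hypothetical, not our production figures):

```shell
# Hypothetical example: deployment name and values are illustrative.
# Requests are what the scheduler reserves; limits are the hard cap.
kubectl set resources deployment/api \
  --requests=cpu=250m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi
```

In practice you’d commit these values into the Deployment manifest (or Helm chart) rather than patching the live object, but the effect on scheduling is the same.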

Step 1: Enabling nodepool autoscaling (~€1,700)

Next, we turned on the managed Kubernetes Cluster Autoscaler for all nodepools. Previously, several VMs stayed up even during idle periods. Now, if there’s nothing to schedule, the autoscaler brings the node count down to zero. That alone gave us ~6.8% in savings.
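On AKS, enabling the cluster autoscaler with a floor of zero is a one-liner per nodepool. A sketch (the resource group, cluster, and pool names are placeholders; user nodepools can scale to zero, the system pool cannot):

```shell
# Hypothetical: enable the cluster autoscaler on a user nodepool,
# letting it scale all the way down to zero nodes when idle.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name callspool \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10
```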

Step 2: Our Go-based scale-to-zero operator (~€3,400)

We noticed that Kubernetes HPA never goes below 1 replica — even when there are no calls queued. We initially experimented with CPU-based metrics to drive scaling decisions, but found them too noisy and inconsistent for our use case. So we shifted to a business-level metric — the length of the outbound call queue — which gave us a much more reliable and actionable signal. This helped us scale down aggressively when the system was idle, without sacrificing responsiveness.

We therefore wrote a custom operator in Go. It watches the outbound call queue and sets replicas: 0 when the queue is empty. Once the pods drop to zero, Azure’s autoscaler shuts down the node — saving us ~13.6%. It took some time to iron out race conditions and logging, but it was well worth it.
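The operator itself is Go, but the core decision it makes (mapping queue length to a replica count) is simple enough to sketch. The function below is a hypothetical illustration: the thresholds, cap, and the deployment it would scale are made up, not our production values.

```shell
#!/bin/sh
# Hypothetical scale-to-zero decision: map outbound-queue length to replicas.
# Illustrative numbers only: one pod per 50 queued calls, capped at 20.
desired_replicas() {
  queue_len=$1
  if [ "$queue_len" -eq 0 ]; then
    echo 0  # empty queue: scale to zero so the cluster autoscaler can drop the node
    return
  fi
  n=$(( (queue_len + 49) / 50 ))  # round up
  [ "$n" -gt 20 ] && n=20         # cap to protect downstream systems
  echo "$n"
}

# The operator would then apply the result, e.g.:
# kubectl scale deployment/call-worker --replicas="$(desired_replicas "$queue_len")"
```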

If you’re looking to implement a similar pattern, note that KEDA can scale workloads (including down to zero) based on event sources such as queue length. We chose to write a custom operator instead, mainly to retain full control and to avoid depending on KEDA’s availability and support in our specific cloud setup.

Step 3: Splitting workloads across 6 nodepools (~€850)

We avoided the “everything in one bucket” trap by splitting workloads across six dedicated nodepools:

  1. calls (voice bots, D-series)
  2. api (HTTP API, B-series)
  3. admin (admin panel, B-series)
  4. analysis (real-time analysis, E-series)
  5. classification (batch jobs)
  6. misc (auxiliary services)

This allowed for more accurate autoscaling, giving us another ~3.4% savings.
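To keep workloads from mixing, each pool can be labeled and tainted so only the matching pods schedule onto it. A hypothetical sketch for a calls pool (the names, VM size, and counts are made up):

```shell
# Hypothetical: a dedicated nodepool for voice bots, with a label for
# nodeSelector targeting and a taint to keep other workloads off it.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name calls \
  --node-vm-size Standard_D4s_v5 \
  --labels workload=calls \
  --node-taints workload=calls:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10
```

Pods destined for this pool then carry the matching nodeSelector and toleration in their spec.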

Step 4: Spot VMs for non-critical jobs (~€10,200)

Our biggest win came from moving non-real-time workloads — call classification, background analysis, and even parts of the API — to Spot VMs. These are significantly cheaper but can be evicted at any time. We built in retry logic and increased API replica counts so the system could keep running even when Spot VMs got pulled. That alone cut roughly 40.8% off our total bill.

Important: Always check the pricing and availability of Spot VMs in your region. Some aren’t much cheaper than standard VMs; others live only 5–10 minutes — not ideal if your job takes longer.
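Creating a Spot nodepool on AKS is a matter of a few flags. A hypothetical sketch (names and sizes are placeholders; --spot-max-price -1 means "pay up to the on-demand price, never evict on price"):

```shell
# Hypothetical: a Spot-priced nodepool for interruptible batch work
# such as call classification. Nodes can be evicted at any time.
az aks nodepool add \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name classspot \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-vm-size Standard_D4s_v5 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10
```

AKS automatically taints Spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so the batch pods need a matching toleration to land there.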

Summary: From €25,000 to €8,000 in a few months

Each step built on the previous one. Here’s a simplified view of where the savings came from, each expressed as a share of the original €25,000 bill:

  • Step 0: Fixing requests/limits — ~3.4%
  • Step 1: Cluster autoscaling — ~6.8%
  • Step 2: Scale-to-zero operator — ~13.6%
  • Step 3: Splitting nodepools — ~3.4%
  • Step 4: Spot VMs — ~40.8%
  • Total estimated savings: ~68%

What’s even more impressive is that Solda.Ai’s total outbound call volume grew during this period — we cut costs while handling more traffic.


We brought our bill down from €25,000 to €8,000 without compromising SLA or system stability.

The key: monitor everything, scale smart, and automate aggressively.

  • Use IaC (Terraform / ARM)
  • Simulate Spot VM interruptions in staging
  • Set up custom monitoring and alerting for idle infrastructure

Hope this helps your team too!

✍️ Want to learn more about our custom Kubernetes operator or multi-zone setup? Drop us a comment!

Top comments (2)

Nevo David

yeah that's some crazy cost cutting. always wonder if most folks actually track this close or just eat the big bills. hats off for pushing so hard.

Duc Nguyen Thanh

good idea