How We Cut a SaaS Company's AWS Bill by 62%

The CTO sent us their last three AWS invoices before our first call: $44K, $46K, $47K, climbing about 3% per month. The company had 38 engineers, was growing well, and had been on AWS for four years. Nobody on the team had a clear picture of where the money was going. The answer they'd been giving investors was "infrastructure scales with usage." That was true. What was also true was that a significant chunk of the bill had nothing to do with usage at all.

Six weeks later, their bill was $18K. No application changes. No features cut. No corners taken on reliability. Here's the breakdown.

The audit: what we found

We start every FinOps engagement the same way: two days of pure read access, no changes, just understanding. AWS Cost Explorer, Trusted Advisor, CloudWatch metrics, and a lot of questions about what each service actually does in production. What we found was not unusual — it's what we find in almost every four-year-old AWS account.

1. Severely oversized EC2 instances

Fourteen EC2 instances in production, all selected during a period of rapid growth 18 months prior. None had been revisited. CloudWatch showed average CPU utilization across the fleet at 8%. The largest instance, an r5.4xlarge running their analytics service, averaged 4% CPU and 18% memory. It was running a workload that comfortably fit on an r5.large. This pattern repeated across the fleet: instances sized for peak load that never materialized, still running at that size 18 months later.

2. RDS instances that predated two product pivots

Three RDS instances — two PostgreSQL, one MySQL — that nobody on the current team could explain the purpose of. One had zero connections in the past 90 days. One was a replica of a database that had been migrated to a different region eight months ago and never decommissioned. The third was a staging environment that the team had stopped using when they moved to an ephemeral database strategy, but the instance kept running.
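One way to surface databases like these is to pull the DatabaseConnections metric from CloudWatch for each instance. The sketch below is an illustration, not the exact tooling used in this engagement: the helper names are ours, the 90-day window mirrors the one mentioned above, and boto3 is imported lazily so the pure helpers run without AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def connection_query(db_instance_id, days=90, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "DatabaseConnections",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": 86400,            # one datapoint per day
        "Statistics": ["Maximum"],  # daily peak connection count
    }


def looks_idle(datapoints):
    """True if no day in the window recorded a connection.

    An empty list (no metrics at all) also returns True, so treat the
    result as a candidate for review, not proof that the database is unused.
    """
    return all(dp["Maximum"] == 0 for dp in datapoints)


def check_instance(db_instance_id):
    """Query CloudWatch for one instance (needs AWS credentials)."""
    import boto3  # lazy import: the helpers above work without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**connection_query(db_instance_id))
    return looks_idle(response["Datapoints"])
```

A True result only starts the conversation. Monitoring agents and read replicas can hold connections open, and a quiet database may still be someone's quarterly reporting source.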

3. $8,400/month in NAT gateway data transfer

This one surprised the team. NAT gateway charges $0.045 per GB of data processed. Their EKS workloads were pulling container images from ECR and packages from S3 through the NAT gateway on every pod startup — including every Kubernetes node scaling event, every deployment rollout, every crash restart. Nobody had set up VPC endpoints for S3 and ECR, so all of that traffic was being metered at NAT rates. $8,400 a month for traffic that should cost essentially nothing.
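To get a feel for the volume behind that number, the arithmetic can be run backwards. This sketch assumes the whole $8,400 is per-GB processing charges and ignores the gateway's hourly fee, so it slightly overstates the traffic.

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # rate quoted above
MONTHLY_NAT_CHARGE_USD = 8_400     # from the invoice


def implied_gb(charge_usd, rate_usd_per_gb):
    """Data volume that would produce a given processing charge."""
    return charge_usd / rate_usd_per_gb


gb = implied_gb(MONTHLY_NAT_CHARGE_USD, NAT_PROCESSING_USD_PER_GB)
print(f"{gb:,.0f} GB/month, about {gb / 1000:,.0f} TB")
```

That works out to roughly 187 TB a month of image pulls and package downloads passing through the NAT gateway.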

4. EBS snapshots accumulating for three years

2,847 EBS snapshots. Automated backups running daily with no retention policy — or rather, a retention policy that had been set to "never delete" during a scare about data loss three years prior and never revisited. Most of these snapshots were of instances that no longer existed. At $0.05/GB-month, this had become a meaningful line item.

5. Dev and staging environments running 24/7

Full production-equivalent environments for development and staging, running around the clock. Engineers worked core hours in one time zone. The environments were idle 65% of the time. No scheduled scale-down, no instance hibernation, nothing.

6. Zero use of Spot Instances or Savings Plans

100% on-demand pricing across the entire fleet. For a four-year-old company with predictable baseline workloads, this was the most straightforward money left on the table.

What we changed

Right-sizing the EC2 fleet (week 1–2)

We right-sized all fourteen instances based on 90-day CloudWatch utilization data — not theoretical peak load. The rule we use: size for the 95th percentile of actual observed utilization, with one instance size of headroom. For most of their workload, that meant dropping two to three instance sizes. We used AWS Compute Optimizer's recommendations as a starting point, then validated each change against actual application performance metrics before committing.
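The sizing rule can be sketched in a few lines. This is a simplified illustration, not Compute Optimizer's algorithm: the size list, the 80% target utilization, and the sample data are assumptions made for the example. It relies on each step up within a family such as r5 doubling vCPU and memory. Note that EC2 memory utilization is only available in CloudWatch if the CloudWatch agent is installed.

```python
import math

# Sizes within one instance family, smallest to largest (illustrative subset).
SIZES = ["large", "xlarge", "2xlarge", "4xlarge"]
# Each step up doubles capacity, so relative capacity is 1, 2, 4, 8.
CAPACITY = {size: 2 ** i for i, size in enumerate(SIZES)}


def percentile(samples, pct):
    """Nearest-rank percentile of a list of utilization samples (0-100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(current_size, cpu_samples, mem_samples, target_util=80):
    """Smallest size whose p95 load stays under target_util, plus one size of headroom."""
    p95 = max(percentile(cpu_samples, 95), percentile(mem_samples, 95))
    # Demand expressed in units of one "large" instance.
    demand = CAPACITY[current_size] * p95 / 100
    fit = next(
        (i for i, s in enumerate(SIZES) if CAPACITY[s] * target_util / 100 >= demand),
        len(SIZES) - 1,
    )
    # One size of headroom, but never recommend growing past the current size.
    chosen = min(fit + 1, SIZES.index(current_size))
    return SIZES[chosen]


# Hypothetical samples: mostly idle with a couple of spikes.
cpu = [4] * 18 + [8, 30]
mem = [6] * 20
print(recommend("4xlarge", cpu, mem))
```

With these made-up samples the p95 is 8%, a large would fit, and the headroom rule lands on an xlarge, two sizes below the current instance.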

The migration strategy was rolling: blue/green swap with the new instance size, 24 hours of monitoring, then decommission the old one. Total engineering disruption: minimal. We did this during business hours with the team watching dashboards. No midnight windows.

Monthly savings from right-sizing: $9,200

Decommission the zombie RDS instances (week 1)

We first confirmed with every team that none of these databases were in use, which took two engineering-wide Slack messages and one 15-minute meeting. For the replica that had been running for eight months post-migration, we compared row counts between the source and the replica to confirm the replica held no newer data. Only then did we take final snapshots and delete the instances.

Monthly savings from RDS: $2,100

VPC endpoints for S3 and ECR (day 3)

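This was the highest-ROI change in the entire engagement and took about two hours to implement. Gateway endpoints for S3 are free. ECR needs two interface endpoints (ecr.api and ecr.dkr), which together cost about $14/month per AZ plus a small per-GB processing charge. Because ECR stores image layers in S3, the free gateway endpoint carries most of the image-pull traffic. The NAT gateway bill for those services dropped to essentially zero within 48 hours.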
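For readers scripting this with boto3, the three endpoints map onto three create_vpc_endpoint calls. The sketch below only builds the request parameters. The function names and every ID passed in are placeholders, and the security group on the interface endpoints is assumed to allow HTTPS from the cluster subnets.

```python
def endpoint_requests(region, vpc_id, route_table_ids, subnet_ids, security_group_ids):
    """Return one create_vpc_endpoint kwargs dict per endpoint."""
    gateway = {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": route_table_ids,  # routes are added to these tables
    }
    interfaces = [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{service}",
            "SubnetIds": subnet_ids,       # one subnet per AZ you want covered
            "SecurityGroupIds": security_group_ids,
            "PrivateDnsEnabled": True,     # existing ECR hostnames resolve privately
        }
        for service in ("ecr.api", "ecr.dkr")
    ]
    return [gateway] + interfaces


def apply(requests):
    """Create the endpoints (needs AWS credentials and a configured region)."""
    import boto3  # lazy import so endpoint_requests can be inspected offline

    ec2 = boto3.client("ec2")
    return [ec2.create_vpc_endpoint(**r)["VpcEndpoint"]["VpcEndpointId"] for r in requests]
```

Printing the output of endpoint_requests before calling apply is a cheap way to review exactly what will be created.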

Monthly savings from VPC endpoints: $8,100 (from $8,400 to ~$300 residual)

EBS snapshot lifecycle policy (week 2)

AWS Data Lifecycle Manager policy: keep the 7 most recent daily snapshots, delete everything older. For the 2,847 accumulated snapshots, we ran a targeted cleanup script — but carefully. We cross-referenced every snapshot against existing volumes, flagged anything that looked like it might be a one-time manual backup, and confirmed with the team before deleting. The actual deletion run took about 40 minutes. The storage bill for snapshots went from $1,800/month to under $200.
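The triage step can be sketched as a pure function over snapshot records shaped like the output of EC2's describe_snapshots. This is an illustration of the approach, not the script we ran: the CreatedBy tag used to recognize automated backups is hypothetical, and anything without it is routed to a human.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tag marking snapshots made by the automated backup job.
AUTOMATED_TAG = ("CreatedBy", "nightly-backup")


def triage(snapshots, live_volume_ids, keep_days=7, now=None):
    """Sort snapshots into keep / review / delete without deleting anything."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    plan = {"keep": [], "review": [], "delete": []}
    for snap in snapshots:
        tags = {t["Key"]: t["Value"] for t in snap.get("Tags", [])}
        automated = tags.get(AUTOMATED_TAG[0]) == AUTOMATED_TAG[1]
        orphaned = snap["VolumeId"] not in live_volume_ids
        if not automated:
            # Possibly a one-time manual backup: ask the team first.
            plan["review"].append(snap["SnapshotId"])
        elif snap["StartTime"] >= cutoff:
            plan["keep"].append(snap["SnapshotId"])
        else:
            reason = "source volume no longer exists" if orphaned else "older than retention window"
            plan["delete"].append((snap["SnapshotId"], reason))
    return plan
```

Keeping the plan separate from the deletion makes it easy to share the review and delete lists with the team before anything is removed.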

Monthly savings from snapshot cleanup: $1,600

Scheduled scale-down for dev/staging (week 3)

We set up EventBridge Scheduler schedules to stop all dev and staging EC2 and RDS instances at 8pm local time and start them at 7am, Monday through Friday, leaving them off all weekend. We wrote a small Lambda that sends a Slack notification 15 minutes before scheduled shutdown so engineers could override if they needed to run something overnight. Override rate in the first month: 4 times. Not worth leaving everything on 24/7 for that.
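As a rough check on what that schedule buys, the running hours can be counted directly. The cron strings below follow EventBridge Scheduler's six-field cron syntax and are our reading of the schedule described above. They would be paired with a time zone setting on the schedule, which is not shown here.

```python
# Schedule expressions for the stop and start schedules (local time zone
# is configured separately on each schedule).
STOP_CRON = "cron(0 20 ? * MON-FRI *)"   # 8pm, Monday through Friday
START_CRON = "cron(0 7 ? * MON-FRI *)"   # 7am, Monday through Friday

HOURS_PER_WEEK = 24 * 7


def weekly_running_hours(start_hour=7, stop_hour=20, workdays=5):
    """Hours per week the environments are up under the schedule."""
    return (stop_hour - start_hour) * workdays


hours = weekly_running_hours()
stopped_fraction = 1 - hours / HOURS_PER_WEEK
print(f"{hours} h/week running, {stopped_fraction:.0%} of the week stopped")
```

The environments run 65 of 168 hours. Compute charges stop while instances are stopped, but EBS volumes and RDS storage are still billed, which is why the dollar saving is smaller than the share of hours switched off.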

Monthly savings from scheduled shutdown: $3,400

Savings Plans for baseline compute (week 4–5)

After right-sizing, we had a stable picture of their baseline compute need. We purchased 1-year Compute Savings Plans covering 70% of their normalized baseline. The discount over on-demand pricing came to 30–36% on their instance mix. We deliberately left 30% on on-demand to cover growth and variable workloads. Savings Plans are a commitment, and you don't want to over-commit to a baseline that might shrink.
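The commitment math is simple enough to sketch. Savings Plans are purchased as an hourly dollar commitment, so the calculation starts from on-demand spend per hour. The $10/hour baseline and the 33% discount (the midpoint of the range above) are illustrative assumptions, not figures from this engagement.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def savings_plan_estimate(baseline_on_demand_per_hour, coverage=0.70, discount=0.33):
    """Return (hourly commitment, estimated monthly saving) in dollars."""
    covered = baseline_on_demand_per_hour * coverage  # on-demand spend moved under the plan
    commitment = covered * (1 - discount)             # what you commit to pay per hour
    monthly_saving = (covered - commitment) * HOURS_PER_MONTH
    return commitment, monthly_saving


commitment, monthly_saving = savings_plan_estimate(10.0)
print(f"commit ${commitment:.2f}/hour, save about ${monthly_saving:,.0f}/month")
```

The commitment is owed whether or not the compute is used, which is the reason for leaving part of the baseline uncovered.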

Monthly savings from Savings Plans: $4,800

The results

  • Month 1 bill after changes: $18,200 (from $47,000)
  • Total monthly savings: $28,800 (about 61% of the $47,000 bill; the six itemized changes above sum to $29,200, or 62%)
  • Annualized savings: ~$345,000
  • Zero reliability incidents from any of the changes
  • Engineering time saved: dev/staging environments now start fresh daily, which caught two environment drift bugs the team had been living with for months

"I knew we were overspending but I assumed it was 15–20%. When you showed me the NAT gateway line I just sat there for a minute. Eight thousand dollars a month for traffic that should be free."

The pattern we see every time

This engagement was typical. Not because companies are careless — but because AWS billing complexity is genuinely designed to obscure costs, and the people with the most context on infrastructure are also the people with the least time to audit it. The traps are consistent: instances sized for theoretical peaks, zombie resources from old experiments, data transfer costs nobody tracks, dev environments that run like production.

The fixes are also consistent. None of them require exotic tooling or significant engineering time. The hardest part is the audit (getting an honest picture of what you're running and why), not the remediation. We've now run this process across SaaS, FinTech, and HealthTech companies at every stage. The savings are almost always 40–70% of the current bill, with a payback period measured in weeks, not quarters.

Where to start if you're doing this yourself

If you're not ready to bring in outside help, start here:

  1. Open AWS Cost Explorer and sort by service for the last 90 days. Find the top 5 line items. Understand each one before touching anything.
  2. Check your EC2 instances in Compute Optimizer. If any show "Over-provisioned" with high confidence, that's low-hanging fruit.
  3. Check your NAT gateway spend in Cost Explorer. If it's material, check whether you have VPC endpoints for S3 and ECR — most teams don't.
  4. List your RDS instances and ask: when was each one last connected to? AWS doesn't make this obvious, but DatabaseConnections in CloudWatch does.
  5. Don't touch Savings Plans until you've right-sized. Committing to the wrong baseline locks you into paying for capacity you're about to delete.
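Step 1 in the list above can also be scripted against the Cost Explorer API. The sketch below ranks services from a get_cost_and_usage response grouped by SERVICE. The function names are ours, pagination via NextPageToken is ignored for brevity, and the sample response is made-up data used only to show the shape.

```python
from collections import defaultdict


def top_services(response, n=5):
    """Sum cost per service across all returned periods and return the top n."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


def fetch(start, end):
    """Call Cost Explorer (needs AWS credentials). Dates are 'YYYY-MM-DD', end exclusive."""
    import boto3  # lazy import so top_services works on saved responses

    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


# Made-up response fragment, for illustration only.
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Service A"], "Metrics": {"UnblendedCost": {"Amount": "100.5", "Unit": "USD"}}},
            {"Keys": ["Service B"], "Metrics": {"UnblendedCost": {"Amount": "40.0", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["Service B"], "Metrics": {"UnblendedCost": {"Amount": "90.0", "Unit": "USD"}}},
        ]},
    ]
}
print(top_services(sample, n=2))
```

The ranking only tells you where to look. Understanding each of the top line items still takes the questions described in the audit section.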

Want to know where your AWS spend is going?

We'll run a free cost audit and tell you exactly what's driving your bill — and what's safe to cut without touching reliability.
