The Hidden Costs of AWS: What You Don't Know Can Hurt Your Budget

Cloudtrim Team · 2026-02-25 · 8 min read

You're watching your EC2 instances. You've right-sized your databases. You're keeping an eye on the obvious line items.

And yet the bill keeps climbing.

The problem is usually what you're not watching. AWS has dozens of billing dimensions, and the most expensive surprises tend to hide in services nobody thinks to check. A badly placed NAT Gateway. Verbose logs nobody turned off. Cross-AZ traffic from a highly available architecture you're proud of.

This article covers the hidden costs we see most often when scanning accounts.

Why cloud waste is worse than you think

27% of cloud infrastructure spend is wasted, according to Flexera's 2025 State of the Cloud Report, which surveyed 759 IT professionals and executives.

That means $27 out of every $100 is going nowhere useful. And it's not improving much: Flexera tracked this figure at 32% four years ago. Four years of FinOps investment across the industry, and waste dropped by 5 percentage points.

Harness surveyed 700 engineering leaders and found that organizations take an average of 31 days to identify and eliminate cloud waste after it starts accumulating. By their projection, $44.5 billion in cloud infrastructure spend would be wasted in 2025 alone. That's a month of burning money before the waste is found and fixed.

HashiCorp and Forrester surveyed roughly 1,200 technology practitioners and found that 91% reported experiencing some cloud waste. The two leading causes were consistent: overprovisioning resources and leaving idle resources running.

Cloud waste isn't a niche problem affecting disorganized teams. It's the default state of the industry.

NAT Gateway: the most expensive surprise in your VPC

NAT Gateway charges $0.045/GB for data processing on top of the standard $0.09/GB internet egress fee, putting the real cost at $0.135/GB for internet-bound traffic.

That's 1.5x the advertised egress rate. Most teams see "$0.09/GB" in AWS pricing and budget accordingly, without realizing NAT Gateway adds another layer of charges on every byte that flows through it.

The worst scenario is routing S3 traffic through a NAT Gateway. S3 is a public AWS endpoint, but it's still within the AWS network. If your EC2 instances or Lambda functions pull from S3 through a NAT Gateway instead of a VPC endpoint, you pay NAT data processing fees on every byte, for no reason. One documented case generated $907 in NAT Gateway charges in a single day before anyone caught it.

The fix is free and takes about ten minutes: create an S3 Gateway VPC Endpoint. It routes S3 traffic privately within AWS, bypassing NAT Gateway entirely. DynamoDB has the same option.

Create S3 and DynamoDB Gateway VPC Endpoints — they're free

Check CloudWatch for NAT Gateway BytesOutToDestination vs BytesInFromDestination to spot misrouted traffic

Audit whether dev and staging environments have NAT Gateways that could use a less expensive setup
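To make the NAT Gateway arithmetic concrete, here is a minimal Python sketch using the rates quoted in this section. The function names are ours and rates vary by region. It also assumes same-region S3 traffic, which carries no separate transfer charge, so the NAT data processing fee is the only per-GB cost a Gateway endpoint removes.

```python
# Rates quoted in this article (USD per GB); check current pricing for your region.
NAT_PROCESSING_PER_GB = 0.045
INTERNET_EGRESS_PER_GB = 0.09


def nat_path_cost(gb: float, internet_bound: bool) -> float:
    """Per-GB charges for traffic that flows through a NAT Gateway.

    Internet-bound traffic pays NAT processing plus internet egress.
    Same-region S3 traffic (assumption) pays only the NAT processing fee,
    which a Gateway VPC Endpoint would avoid entirely.
    """
    per_gb = NAT_PROCESSING_PER_GB
    if internet_bound:
        per_gb += INTERNET_EGRESS_PER_GB
    return gb * per_gb


def gb_for_nat_charge(charge_usd: float) -> float:
    """Traffic volume that would produce a given NAT data processing charge."""
    return charge_usd / NAT_PROCESSING_PER_GB


if __name__ == "__main__":
    print(f"Internet-bound via NAT: ${nat_path_cost(1, True):.3f}/GB")
    print(f"10 TB/month of S3 via NAT: ${nat_path_cost(10_000, False):,.2f}")
    # Treats the whole charge as data processing, ignoring hourly fees.
    print(f"GB behind a $907 day: {gb_for_nat_charge(907):,.0f}")
```

The last line is a rough sanity check only: it shows the order of magnitude of S3 traffic needed to produce a bill like the one described above.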

Data transfer: the bill that shows up after launch

Cross-availability-zone data transfer accounts for approximately half of all data transfer costs, based on Datadog's analysis of real AWS billing data from hundreds of organizations.

This one hurts because it's the direct cost of doing things right. Multi-AZ deployments are the correct architecture for resilient systems. But every byte that travels between availability zones (app server to database, service to service) gets charged at $0.01/GB in each direction.

In a high-traffic application, this adds up fast. A service making 10,000 cross-AZ calls per second at 10KB each generates roughly 8.6TB of cross-AZ traffic daily. At $0.02/GB across both directions, that's about $173/day or roughly $5,200/month in data transfer charges that don't show up in your EC2 or RDS cost lines. They're buried in the EC2-Other category.
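If you want to run this estimate for your own services, a small Python sketch follows. It assumes decimal units (1 GB = 1,000,000 KB), the $0.01/GB rate charged in each direction, and a 30-day month. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # USD, charged on both the sending and receiving side


def cross_az_daily_gb(calls_per_second: float, kb_per_call: float) -> float:
    """Daily cross-AZ volume in GB, using decimal units."""
    return calls_per_second * kb_per_call * SECONDS_PER_DAY / 1_000_000


def cross_az_daily_cost(gb_per_day: float) -> float:
    """Daily charge, counting the fee in both directions."""
    return gb_per_day * CROSS_AZ_PER_GB_EACH_WAY * 2


if __name__ == "__main__":
    gb = cross_az_daily_gb(calls_per_second=10_000, kb_per_call=10)
    daily = cross_az_daily_cost(gb)
    print(f"{gb:,.0f} GB/day, ${daily:,.2f}/day, ${daily * 30:,.2f}/month")
```

Swap in your own call rate and payload size. The estimate is only as good as those two inputs, so measure them rather than guess.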

Teams typically discover this six months after deploying a new service, when the bill has grown and the architecture is entrenched.

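Group by Usage Type in Cost Explorer to see the cross-AZ vs internet breakdown: usage types containing DataTransfer-Regional-Bytes are largely cross-AZ traffic, and those containing DataTransfer-Out-Bytes are internet egress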

Use same-AZ affinity for non-critical internal traffic where latency tolerates it

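VPC endpoints for AWS services (S3, DynamoDB, SQS, SNS) keep that traffic off the NAT Gateway path. Gateway endpoints for S3 and DynamoDB are free; interface endpoints for services such as SQS and SNS carry their own hourly and per-GB charges, so compare before adding them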

Idle and oversized resources: still the biggest category

CAST AI's 2025 Kubernetes Cost Benchmark analyzed over 2,100 organizations and found average CPU utilization of just 10%, down from 13% the prior year.

This isn't just a Kubernetes problem. Datadog's analysis of hundreds of organizations' AWS billing data found that 83% of container costs come from idle resources: 54% from overprovisioned cluster infrastructure and 29% from oversized workload resource requests. The same report found 83% of organizations still spend an average of 17% of their EC2 budgets on previous-generation instance types — not oversized, just old. Current-generation equivalents typically cost 10–15% less and perform better.

These aren't edge cases. They're the median. The reason they persist is that right-sizing requires continuous monitoring and a willingness to act on the data. A resource appropriately provisioned at launch may be dramatically oversized 18 months later, once the workload has changed.

Review 14-day CPU and memory utilization in CloudWatch for all production instances

Flag anything with consistently less than 20% average CPU utilization for right-sizing review

Use AWS Compute Optimizer for automated right-sizing recommendations

Schedule non-production environments to stop outside working hours; this alone can cut dev/staging costs by 60–70%
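Two of the checks above reduce to simple arithmetic, sketched here in Python. The input is assumed to be CPU samples (percent) you have already exported from CloudWatch for the 14-day window. The instance IDs, function names, and the use of a plain mean are our assumptions. The savings function covers hourly-billed compute only, since storage keeps accruing while instances are stopped.

```python
from statistics import mean

HOURS_PER_WEEK = 168


def flag_underutilized(cpu_by_instance: dict[str, list[float]],
                       threshold_pct: float = 20.0) -> list[str]:
    """Return instance IDs whose mean CPU over the window is below the threshold."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_by_instance.items()
        if samples and mean(samples) < threshold_pct  # skip instances with no data
    )


def off_hours_savings(running_hours_per_week: float) -> float:
    """Fraction of always-on compute cost avoided by running part of the week."""
    return 1 - running_hours_per_week / HOURS_PER_WEEK


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    samples = {"i-web": [35.0, 42.0, 51.0], "i-batch": [4.0, 6.0, 9.0]}
    print(flag_underutilized(samples))
    # 12 hours x 5 days versus 10 hours x 5 days.
    print(f"{off_hours_savings(60):.0%} to {off_hours_savings(50):.0%}")
```

A mean hides bursts, so treat the flagged list as candidates for review rather than a resize order. Peak utilization and memory deserve a look before you act.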

Lambda: the billing rules changed in 2025

As of August 2025, AWS began billing for the Lambda INIT phase: the cold start initialization time that was previously free.

This applies to on-demand ZIP-packaged Lambda functions using managed runtimes. AWS stated that most users would see minimal impact, but functions with heavy initialization logic can see Lambda spend increase by 10–50%. Java and .NET functions that load large dependency trees at startup are most affected. So are functions that establish database connection pools during cold start.

The change is easy to miss because it doesn't appear as a new line item. It shows up as higher-than-expected duration charges, and INIT duration only appears in CloudWatch Logs if you're looking for it specifically in the REPORT log lines.

If you manage Lambda functions with Java runtimes or large deployment packages, run a CloudWatch Logs Insights query over the REPORT lines and compare Init Duration to Duration. Functions where Init Duration consistently exceeds invocation Duration are now materially more expensive than they were six months ago.

Query CloudWatch Logs Insights: filter @type = "REPORT" | parse @message /Init Duration: (?<initDuration>[\d.]+) ms/ | filter ispresent(initDuration) | stats avg(initDuration), avg(@duration)

Consider Lambda SnapStart for Java functions to eliminate repeated cold start overhead

Container image-based Lambda functions are not affected by this billing change, because their INIT time was already billed
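If you have exported REPORT lines instead of querying them in place, the same comparison can be sketched offline in Python. The parser below relies on the "Duration: N ms" and "Init Duration: N ms" fields that Lambda writes in its REPORT lines. The sample line, class name, and function names are illustrative, not taken from a real account.

```python
import re
from typing import NamedTuple, Optional

# "Duration" also appears inside "Billed Duration", "Init Duration" and
# "Restore Duration", so exclude those prefixes when matching the plain field.
_DURATION = re.compile(r"(?<!Billed )(?<!Init )(?<!Restore )Duration: ([\d.]+) ms")
_INIT = re.compile(r"Init Duration: ([\d.]+) ms")


class Report(NamedTuple):
    duration_ms: float
    init_ms: Optional[float]  # None for warm invocations


def parse_report(line: str) -> Optional[Report]:
    """Parse one log line; return None if it is not a REPORT line."""
    if not line.startswith("REPORT"):
        return None
    duration = _DURATION.search(line)
    if duration is None:
        return None
    init = _INIT.search(line)
    return Report(float(duration.group(1)), float(init.group(1)) if init else None)


def init_dominated(lines: list[str]) -> list[Report]:
    """Cold starts where initialization took longer than the invocation itself."""
    reports = (parse_report(line) for line in lines)
    return [r for r in reports
            if r and r.init_ms is not None and r.init_ms > r.duration_ms]


if __name__ == "__main__":
    # Made-up sample line for illustration.
    sample = ("REPORT RequestId: 11111111-2222-3333-4444-555555555555\t"
              "Duration: 120.50 ms\tBilled Duration: 2422 ms\t"
              "Memory Size: 512 MB\tMax Memory Used: 180 MB\t"
              "Init Duration: 2300.75 ms")
    print(init_dominated([sample]))
```

One init-heavy invocation proves little. What matters is how often cold starts happen for that function, so weigh the result against its invocation pattern.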

CloudWatch logs: the line item nobody audits

CloudWatch Logs ingestion costs $0.50/GB for EC2, EKS, and RDS logs, a flat rate that applies regardless of volume or relevance.

AWS introduced tiered pricing for Lambda-generated logs in May 2025, with rates dropping from $0.50/GB to as low as $0.05/GB at scale. The Duckbill Group estimated this saves large enterprises over 50% on their Lambda log spend. The discount only applies to Lambda. Everything else stays at the flat rate.

The source of log bloat is almost always the same: verbose log levels set during development that never got tuned for production. DEBUG logging in a high-traffic service generates roughly ten times the log volume of INFO. A service handling 1,000 requests per minute logging 2KB request and response payloads for debugging — debugging that nobody does anymore — generates 2.9GB of logs per day. That's $44/month per service, just in ingestion. Storage charges on top.
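The ingestion estimate above can be reproduced with a few lines of Python. This sketch assumes decimal units, the flat $0.50/GB rate, and an average month of 365/12 days. The function names are ours.

```python
MINUTES_PER_DAY = 1_440
DAYS_PER_MONTH = 365 / 12   # average month length (assumption)
INGESTION_PER_GB = 0.50     # USD, flat CloudWatch Logs ingestion rate


def daily_log_gb(requests_per_minute: float, kb_per_request: float) -> float:
    """Daily log volume in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return requests_per_minute * kb_per_request * MINUTES_PER_DAY / 1_000_000


def monthly_ingestion_cost(gb_per_day: float) -> float:
    """Monthly ingestion charge only; storage is billed separately."""
    return gb_per_day * DAYS_PER_MONTH * INGESTION_PER_GB


if __name__ == "__main__":
    gb = daily_log_gb(requests_per_minute=1_000, kb_per_request=2)
    print(f"{gb:.2f} GB/day, ${monthly_ingestion_cost(gb):.2f}/month")
```

Multiply the result by the number of services logging at that level to see what DEBUG in production costs across a fleet.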

And that's before retention. Most teams set a log retention policy once during initial setup, then forget it. CloudWatch log groups default to never expire.

Audit active log groups in every region with the AWS CLI, repeating the command per region with --region: aws logs describe-log-groups --query 'logGroups[?retentionInDays==`null`]'

Set retention to 30–90 days for most production logs; 7 days for dev environments

Switch from DEBUG to INFO in production service logging

Filter high-volume low-value log entries at the agent level before they reach CloudWatch
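To prioritize the retention cleanup, you can post-process the JSON that describe-log-groups returns. The Python sketch below assumes you saved the unfiltered output with --output json. It uses the logGroupName, storedBytes, and retentionInDays fields from that response; the function name and sample data are hypothetical.

```python
import json


def groups_without_retention(response: dict) -> list[tuple[str, int]]:
    """Log groups with no retention policy, largest stored volume first."""
    unbounded = [
        (group["logGroupName"], group.get("storedBytes", 0))
        for group in response.get("logGroups", [])
        if group.get("retentionInDays") is None  # field is absent when never-expire
    ]
    return sorted(unbounded, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical output, shaped like `aws logs describe-log-groups --output json`.
    raw = """
    {"logGroups": [
      {"logGroupName": "/aws/lambda/orders", "storedBytes": 52000000000},
      {"logGroupName": "/app/checkout", "storedBytes": 910000000, "retentionInDays": 30},
      {"logGroupName": "/app/legacy-batch", "storedBytes": 340000000000}
    ]}
    """
    for name, stored in groups_without_retention(json.loads(raw)):
        print(f"{name}: {stored / 1e9:.1f} GB stored, never expires")
```

Sorting by stored bytes puts the groups that cost the most to keep at the top, which is usually where a retention policy pays off first.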

The compound effect: why one-off audits don't work

McKinsey's analysis of more than $3 billion in cloud spending found that even organizations that had already done basic optimization had 10–20% in additional untapped savings.

One-off audits have a short shelf life. An instance you right-sized in January is oversized again by March when the project wound down. A NAT Gateway deployed for a migration is still running six months after the migration completed. Log groups from a decommissioned service accumulate indefinitely because no one ran a cleanup.

Cloud environments change constantly. Engineers spin up infrastructure in minutes, projects end without teardowns, services scale up and never scale back. The Harness finding, 31 days on average to identify and eliminate waste after it starts, means you're always playing catch-up.

The organizations that manage this well treat it as a continuous process, not a quarterly exercise. They have automated detection for idle resources, scheduled anomaly reports, and clear cost ownership by team.

84% of organizations say managing cloud spend is their top cloud challenge (Flexera, 2025).

The difficulty is real. But it's primarily an organizational and tooling problem, not a technical one. The data to identify this waste exists in AWS; the question is whether you're reading it systematically.

What to do next

None of the costs covered here show up in the default AWS Cost Explorer view. You have to know to look.

Four specific audits worth running this week:

NAT Gateway: compare data processed vs your expected internet egress. A large discrepancy usually means misrouted S3 or internal service traffic.

Cross-AZ transfer: group by Usage Type in Cost Explorer and compare DataTransfer-Regional-Bytes against DataTransfer-Out-Bytes. If cross-AZ charges dwarf your internet egress, you have optimization opportunities in your service mesh.

CloudWatch log groups: list all groups with no retention policy set. These will grow indefinitely.

Lambda REPORT logs: if you have Java or .NET functions, query for INIT Duration. Functions where INIT consistently exceeds Duration are candidates for SnapStart or refactoring.

Then set up ongoing monitoring so these costs don't creep back. The AWS bill is too dynamic and too complex to review manually each month across a portfolio of accounts.

Cloudtrim automates this: it scans your accounts daily, identifies the exact resources generating waste, and generates the CloudFormation or Terraform to fix it. Infrastructure code you can review and apply, not a list of suggestions.
