Thursday, May 28, 2026 · 4 min read

The Hidden Cost of Cross-Region Data Egress in ML Pipelines

Training Pipes Team
[Hero image: world map with network connections]

Egress is the thing cloud bills teach you about the hard way. Training pipelines are especially bad offenders: data sitting in one region, compute running in another, and every epoch silently pushing terabytes across the boundary at a few cents per gigabyte.

This post is for platform and ML-ops leads who want to understand where the charges come from and how to stop them.

The Pricing That Bites

The short version of cloud egress pricing:

  • Same AZ: free or very cheap
  • Same region, across AZs: ~$0.01/GB each direction on AWS; free on some providers
  • Cross-region within a cloud: ~$0.02/GB
  • Out to the internet: ~$0.05-0.09/GB (tiered down with volume)
  • Cross-cloud: always priced like internet egress

One terabyte of cross-region transfer on AWS is about $20. Sounds trivial. Now multiply by your actual ML workload.
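
To make the tiers concrete, here's a back-of-the-envelope helper. The rates mirror the rough figures in the list above; they're assumptions, not anyone's actual price sheet, so swap in your own numbers.

  # Rough egress rates in $/GB, mirroring the tiers above (assumptions, not quotes)
  RATES = {
      "same_az": 0.00,
      "cross_az": 0.01,
      "cross_region": 0.02,
      "internet": 0.09,
  }

  def egress_cost(tb: float, tier: str) -> float:
      """Dollar cost of moving `tb` terabytes at the given tier."""
      return tb * 1_000 * RATES[tier]  # treating 1 TB as 1,000 GB for billing math

  print(egress_cost(1, "cross_region"))  # -> 20.0, the ~$20/TB figure above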

The Common Ways Teams Get Hit

1. Training in a Different Region Than Data

You provisioned H100 capacity in us-east-1 because that's where you could get GPU quota. Your data's in us-west-2 because that's where your ingestion pipeline runs. Every epoch reads the dataset, and every epoch's bytes cross the boundary.

A 10TB dataset read across 100 epochs is 1PB of egress. At $0.02/GB that's $20,000 for a single run.
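
Spelled out, with the same assumed $0.02/GB rate:

  dataset_tb, epochs, rate_per_gb = 10, 100, 0.02

  egress_tb = dataset_tb * epochs           # 1,000 TB = 1 PB over the full run
  cost = egress_tb * 1_000 * rate_per_gb    # -> $20,000
  print(f"{egress_tb:,} TB crosses the boundary -> ${cost:,.0f}")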

2. Multi-Node Distributed Training Without Local Caching

Eight nodes each independently fetch the same shard because there's no shared cache. Even in-region this is expensive; cross-region it's ruinous.
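
The multiplier is just node count. A quick sketch with assumed numbers:

  nodes, shard_set_tb, rate_per_gb = 8, 10, 0.02

  no_cache = nodes * shard_set_tb * 1_000 * rate_per_gb  # every node fetches every shard
  shared = shard_set_tb * 1_000 * rate_per_gb            # one fetch serves all nodes
  print(f"per epoch: ${no_cache:,.0f} without a shared cache vs ${shared:,.0f} with one")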

3. Model Checkpoints Replicated for DR

Checkpoints written in us-east-1, automatically replicated to eu-west-1 for disaster recovery. Reasonable policy, but a 500GB checkpoint every hour = 12TB/day = ~$240/day in replication egress alone.
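
The same arithmetic, ready for your own checkpoint cadence:

  checkpoint_gb, per_day, rate_per_gb = 500, 24, 0.02  # hourly checkpoints, assumed rate

  daily_gb = checkpoint_gb * per_day                   # 12,000 GB = 12 TB/day
  print(f"${daily_gb * rate_per_gb:,.0f}/day in replication egress")  # -> $240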

4. Data Prep in One Region, Training in Another

ETL runs in us-east-1 (where Snowflake lives). Training runs in us-west-2 (where GPUs are). The intermediate dataset shuttles back and forth.

5. Cross-Cloud Strategies Gone Wrong

Raw data in GCS, compute on AWS (or vice versa). Every byte crosses provider boundaries at internet-egress rates. Some teams do this intentionally; most discover it after a quarterly billing review.

The Architecture That Fixes It

The rule is simple: data reads should happen in the same region as the compute.

Whether you accomplish that via replication, caching, or relocation depends on the specifics.

Tactic 1: Colocate Compute and Data

When possible, run training in the region where the data already lives. This is usually the cheapest answer, but quota and capacity often don't cooperate.

Tactic 2: Replicate the Dataset to the Training Region

Keep the canonical copy wherever it already lives. Sync to the training region before a run. Works for small-to-medium datasets, breaks down at multi-PB scale.
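
A minimal version of that sync using boto3 server-side copies is below; the bucket names are hypothetical, and at real scale you'd reach for `aws s3 sync` or S3 Batch Operations rather than a Python loop. The key property: each byte crosses the region boundary exactly once.

  import boto3

  src = boto3.client("s3", region_name="us-west-2")  # canonical copy lives here
  dst = boto3.client("s3", region_name="us-east-1")  # training region

  paginator = src.get_paginator("list_objects_v2")
  for page in paginator.paginate(Bucket="canonical-datasets"):  # hypothetical bucket
      for obj in page.get("Contents", []):
          # Server-side copy into the training region: one crossing per object
          dst.copy(
              {"Bucket": "canonical-datasets", "Key": obj["Key"]},
              "training-datasets-use1",  # hypothetical destination bucket
              obj["Key"],
              SourceClient=src,
          )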

Tactic 3: Regional Caching Gateway

This is the architecture Training Pipes provides. Your canonical storage can be anywhere. A gateway in the training region caches the hot working set on local NVMe. Your cluster reads from the gateway at LAN speeds; only cache misses cross the region boundary, and only once.

For a typical training workload, where a hot set of roughly 20% of the data serves most reads, this cuts cross-region egress by 80%+.
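
Under the hood this is a read-through cache. A toy sketch of the access pattern, with a hypothetical bucket and NVMe path; the real gateway adds eviction, concurrency, and sharing across clients:

  import os
  import boto3

  CACHE_ROOT = "/mnt/nvme/cache"                    # hypothetical local NVMe path
  s3 = boto3.client("s3", region_name="us-west-2")  # canonical region

  def read(bucket: str, key: str) -> bytes:
      """Serve from the local cache; cross the region boundary only on a miss."""
      local = os.path.join(CACHE_ROOT, bucket, key)
      if not os.path.exists(local):                 # miss: pay cross-region egress once
          os.makedirs(os.path.dirname(local), exist_ok=True)
          s3.download_file(bucket, key, local)
      with open(local, "rb") as f:                  # hit: NVMe speeds, zero egress
          return f.read()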

Tactic 4: Per-Cluster Local Cache

Each training job copies the data it needs to local NVMe at job start. Works for short jobs; doesn't work when jobs share datasets or when the dataset exceeds local disk.
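
A sketch of that staging step, with an existence check so a warm node skips the refetch (paths and prefix are hypothetical). Unlike the gateway, everything is fetched up front and nothing is shared across jobs:

  import os
  import boto3

  SCRATCH = "/mnt/nvme/dataset"                     # hypothetical local scratch path
  s3 = boto3.client("s3", region_name="us-west-2")

  def stage(bucket: str, prefix: str) -> None:
      """Copy the job's input set to local NVMe once, at job start."""
      pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
      for page in pages:
          for obj in page.get("Contents", []):
              dest = os.path.join(SCRATCH, obj["Key"])
              if os.path.exists(dest):              # warm node: skip the transfer
                  continue
              os.makedirs(os.path.dirname(dest), exist_ok=True)
              s3.download_file(bucket, obj["Key"], dest)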

Worked Example

Let's redo the earlier scenario with different architectures:

Workload: 10TB dataset, 100 epochs, 80% hot / 20% cold reads per epoch, 8 nodes reading in parallel, canonical data in us-west-2, training in us-east-1.

  Architecture                    Cross-region egress                 Cost @ $0.02/GB
  Direct S3 reads, no cache       1,000 TB (100 × 10TB)               ~$20,000
  Local NVMe copy once per job    10 TB                               ~$200
  Regional caching gateway        ~8 TB (hot set + some cold)         ~$160
  Full dataset replication        10 TB one-time + replication lag    ~$200 + ongoing

The gateway wins on cost and on flexibility: it doesn't require up-front sync, it handles cold reads on demand, and the cache warms incrementally without blocking the job.
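
To sanity-check the table against your own workload (the gateway row takes the ~8 TB working-set figure as an assumption):

  rate = 0.02                  # assumed $/GB cross-region
  dataset_tb, epochs = 10, 100

  scenarios = {
      "direct S3 reads, no cache": dataset_tb * epochs,  # every epoch refetches
      "local NVMe copy per job": dataset_tb,             # one staging copy
      "regional caching gateway": 8,                     # assumed: hot set + some cold
      "full dataset replication": dataset_tb,            # one-time sync, ongoing lag extra
  }
  for name, tb in scenarios.items():
      print(f"{name}: {tb:,} TB -> ${tb * 1_000 * rate:,.0f}")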

Less Obvious Savings

Cache Persistence Across Runs

A caching gateway that keeps state between runs makes the second run on the same dataset nearly free from an egress perspective. This matters for research teams running many variants of the same base experiment.

Shared Cache Across Clusters

If two training clusters in the same region use the same gateway, they share the cached data. One fetch from the cold tier serves both.

Write Coalescing

Gateways can batch small writes into larger multipart uploads, reducing per-request overhead and capturing compression opportunities that single-object writes miss.
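
A simplified sketch of the coalescing idea against S3's multipart API (bucket and key are hypothetical; a real gateway also handles retries, ordering, and flush-on-close):

  import boto3

  s3 = boto3.client("s3")
  BUCKET, KEY = "checkpoints-use1", "run-42/shard.bin"  # hypothetical
  MIN_PART = 8 * 1024 * 1024    # S3 parts must be >= 5 MiB, except the last

  upload = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
  buf, parts = bytearray(), []

  def write(chunk: bytes) -> None:
      """Buffer small writes; ship a part only once enough bytes accumulate."""
      buf.extend(chunk)
      if len(buf) >= MIN_PART:
          flush()

  def flush(final: bool = False) -> None:
      if buf:
          n = len(parts) + 1
          resp = s3.upload_part(Bucket=BUCKET, Key=KEY, PartNumber=n,
                                UploadId=upload["UploadId"], Body=bytes(buf))
          parts.append({"ETag": resp["ETag"], "PartNumber": n})
          buf.clear()
      if final:
          s3.complete_multipart_upload(Bucket=BUCKET, Key=KEY,
                                       UploadId=upload["UploadId"],
                                       MultipartUpload={"Parts": parts})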

Lifecycle Integration

Object storage lifecycle rules can move cold data to cheaper tiers (Glacier, Archive) without affecting the gateway. You pay storage-per-GB rates on rarely-used data; the cache handles the hot path.
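
For example, a lifecycle rule like this (boto3, hypothetical bucket and prefix) moves objects older than 90 days to Glacier while the gateway keeps serving the hot path:

  import boto3

  s3 = boto3.client("s3", region_name="us-west-2")
  s3.put_bucket_lifecycle_configuration(
      Bucket="canonical-datasets",  # hypothetical bucket
      LifecycleConfiguration={
          "Rules": [{
              "ID": "cold-raw-data-to-glacier",
              "Status": "Enabled",
              "Filter": {"Prefix": "raw/"},  # hypothetical cold prefix
              "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
          }]
      },
  )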

What to Look For on Your Bill

Cloud bills bury cross-region transfer in several line items:

  • Data Transfer OUT: inter-region transfers to peers
  • Data Transfer INTER-REGION: the symmetric side
  • S3 Request charges: per-GET fees (high for FUSE-style mounts)
  • VPC endpoint / PrivateLink: sometimes charged separately
  • NAT Gateway processed data: gotcha for some VPC egress paths

Cost Explorer breakdowns by region pair are the fastest way to see where the leakage is.
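
On AWS, here's a starting point with the Cost Explorer API; the dates are placeholders, and inter-region transfers show up as region-pair usage types like USE1-USW2-AWS-Out-Bytes:

  import boto3

  ce = boto3.client("ce")  # Cost Explorer
  resp = ce.get_cost_and_usage(
      TimePeriod={"Start": "2026-04-01", "End": "2026-05-01"},  # placeholder month
      Granularity="MONTHLY",
      Metrics=["UnblendedCost"],
      GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
  )
  for group in resp["ResultsByTime"][0]["Groups"]:
      usage_type = group["Keys"][0]
      if "Bytes" in usage_type:  # transfer line items end in ...-Bytes
          print(usage_type, group["Metrics"]["UnblendedCost"]["Amount"])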

The Training Pipes Angle

When you create a Training Pipes mount, you pick a region. The gateway runs there. If your data lives elsewhere (our managed buckets in another region, or a BYO connection pointing anywhere), cross-region traffic happens once per cache miss, not once per read.

For customers moving from "direct S3 reads across region" to "gateway in training region," we typically see 80-95% reductions in cross-region transfer charges on training workloads. That's usually the first month's ROI.

Put a gateway in your training region →