Thursday, May 28, 2026·4 min read
The Hidden Cost of Cross-Region Data Egress in ML Pipelines
Egress is the thing cloud bills teach you about the hard way. Training pipelines are especially bad offenders: data sitting in one region, compute running in another, and every epoch silently pushing terabytes across the boundary at a few cents per gigabyte.
This post is for platform and ML-ops leads who want to understand where the charges come from and how to stop them.
The Pricing That Bites
The short version of cloud egress pricing:
- Same AZ: free or very cheap
- Same region, across AZs: ~$0.01/GB on AWS; free on some providers
- Cross-region within a cloud: ~$0.02/GB
- Out to the internet: ~$0.05-0.09/GB (tiered down with volume)
- Cross-cloud: almost always priced at internet-egress rates
One terabyte of cross-region transfer on AWS is about $20. Sounds trivial. Now multiply by your actual ML workload.
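A back-of-the-envelope calculator makes the tiers concrete. The rates below are the rough figures from the list above, not a quoted price sheet; real prices vary by provider, region pair, and volume tier:

```python
# Rough egress rates in USD per GB, taken from the tiers above.
EGRESS_USD_PER_GB = {
    "same_az": 0.00,
    "cross_az": 0.01,
    "cross_region": 0.02,
    "internet": 0.09,
}

def egress_cost_usd(gigabytes: float, path: str) -> float:
    """Cost of moving `gigabytes` across the given network boundary."""
    return gigabytes * EGRESS_USD_PER_GB[path]

# One terabyte (1,000 GB) cross-region:
print(egress_cost_usd(1_000, "cross_region"))  # 20.0
```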
The Common Ways Teams Get Hit
1. Training in a Different Region Than Data
You provisioned H100 capacity in us-east-1 because that's where Nvidia had quota. Your data's in us-west-2 because that's where your ingestion pipeline runs. Every epoch reads the dataset, and every epoch's bytes cross the boundary.
A 10TB dataset read across 100 epochs is 1PB of egress. At $0.02/GB that's $20,000 for a single run.
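The same arithmetic applies to any run where every epoch re-reads the full dataset across the boundary. A quick sketch (the dataset size, epoch count, and rate are this scenario's numbers, not universal constants):

```python
def run_egress_usd(dataset_gb: float, epochs: int, rate_usd_per_gb: float = 0.02) -> float:
    """Egress cost when every epoch re-reads the full dataset cross-region."""
    return dataset_gb * epochs * rate_usd_per_gb

# 10 TB dataset, 100 epochs, cross-region:
print(run_egress_usd(10_000, 100))  # 20000.0
```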
2. Multi-Node Distributed Training Without Local Caching
Eight nodes each independently fetch the same shard because there's no shared cache. Even in-region this is expensive; cross-region it's ruinous.
3. Model Checkpoints Replicated for DR
Checkpoints written in us-east-1, automatically replicated to eu-west-1 for disaster recovery. Reasonable policy, but a 500GB checkpoint every hour = 12TB/day = ~$240/day in replication egress alone.
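The replication math, spelled out with the numbers from this scenario:

```python
checkpoint_gb = 500           # one checkpoint
snapshots_per_day = 24        # hourly checkpointing
rate_usd_per_gb = 0.02        # cross-region replication rate

daily_gb = checkpoint_gb * snapshots_per_day     # 12,000 GB = 12 TB/day
daily_cost = daily_gb * rate_usd_per_gb
print(daily_gb, round(daily_cost))  # 12000 240
```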
4. Data Prep in One Region, Training in Another
ETL runs in us-east-1 (where Snowflake lives). Training runs in us-west-2 (where GPUs are). The intermediate dataset shuttles back and forth.
5. Cross-Cloud Strategies Gone Wrong
Raw data in GCS, compute on AWS (or vice versa). Every byte crosses provider boundaries at internet-egress rates. Some teams do this intentionally; most discover it after a quarterly billing review.
The Architecture That Fixes It
The rule is simple: data reads should happen in the same region as the compute.
Whether you accomplish that via replication, caching, or relocation depends on the specifics.
Tactic 1: Colocate Compute and Data
When possible, run training in the region where the data already lives. This is usually the cheapest answer, but quota and capacity often don't cooperate.
Tactic 2: Replicate the Dataset to the Training Region
Keep the canonical copy wherever. Sync to the training region before a run. Works for small-to-medium datasets, breaks down at multi-PB scale.
Tactic 3: Regional Caching Gateway
This is the architecture Training Pipes provides. Your canonical storage can be anywhere. A gateway in the training region caches the hot working set on local NVMe. Your cluster reads from the gateway at LAN speeds; only cache misses cross the region boundary, and only once.
For a typical training workload with a 20% hot ratio, this cuts cross-region egress by 80%+.
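The read-through behavior is easy to model. Below is a toy sketch of the caching pattern, not Training Pipes' actual implementation; the shard names and sizes are made up for illustration:

```python
class ReadThroughCache:
    """Toy regional gateway: each object crosses the region boundary at most once."""

    def __init__(self, remote_fetch):
        self._remote_fetch = remote_fetch   # the expensive cross-region read
        self._cache = {}
        self.remote_bytes = 0               # bytes that actually crossed the boundary

    def get(self, key):
        if key not in self._cache:          # cache miss: fetch once, keep locally
            data = self._remote_fetch(key)
            self.remote_bytes += len(data)
            self._cache[key] = data
        return self._cache[key]

# Simulate 8 nodes each reading the same 5 shards for 100 epochs:
gateway = ReadThroughCache(remote_fetch=lambda key: b"x" * 1_000)  # 1 KB per shard
for _ in range(100 * 8):
    for shard in ("s0", "s1", "s2", "s3", "s4"):
        gateway.get(shard)
print(gateway.remote_bytes)  # 5000: each shard crossed once, not 800 times
```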
Tactic 4: Per-Cluster Local Cache
Each training job copies the data it needs to local NVMe at job start. Works for short jobs; doesn't work when jobs share datasets or when the dataset exceeds local disk.
Worked Example
Let's redo the earlier scenario with different architectures:
Workload: 10TB dataset, 100 epochs, 80% hot / 20% cold reads per epoch, 8 nodes reading in parallel, canonical data in us-west-2, training in us-east-1.
| Architecture | Cross-region egress | Cost @ $0.02/GB |
|---|---|---|
| Direct S3 reads, no cache | 1,000 TB (100 × 10TB) | ~$20,000 |
| Local NVMe copy once per job | 10 TB | ~$200 |
| Regional caching gateway | ~8 TB (hot set fetched once, plus on-demand cold misses) | ~$160 |
| Full dataset replication | 10 TB (one-time) + replication lag | ~$200 + ongoing |
The gateway wins on cost and on flexibility: it doesn't require up-front sync, it handles cold reads on demand, and the cache warms incrementally without blocking the job.
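The first rows of the table are straightforward to verify:

```python
dataset_gb = 10_000    # 10 TB
epochs = 100
rate = 0.02            # $/GB cross-region

direct = dataset_gb * epochs * rate   # every epoch re-crosses the boundary
local_copy = dataset_gb * rate        # one full copy per job
hot_set = 8_000 * rate                # gateway: ~8 TB hot set crosses once

print(f"${direct:,.0f} vs ${local_copy:,.0f} vs ${hot_set:,.0f}")
```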
Less Obvious Savings
Cache Persistence Across Runs
A caching gateway that keeps state between runs makes the second run on the same dataset nearly free from an egress perspective. This matters for research teams running many variants of the same base experiment.
Shared Cache Across Clusters
If two training clusters in the same region use the same gateway, they share the cached data. One fetch from the cold tier serves both.
Write Coalescing
Gateways can batch small writes into larger multipart uploads, reducing per-request overhead and capturing compression opportunities that single-object writes miss.
Lifecycle Integration
Object storage lifecycle rules can move cold data to cheaper tiers (Glacier, Archive) without affecting the gateway. You pay storage-per-GB rates on rarely-used data; the cache handles the hot path.
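On S3 this is a single lifecycle rule. The bucket name and prefix below are placeholders; the rule shape follows the S3 lifecycle configuration API, and the 90-day threshold is an arbitrary example:

```python
# Hypothetical prefix and transition window -- adjust to your data layout.
lifecycle = {
    "Rules": [
        {
            "ID": "cold-training-data-to-glacier",
            "Filter": {"Prefix": "datasets/cold/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},  # rarely-read shards
            ],
        }
    ]
}

# Applying it (requires boto3 and credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-training-data", LifecycleConfiguration=lifecycle
# )
```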
What to Look For on Your Bill
Cloud bills bury cross-region transfer in several line items:
- Data Transfer OUT: inter-region transfers to peers
- Data Transfer INTER-REGION: the symmetric side
- S3 Request charges: per-GET fees (high for FUSE-style mounts)
- VPC endpoint / PrivateLink: sometimes charged separately
- NAT Gateway processed data: gotcha for some VPC egress paths
Cost Explorer breakdowns by region pair are the fastest way to see where the leakage is.
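On AWS, the same breakdown can be pulled programmatically. Here's a sketch of a Cost Explorer request grouped by region; the usage-type-group string in the filter is a placeholder, so check your own bill for the exact names:

```python
# Request shape for Cost Explorer's get_cost_and_usage, grouped by region.
request = {
    "TimePeriod": {"Start": "2026-04-01", "End": "2026-05-01"},
    "Granularity": "MONTHLY",
    "Metrics": ["UnblendedCost"],
    "GroupBy": [{"Type": "DIMENSION", "Key": "REGION"}],
    "Filter": {
        "Dimensions": {
            "Key": "USAGE_TYPE_GROUP",
            # Placeholder value -- the exact group names vary; list them from
            # your bill or the Cost Explorer console first.
            "Values": ["EC2: Data Transfer - Inter AWS Region (Out)"],
        }
    },
}

# Executing it (requires boto3 and credentials):
# import boto3
# ce = boto3.client("ce")
# response = ce.get_cost_and_usage(**request)
```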
The Training Pipes Angle
When you create a Training Pipes mount, you pick a region. The gateway runs there. If your data lives elsewhere (our managed buckets in another region, or a BYO connection pointing anywhere), cross-region traffic happens once per cache miss, not once per read.
For customers moving from "direct S3 reads across region" to "gateway in training region," we typically see 80-95% reductions in cross-region transfer charges on training workloads. That's usually the first month's ROI.