Monday, May 4, 2026·4 min read

AWS EFS vs Training Pipes: A Cost Breakdown for ML Workloads

Training Pipes Team

When a team tells us "we use EFS for training data," we ask what their monthly bill looks like. The numbers are usually larger than they expect. This post walks through a concrete cost comparison between AWS EFS and Training Pipes for a realistic ML training workload.

The Scenario

Let's model a common setup:

  • Dataset: 50 TB of training data (sharded WebDataset tars)
  • Hot working set: 8 TB (the shards a given run actually touches)
  • Training: one large run per week, plus ad-hoc experiments
  • Cluster: 8 × H100 nodes in us-east-1
  • Data origin: canonical copy lives in S3

We'll compare three ways to feed this data to the GPUs.

Option 1: S3 Only (No Caching)

Pure S3 access with a FUSE mount or native S3 reads from the DataLoader.

Monthly costs:

  • S3 Standard storage (50 TB): $1,150
  • GET requests (estimating 100M/month across all runs): $40
  • In-region data transfer: $0 (same region)

Monthly total: ~$1,190

Cheap! But GPU utilization sits at 40-60% because the DataLoader spends much of its time waiting on S3 latency. With $25k/month of GPU spend, even a 40% idle rate means $10,000/month of wasted compute.

Option 2: AWS EFS (Standard)

Move the dataset into EFS and mount it on each node.

Monthly costs:

  • EFS Standard storage (50 TB = 51,200 GB @ $0.30/GB): $15,360
  • EFS read throughput (Elastic mode, ~$0.03/GB read): varies heavily with access patterns; at an estimated 400 TB read/month, ~$12,000
  • Data sync from S3 to EFS: one-time cost, but engineering time adds up

Monthly total: ~$27,360+

You get POSIX semantics. You also pay filesystem rates on the full 50 TB stored, even though only 8 TB is hot at any time. The Infrequent Access storage class cuts the per-GB rate, but it moves data between tiers on access, which complicates your pipeline.
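The EFS line items above reduce to a few lines of arithmetic (rates as quoted in this post; check current AWS pricing before relying on them):

```python
# EFS cost estimate, using the rates quoted above.
EFS_STORAGE_PER_GB = 0.30  # $/GB-month, EFS Standard
EFS_READ_PER_GB = 0.03     # $/GB read, Elastic throughput

storage_gb = 50 * 1024     # 50 TB stored (binary TB, matching the $15,360 figure)
reads_gb = 400 * 1000      # ~400 TB read/month, this post's rough estimate

monthly = storage_gb * EFS_STORAGE_PER_GB + reads_gb * EFS_READ_PER_GB
print(round(monthly))      # 27360
```

Note that the read charge alone is comparable to the storage charge: rerunning experiments against the same shards keeps metering even though the bytes never change.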

Option 3: Training Pipes (Managed Bucket + Regional Gateway)

Keep the canonical copy in object storage (either managed by us or your existing S3 via BYO). A regional gateway in us-east-1 caches the hot 8 TB on NVMe and serves NFS to your cluster.

Monthly costs:

  • Object storage (50 TB): ~$750
  • Gateway + NFS mount in us-east-1: included in plan tier (Pro: $49/mo handles this workload size)
  • Cache storage: included
  • No per-request fees for cached reads

Monthly total: ~$800-1,200 (depending on tier)

You get real NFS. You get caching. You don't pay EFS prices for data you don't touch.

The Apples-to-Apples Comparison

| Metric | S3 Only | EFS | Training Pipes |
|---|---|---|---|
| Monthly storage + access | $1,190 | $27,360 | ~$1,100 |
| POSIX filesystem? | No (FUSE is a lie) | Yes | Yes |
| GPU utilization | 40-60% | 85-95% | 85-95% |
| Wasted GPU spend/month | ~$10k | ~$0 | ~$0 |
| True monthly cost | ~$11,190 | ~$27,360 | ~$1,100 |
| Works with non-AWS data? | S3-compatible only | AWS only | Any cloud |
| Locked to AWS? | Yes | Yes | No |
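The "true monthly cost" row is just the storage bill plus idle-GPU waste. A quick sketch using the scenario's numbers:

```python
# True monthly cost = storage/access bill + GPU spend burned at idle.
# Figures from the comparison above; idle rates are this post's estimates.
def true_cost(bill: float, gpu_spend: float, idle_fraction: float) -> float:
    return bill + gpu_spend * idle_fraction

GPU_SPEND = 25_000  # monthly GPU spend for the 8x H100 cluster

s3_only = true_cost(1_190, GPU_SPEND, 0.40)  # 40% idle, the low end
efs = true_cost(27_360, GPU_SPEND, 0.0)      # GPUs stay fed
pipes = true_cost(1_100, GPU_SPEND, 0.0)

print(s3_only, efs, pipes)  # 11190.0 27360.0 1100.0
```

The takeaway: the cheapest storage bill is not the cheapest option once stalled GPUs are on the ledger.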

Why the EFS Bill Is So High

EFS prices storage by how much you store in the filesystem, not by your hot working set. If your dataset is 50 TB, you pay filesystem rates on all 50 TB, even if 80% of it is rarely touched.

Caching gateways flip this model. Object storage handles the cheap bulk capacity. NVMe-backed cache handles the hot read path. You pay cold-storage rates for 50 TB and cache-accelerator rates for 8 TB.
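To make the flip concrete, here is a toy two-tier cost model. The cold rate matches typical object-storage pricing; the NVMe cache rate is an illustrative assumption, not Training Pipes' actual pricing:

```python
# Two-tier model: cold object storage for everything,
# NVMe cache priced only for the hot working set.
COLD_PER_GB = 0.015  # $/GB-month, typical object-storage rate
HOT_PER_GB = 0.08    # $/GB-month, hypothetical NVMe cache rate (assumption)

def tiered_monthly(total_tb: float, hot_tb: float) -> float:
    return total_tb * 1000 * COLD_PER_GB + hot_tb * 1000 * HOT_PER_GB

flat_efs = 50 * 1024 * 0.30     # whole dataset at filesystem rates
tiered = tiered_monthly(50, 8)  # bulk at cold rates, 8 TB cached hot

print(round(flat_efs), round(tiered))  # 15360 1390
```

The exact cache rate matters less than the shape of the model: the expensive tier scales with the 8 TB you actually read, not the 50 TB you keep.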

Caveats and Fine Print

  • EFS IA (Infrequent Access) cuts storage costs significantly (~$0.025/GB-month) but adds a per-GB charge on every access. For training workloads that reread data, this often makes the bill worse.
  • FSx for Lustre offers higher throughput than EFS but is even more expensive per provisioned TB and is designed for HPC, not general ML.
  • Training Pipes Pro tier covers the workload in this example; larger deployments may need Enterprise. Our pricing page has current numbers.
  • Your mileage will vary. The actual crossover point depends on your dataset's hot/cold ratio, your run cadence, and your cluster region.

The Broader Point

EFS is a decent managed NFS filesystem. It's a poor cache. Most ML workloads have a small hot working set relative to their full dataset, and they benefit enormously from caching infrastructure that's priced accordingly.

The architecture you want is:

  • Durable tier: object storage, priced per-GB at cold rates
  • Hot tier: NVMe-backed cache sized to working set
  • Access layer: NFS or SMB over the cache

Training Pipes bundles the last two into a managed service. Your canonical data can live in our managed buckets, in your existing S3 bucket via BYO, or anywhere S3-compatible. The gateway + cache + protocol layer is the same.

See the math for your workload →