Wednesday, May 20, 2026
PyTorch DataLoader Storage Benchmarks: Throughput That Actually Matters
When storage vendors publish benchmarks, you get pretty numbers: "12 GB/s throughput!" "Million IOPS!" Then you plug the storage into an actual PyTorch training loop and the GPU utilization is 50%. The disconnect is that synthetic benchmarks measure things that don't predict DataLoader performance.
This is a practical guide to benchmarking storage the way DataLoader actually uses it.
Why dd and fio Lie
Classic benchmarks drive sequential reads on a single file from a single process. That's a reasonable proxy for a backup job or a log tail. It's a terrible proxy for a DataLoader, which looks more like:
- Many workers (typically 2-4× GPU count)
- Small-to-medium reads (1KB – 1MB, depending on dataset format)
- Random file access within shards
- Epoch-level repetition over the same dataset
- Bursty concurrency at batch boundaries
A storage layer that wins fio can lose badly on DataLoader workloads if it can't handle high concurrency with small reads.
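You can see the gap yourself with a quick microbenchmark. The sketch below (the shard directory, read size, and counts are placeholders, not anything from a real config) compares a fio/dd-style single-stream sequential read against many concurrent small random reads over the same shard files; the second pattern is much closer to what a DataLoader generates.

import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SHARD_DIR = Path("/mnt/data")     # placeholder mount point for the shard files
READ_SIZE = 256 * 1024            # roughly one JPEG-sized sample
NUM_READS = 2_000
NUM_WORKERS = 16                  # mirrors a typical DataLoader num_workers

shards = sorted(SHARD_DIR.glob("train-*.tar"))

def random_read(_):
    """One small read at a random offset in a random shard."""
    shard = random.choice(shards)
    offset = random.randrange(max(1, shard.stat().st_size - READ_SIZE))
    with open(shard, "rb") as f:
        f.seek(offset)
        return len(f.read(READ_SIZE))

start = time.time()
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    total = sum(pool.map(random_read, range(NUM_READS)))
print(f"concurrent small reads: {total / (time.time() - start) / 1e6:.0f} MB/s")

# fio/dd-style baseline: one process, one file, big sequential reads.
start = time.time()
n = 0
with open(shards[0], "rb") as f:
    while chunk := f.read(8 * 1024 * 1024):
        n += len(chunk)
print(f"single sequential read: {n / (time.time() - start) / 1e6:.0f} MB/s")

On most network-backed storage the second number is far higher than the first, which is exactly the mismatch the rest of this post is about.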
What to Measure
1. Samples Per Second at Full GPU Concurrency
The only number that really matters. Spin up a training script with a no-op model that just consumes batches:
import time

from torch.utils.data import DataLoader
from torchvision import transforms
from webdataset import WebDataset

def identity(x):
    return x

# Decode JPEGs to fixed-size tensors so the default collate can stack them.
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

ds = (
    WebDataset("/mnt/data/train-{000000..001023}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, identity)
)
loader = DataLoader(
    ds,
    batch_size=512,
    num_workers=16,
    prefetch_factor=4,
    pin_memory=True,
)

start = time.time()
count = 0
for batch in loader:
    count += batch[0].shape[0]  # images in this batch
    if count >= 100_000:
        break
elapsed = time.time() - start
print(f"{count / elapsed:.0f} samples/sec")
Run this against each storage option you're considering. Match the num_workers, batch_size, and shard layout to your real training config.
2. Time-to-First-Batch
How long from process start to first batch ready? This matters for short jobs and for frequent restarts. Cold-cache storage layers are often 10-100× slower here than warm ones.
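A minimal way to measure it is to start the clock at the top of the script and stop it after the first next() on the loader. The sketch below reuses the same pipeline shape as the samples/sec script above; the path and settings are placeholders you'd swap for your own.

import time

from torch.utils.data import DataLoader
from torchvision import transforms
from webdataset import WebDataset

t0 = time.time()  # capture as early in the process as possible

def identity(x):
    return x

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ds = (
    WebDataset("/mnt/data/train-{000000..001023}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, identity)
)
loader = DataLoader(ds, batch_size=512, num_workers=16, prefetch_factor=4, pin_memory=True)

# Worker spawn, shard opens, the first reads, and the first decode all land in this one call.
first_batch = next(iter(loader))
print(f"time to first batch: {time.time() - t0:.1f}s")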
3. Cache Hit Ratio Over an Epoch
If your storage has a cache (local NVMe, Training Pipes gateway, etc.), measure the hit ratio after one full epoch. For a well-sized cache you want >95%. If it's lower, either the cache is too small or the eviction policy is wrong for your access pattern.
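If your cache layer exposes hit/miss counters, use those directly. If it doesn't, one rough proxy (a sketch only; the 2ms "fast read" threshold below is an assumption you'd tune to your own local-cache latency) is to time the same set of reads twice and count how many second-pass reads come back at local-cache speed.

import time
from pathlib import Path

FAST_READ_S = 0.002   # assumed "served from local cache" threshold; tune to your NVMe

def time_reads(paths, size=1 << 20):
    """Read the first `size` bytes of each file and record per-file latency."""
    latencies = []
    for p in paths:
        start = time.perf_counter()
        with open(p, "rb") as f:
            f.read(size)
        latencies.append(time.perf_counter() - start)
    return latencies

shards = sorted(Path("/mnt/data").glob("train-*.tar"))   # placeholder path

first_pass = time_reads(shards)    # populates the cache
second_pass = time_reads(shards)   # should mostly hit it

hits = sum(1 for lat in second_pass if lat < FAST_READ_S)
print(f"approximate hit ratio after one pass: {hits / len(second_pass):.1%}")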
4. Tail Latency on Individual Reads
Average read latency is misleading. A storage layer that averages 2ms reads but occasionally blocks for 200ms will stall your DataLoader more than one that averages 5ms reads with a flat distribution. Measure P99 and P99.9 on individual file reads under concurrent load.
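Here's one way to collect that distribution, assuming the dataset is visible as shard files under a mount point; the path, read size, concurrency, and sample count are all placeholders.

import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

files = sorted(Path("/mnt/data").glob("train-*.tar"))
READ_SIZE = 256 * 1024   # roughly one sample

def timed_read(_):
    p = random.choice(files)
    offset = random.randrange(max(1, p.stat().st_size - READ_SIZE))
    start = time.perf_counter()
    with open(p, "rb") as f:
        f.seek(offset)
        f.read(READ_SIZE)
    return time.perf_counter() - start

# Keep the device under realistic concurrency while you sample latencies.
with ThreadPoolExecutor(max_workers=32) as pool:
    lats = sorted(pool.map(timed_read, range(10_000)))

mean = sum(lats) / len(lats)
p99 = lats[int(len(lats) * 0.99)]
p999 = lats[int(len(lats) * 0.999)]
print(f"mean {mean*1e3:.1f}ms  p99 {p99*1e3:.1f}ms  p99.9 {p999*1e3:.1f}ms")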
5. Cross-Epoch Cost
Most training runs sweep the dataset multiple times. Measure the total egress / request cost across an entire run, not a single epoch. Caching architectures look dramatically better here because later epochs are nearly free.
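A back-of-the-envelope sketch makes the difference concrete; every price and quantity below is a placeholder, not a real quote, so plug in your own dataset size, request counts, and provider pricing.

# All numbers are hypothetical placeholders.
DATASET_GB = 1_500
REQUESTS_PER_EPOCH = 1_300_000          # roughly one GET per sample for small objects
EPOCHS = 30

PRICE_PER_GB_TRANSFER = 0.09            # $/GB, placeholder
PRICE_PER_1K_REQUESTS = 0.0004          # $/1000 requests, placeholder

per_epoch = (
    DATASET_GB * PRICE_PER_GB_TRANSFER
    + REQUESTS_PER_EPOCH / 1000 * PRICE_PER_1K_REQUESTS
)

no_cache = EPOCHS * per_epoch           # every epoch pays full freight
with_cache = per_epoch                  # only the cold first epoch hits the backend

print(f"no cache:   ${no_cache:,.0f}")
print(f"with cache: ${with_cache:,.0f}")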
Typical Results
To give you a feel, here's what we typically see across common setups for a WebDataset-backed ImageNet-scale workload on an 8×H100 node, with compute and storage both in us-east-1:
| Setup | Samples/sec | Time-to-first-batch | Effective cost/epoch |
|---|---|---|---|
| s3fs + S3 Standard | 1,800 | 45s | High (per-request + egress) |
| Local NVMe copy | 14,000 | 8 min (copy) + 3s | Low, but ephemeral |
| EFS Standard | 9,500 | 6s | Very high (capacity-priced) |
| Training Pipes gateway (cold) | 7,200 | 4s | Low |
| Training Pipes gateway (warm) | 13,800 | 3s | Near-zero |
A few things worth noting:
- s3fs is the bottleneck, not S3 itself. The DataLoader can't saturate S3's real throughput through a FUSE daemon.
- Local NVMe is fast once primed. It's the eight-minute cold start that kills it.
- EFS is surprisingly competitive on throughput; it loses on cost, not speed.
- A caching gateway matches local NVMe after warmup but without the copy step.
(Your numbers will vary with dataset size, shard format, region, and workload specifics. Don't trust these — measure your own.)
Mistakes to Avoid When Benchmarking
Benchmarking a Fake Workload
Don't benchmark with num_workers=1 and batch_size=1. That measures a pathological access pattern that no real training job uses.
Ignoring Warm State
Run each benchmark at least twice. The first run measures cold cache + whatever warm-up your storage does. The second run measures steady state. Publishing only the first is unfair; publishing only the second is misleading.
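A simple way to keep yourself honest is to run the same pass twice in one script and report both numbers side by side. This sketch counts raw samples without decoding, since the point here is storage behaviour rather than CPU work; the path and counts are placeholders, and the cache-dropping command in the comment needs root.

import time

from torch.utils.data import DataLoader
from webdataset import WebDataset

# Raw bytes are enough here; we only care about cold vs warm storage behaviour.
ds = WebDataset("/mnt/data/train-{000000..001023}.tar").to_tuple("jpg", "cls")
loader = DataLoader(ds, batch_size=None, num_workers=16)

def one_pass(max_samples=100_000):
    start, n = time.time(), 0
    for _ in loader:
        n += 1
        if n >= max_samples:
            break
    return n / (time.time() - start)

# For a truly cold first run on local disks, drop the OS page cache beforehand:
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
cold = one_pass()   # first pass: cold caches, warm-up costs included
warm = one_pass()   # second pass: steady state
print(f"cold: {cold:.0f} samples/sec  warm: {warm:.0f} samples/sec")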
Measuring in the Wrong Region
If your benchmark host is in us-east-1 and your storage is in us-west-2, you're measuring cross-region latency, not the storage itself. Match the regions to what you'd deploy in production.
Forgetting the CPU
DataLoader throughput is sometimes bottlenecked on CPU-side decode (JPEG decode, tensor conversion), not storage. If the CPU is pegged before you max out the disk, you're benchmarking the CPU. Profile with py-spy or cProfile to be sure.
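A quick check that doesn't need a profiler: pull a few thousand raw JPEGs fully into memory and time decode alone. If pure-decode throughput, multiplied by your worker count, is close to your end-to-end number, the CPU is the ceiling rather than storage. The shard path and sample count in this sketch are placeholders.

import io
import tarfile
import time

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# Load raw JPEG payloads fully into memory so storage is out of the picture.
raw = []
with tarfile.open("/mnt/data/train-000000.tar") as tar:
    for member in tar:
        if member.name.endswith(".jpg"):
            raw.append(tar.extractfile(member).read())
        if len(raw) >= 5_000:
            break

start = time.time()
for payload in raw:
    preprocess(Image.open(io.BytesIO(payload)).convert("RGB"))
elapsed = time.time() - start

# Multiply by num_workers to compare against the end-to-end DataLoader number.
print(f"pure decode: {len(raw) / elapsed:.0f} samples/sec per process")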
How Training Pipes Does on These Benchmarks
The short version: warm-cache Training Pipes matches local NVMe for DataLoader throughput, and the cold-cache case is ~50% of warm — which is fine because training jobs normally run multiple epochs, and only the first is cold.
The longer version: we publish our own numbers and update them with each release. We also let you measure your own workload against our platform on the free tier, which is the only benchmark we actually trust.