Wednesday, May 20, 2026
PyTorch DataLoader Storage Benchmarks: Throughput That Actually Matters
When storage vendors publish benchmarks, you get pretty numbers: "12 GB/s throughput!" "Million IOPS!" Then you plug the storage into an actual PyTorch training loop and the GPU utilization is 50%. The disconnect is that synthetic benchmarks measure things that don't predict DataLoader performance.
This is a practical guide to benchmarking storage the way DataLoader actually uses it.
Why dd and fio Lie
Classic benchmarks drive sequential reads on a single file from a single process. That's a reasonable proxy for a backup job or a log tail. It's a terrible proxy for a DataLoader, which looks more like:
- Many workers (typically 2-4× GPU count)
- Small-to-medium reads (1KB – 1MB, depending on dataset format)
- Random file access within shards
- Epoch-level repetition over the same dataset
- Bursty concurrency at batch boundaries
A storage layer that wins fio can lose badly on DataLoader workloads if it can't handle high concurrency with small reads.
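You can see the gap yourself with a quick microbenchmark. The sketch below (the shard directory, read size, and counts are placeholders, not anything from a real config) compares a fio/dd-style single-stream sequential read against many concurrent small random reads over the same shard files; the second pattern is much closer to what a DataLoader generates.

import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SHARD_DIR = Path("/mnt/data")     # placeholder mount point for the shard files
READ_SIZE = 256 * 1024            # roughly one JPEG-sized sample
NUM_READS = 2_000
NUM_WORKERS = 16                  # mirrors a typical DataLoader num_workers

shards = sorted(SHARD_DIR.glob("train-*.tar"))

def random_read(_):
    """One small read at a random offset in a random shard."""
    shard = random.choice(shards)
    offset = random.randrange(max(1, shard.stat().st_size - READ_SIZE))
    with open(shard, "rb") as f:
        f.seek(offset)
        return len(f.read(READ_SIZE))

start = time.time()
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    total = sum(pool.map(random_read, range(NUM_READS)))
print(f"concurrent small reads: {total / (time.time() - start) / 1e6:.0f} MB/s")

# fio/dd-style baseline: one process, one file, big sequential reads.
start = time.time()
n = 0
with open(shards[0], "rb") as f:
    while chunk := f.read(8 * 1024 * 1024):
        n += len(chunk)
print(f"single sequential read: {n / (time.time() - start) / 1e6:.0f} MB/s")

On most network-backed storage the second number is far higher than the first, which is exactly the mismatch the rest of this post is about.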
What to Measure
1. Samples Per Second at Full GPU Concurrency
The only number that really matters. Spin up a training script with a no-op model that just consumes batches:
import time

from torch.utils.data import DataLoader
from torchvision import transforms
from webdataset import WebDataset

def identity(x):
    return x

# Decode JPEGs to fixed-size tensors so the default collate can stack them.
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

ds = (
    WebDataset("/mnt/data/train-{000000..001023}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, identity)
)
loader = DataLoader(
    ds,
    batch_size=512,
    num_workers=16,
    prefetch_factor=4,
    pin_memory=True,
)

start = time.time()
count = 0
for batch in loader:
    count += batch[0].shape[0]  # images in this batch
    if count >= 100_000:
        break
elapsed = time.time() - start
print(f"{count / elapsed:.0f} samples/sec")
Run this against each storage option you're considering. Match the num_workers, batch_size, and shard layout to your real training config.
2. Time-to-First-Batch
How long from process start to first batch ready? This matters for short jobs and for frequent restarts. Cold-cache storage layers are often 10-100× slower here than warm ones.
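A minimal way to measure it is to start the clock at the top of the script and stop it after the first next() on the loader. The sketch below reuses the same pipeline shape as the samples/sec script above; the path and settings are placeholders you'd swap for your own.

import time

from torch.utils.data import DataLoader
from torchvision import transforms
from webdataset import WebDataset

t0 = time.time()  # capture as early in the process as possible

def identity(x):
    return x

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
ds = (
    WebDataset("/mnt/data/train-{000000..001023}.tar")
    .decode("pil")
    .to_tuple("jpg", "cls")
    .map_tuple(preprocess, identity)
)
loader = DataLoader(ds, batch_size=512, num_workers=16, prefetch_factor=4, pin_memory=True)

# Worker spawn, shard opens, the first reads, and the first decode all land in this one call.
first_batch = next(iter(loader))
print(f"time to first batch: {time.time() - t0:.1f}s")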
3. Cache Hit Ratio Over an Epoch
If your storage has a cache (local NVMe, Training Pipes gateway, etc.), measure the hit ratio after one full epoch. For a well-sized cache you want >95%. If it's lower, either the cache is too small or the eviction policy is wrong for your access pattern.
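If your cache layer exposes hit/miss counters, use those directly. If it doesn't, one rough proxy (a sketch only; the 2ms "fast read" threshold below is an assumption you'd tune to your own local-cache latency) is to time the same set of reads twice and count how many second-pass reads come back at local-cache speed.

import time
from pathlib import Path

FAST_READ_S = 0.002   # assumed "served from local cache" threshold; tune to your NVMe

def time_reads(paths, size=1 << 20):
    """Read the first `size` bytes of each file and record per-file latency."""
    latencies = []
    for p in paths:
        start = time.perf_counter()
        with open(p, "rb") as f:
            f.read(size)
        latencies.append(time.perf_counter() - start)
    return latencies

shards = sorted(Path("/mnt/data").glob("train-*.tar"))   # placeholder path

first_pass = time_reads(shards)    # populates the cache
second_pass = time_reads(shards)   # should mostly hit it

hits = sum(1 for lat in second_pass if lat < FAST_READ_S)
print(f"approximate hit ratio after one pass: {hits / len(second_pass):.1%}")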
4. Tail Latency on Individual Reads
Average read latency is misleading. A storage layer that averages 2ms reads but occasionally blocks for 200ms will stall your DataLoader more than one that averages 5ms reads with a flat distribution. Measure P99 and P99.9 on individual file reads under concurrent load.
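Here's one way to collect that distribution, assuming the dataset is visible as shard files under a mount point; the path, read size, concurrency, and sample count are all placeholders.

import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

files = sorted(Path("/mnt/data").glob("train-*.tar"))
READ_SIZE = 256 * 1024   # roughly one sample

def timed_read(_):
    p = random.choice(files)
    offset = random.randrange(max(1, p.stat().st_size - READ_SIZE))
    start = time.perf_counter()
    with open(p, "rb") as f:
        f.seek(offset)
        f.read(READ_SIZE)
    return time.perf_counter() - start

# Keep the device under realistic concurrency while you sample latencies.
with ThreadPoolExecutor(max_workers=32) as pool:
    lats = sorted(pool.map(timed_read, range(10_000)))

mean = sum(lats) / len(lats)
p99 = lats[int(len(lats) * 0.99)]
p999 = lats[int(len(lats) * 0.999)]
print(f"mean {mean*1e3:.1f}ms  p99 {p99*1e3:.1f}ms  p99.9 {p999*1e3:.1f}ms")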
5. Cross-Epoch Cost
Most training runs sweep the dataset multiple times. Measure the total egress / request cost across an entire run, not a single epoch. Caching architectures look dramatically better here because later epochs are nearly free.
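A back-of-the-envelope sketch makes the difference concrete; every price and quantity below is a placeholder, not a real quote, so plug in your own dataset size, request counts, and provider pricing.

# All numbers are hypothetical placeholders.
DATASET_GB = 1_500
REQUESTS_PER_EPOCH = 1_300_000          # roughly one GET per sample for small objects
EPOCHS = 30

PRICE_PER_GB_TRANSFER = 0.09            # $/GB, placeholder
PRICE_PER_1K_REQUESTS = 0.0004          # $/1000 requests, placeholder

per_epoch = (
    DATASET_GB * PRICE_PER_GB_TRANSFER
    + REQUESTS_PER_EPOCH / 1000 * PRICE_PER_1K_REQUESTS
)

no_cache = EPOCHS * per_epoch           # every epoch pays full freight
with_cache = per_epoch                  # only the cold first epoch hits the backend

print(f"no cache:   ${no_cache:,.0f}")
print(f"with cache: ${with_cache:,.0f}")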
Typical Results
To give you a feel, here's what we typically see across common setups for a WebDataset-backed ImageNet-scale workload on an 8×H100 node, with compute and storage both in us-east-1:
| Setup | Samples/sec | Time-to-first-batch | Effective cost/epoch |
|---|---|---|---|
| s3fs + S3 Standard | 1,800 | 45s | High (per-request + egress) |
| Local NVMe copy | 14,000 | 8 min (copy) + 3s | Low, but ephemeral |
| EFS Standard | 9,500 | 6s | Very high (capacity-priced) |
| Training Pipes gateway (cold) | 7,200 | 4s | Low |
| Training Pipes gateway (warm) | 13,800 | 3s | Near-zero |
A few things worth noting:
- s3fs is the bottleneck, not S3 itself. The DataLoader can't saturate S3's real throughput through a FUSE daemon.
- Local NVMe is fast once primed. It's the eight-minute cold start that kills it.
- EFS is surprisingly competitive on throughput; it loses on cost, not speed.
- A caching gateway matches local NVMe after warmup but without the copy step.
(Your numbers will vary with dataset size, shard format, region, and workload specifics. Don't trust these — measure your own.)
Mistakes to Avoid When Benchmarking
Benchmarking a Fake Workload
Don't benchmark with num_workers=1 and batch_size=1. That measures a pathological access pattern that no real training job uses.
Ignoring Warm State
Run each benchmark at least twice. The first run measures cold cache + whatever warm-up your storage does. The second run measures steady state. Publishing only the first is unfair; publishing only the second is misleading.
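A simple way to keep yourself honest is to run the same pass twice in one script and report both numbers side by side. This sketch counts raw samples without decoding, since the point here is storage behaviour rather than CPU work; the path and counts are placeholders, and the cache-dropping command in the comment needs root.

import time

from torch.utils.data import DataLoader
from webdataset import WebDataset

# Raw bytes are enough here; we only care about cold vs warm storage behaviour.
ds = WebDataset("/mnt/data/train-{000000..001023}.tar").to_tuple("jpg", "cls")
loader = DataLoader(ds, batch_size=None, num_workers=16)

def one_pass(max_samples=100_000):
    start, n = time.time(), 0
    for _ in loader:
        n += 1
        if n >= max_samples:
            break
    return n / (time.time() - start)

# For a truly cold first run on local disks, drop the OS page cache beforehand:
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
cold = one_pass()   # first pass: cold caches, warm-up costs included
warm = one_pass()   # second pass: steady state
print(f"cold: {cold:.0f} samples/sec  warm: {warm:.0f} samples/sec")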
Measuring in the Wrong Region
If your benchmark host is in us-east-1 and your storage is in us-west-2, you're measuring cross-region latency, not the storage itself. Match the regions to what you'd deploy in production.
Forgetting the CPU
DataLoader throughput is sometimes bottlenecked on CPU-side decode (JPEG decode, tensor conversion), not storage. If the CPU is pegged before you max out the disk, you're benchmarking the CPU. Profile with py-spy or cProfile to be sure.
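A quick check that doesn't need a profiler: pull a few thousand raw JPEGs fully into memory and time decode alone. If pure-decode throughput, multiplied by your worker count, is close to your end-to-end number, the CPU is the ceiling rather than storage. The shard path and sample count in this sketch are placeholders.

import io
import tarfile
import time

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# Load raw JPEG payloads fully into memory so storage is out of the picture.
raw = []
with tarfile.open("/mnt/data/train-000000.tar") as tar:
    for member in tar:
        if member.name.endswith(".jpg"):
            raw.append(tar.extractfile(member).read())
        if len(raw) >= 5_000:
            break

start = time.time()
for payload in raw:
    preprocess(Image.open(io.BytesIO(payload)).convert("RGB"))
elapsed = time.time() - start

# Multiply by num_workers to compare against the end-to-end DataLoader number.
print(f"pure decode: {len(raw) / elapsed:.0f} samples/sec per process")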
How Training Pipes Does on These Benchmarks
The short version: warm-cache Training Pipes matches local NVMe for DataLoader throughput, and the cold-cache case is ~50% of warm — which is fine because training jobs normally run multiple epochs, and only the first is cold.
The longer version: we publish our own numbers and update them with each release. We also let you measure your own workload against our platform on the free tier, which is the only benchmark we actually trust.