Tuesday, May 12, 2026 · 4 min read

How Regional Caching Gateways Cut ML Data Loading Time by 10x

Training Pipes Team

"Our GPUs are only 40% utilized" is the single most common complaint we hear from ML platform teams. The GPUs aren't broken. The compute isn't the bottleneck. The data pipeline can't keep up.

A regional caching gateway — a shared cache colocated with your training cluster that sits between the compute and your object storage — is the highest-leverage fix for this problem. This post is the architectural walkthrough of why it works.

The Latency Budget

An H100 doing a forward+backward pass on a reasonable batch is ready for the next batch in ~100-300ms. If the DataLoader can't produce a batch in that window, the GPU waits.
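
A quick way to see whether you're in this regime is to time the gap between finishing one step and receiving the next batch. A minimal sketch, with toy stand-ins for the model and dataset:

import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins; a real job would use its own model, dataset, and device.
dataset = TensorDataset(torch.randn(4096, 128), torch.randint(0, 10, (4096,)))
loader = DataLoader(dataset, batch_size=256, num_workers=2)
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

t_ready = time.perf_counter()
for x, y in loader:
    # Gap between "ready for the next batch" and "batch arrived" is pure data stall.
    data_wait_ms = (time.perf_counter() - t_ready) * 1e3
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    t_ready = time.perf_counter()
    print(f"data wait: {data_wait_ms:.1f} ms")

If the data wait regularly exceeds your step time, the pipeline, not the GPU, is the bottleneck.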

Now look at where the bytes come from:

Source               First-byte latency
Local DRAM           ~100 ns
Local NVMe           ~50-100 µs
Same-AZ NFS server   ~0.5-2 ms
Same-region S3       ~20-60 ms
Cross-region S3      ~60-200 ms

Each step down that table is a 10-1000× latency jump. A DataLoader hitting S3 directly spends most of its time waiting on first-byte latency, not transferring bytes.
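
You can verify the gap yourself. A rough sketch, assuming boto3 with credentials configured in the environment; the bucket, key, and local path are placeholders to substitute with your own:

import time

import boto3  # assumes AWS credentials are configured

def first_byte_ms(read_fn):
    t0 = time.perf_counter()
    read_fn()
    return (time.perf_counter() - t0) * 1e3

s3 = boto3.client("s3")

# Placeholder bucket/key/path; point these at a real shard on your side.
s3_ms = first_byte_ms(lambda: s3.get_object(
    Bucket="my-datasets", Key="imagenet/train-000000.tar",
    Range="bytes=0-0")["Body"].read())
local_ms = first_byte_ms(lambda: open("/mnt/nvme/train-000000.tar", "rb").read(1))

print(f"S3 first byte:   {s3_ms:8.2f} ms")
print(f"NVMe first byte: {local_ms:8.3f} ms")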

What a Regional Caching Gateway Does

Picture a server sitting inside your compute region, with a pile of NVMe. It has two jobs:

  1. Speak a filesystem protocol to clients. NFSv4 or SMB. Your training nodes mount it like any other NFS server.
  2. Translate to object storage on the backend. Fetch from S3 on a cache miss; write through on uploads.

Training nodes see a filesystem. The gateway sees NFS requests and figures out which ones can be served from its NVMe cache and which ones need to fall through to object storage.

[GPU node] --NFSv4--> [regional gateway w/ NVMe cache] --S3 API--> [object storage]
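
The fall-through logic, in miniature. This is a toy sketch of the read path, not the gateway's actual implementation; CACHE_DIR and the bucket/key are placeholders:

import os

import boto3

CACHE_DIR = "/nvme/cache"  # placeholder path for the gateway's local NVMe
s3 = boto3.client("s3")

def gateway_read(bucket: str, key: str) -> bytes:
    """Serve from NVMe cache if possible; otherwise fall through to S3."""
    cache_path = os.path.join(CACHE_DIR, bucket, key)
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:        # cache hit: ~50µs local read
            return f.read()
    obj = s3.get_object(Bucket=bucket, Key=key)  # cache miss: ~50ms S3 fetch
    data = obj["Body"].read()
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    with open(cache_path, "wb") as f:            # populate cache for every later reader
        f.write(data)
    return data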

The regional part matters. The gateway lives in the same cloud region (often the same availability zone) as your GPUs. That means the GPU-to-gateway hop is sub-millisecond over the datacenter network, even though the underlying object storage might be in a different region or a different provider entirely.

Where the 10x Comes From

Three compounding effects.

1. Cache Hits Skip S3 Entirely

Once a shard is in the gateway's NVMe cache, every subsequent read from any client is a local NVMe read: 50µs instead of 50ms. For a training job that sweeps the dataset on every epoch, this is the biggest win.

2. Shared Cache Across the Cluster

In a FUSE-based setup, each GPU node has its own cache (or no cache). Eight nodes = eight independent fetches of the same shard.

With a gateway, the first node to request a shard pays the S3 fetch. The other seven read from the gateway's cache at LAN speed. That's 8× fewer S3 requests and 8× less cross-boundary bandwidth.
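
The mechanism that makes this work is request deduplication: concurrent requests for the same uncached shard collapse into a single backend fetch. A toy sketch of the idea (often called single-flight), not the gateway's actual code:

import threading

class SingleFlight:
    """Collapse concurrent fetches of the same key into one backend request."""

    def __init__(self, fetch):
        self.fetch = fetch         # e.g. a function that GETs one object from S3
        self.lock = threading.Lock()
        self.inflight = {}         # key -> Event set when the fetch lands
        self.results = {}          # key -> cached bytes

    def get(self, key):
        with self.lock:
            if key in self.results:
                return self.results[key]   # already cached: no backend traffic
            event = self.inflight.get(key)
            if event is None:
                event = self.inflight[key] = threading.Event()
                is_owner = True            # this caller pays the S3 fetch
            else:
                is_owner = False           # everyone else waits for the fill
        if is_owner:
            self.results[key] = self.fetch(key)
            event.set()
        else:
            event.wait()
        return self.results[key]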

3. Prefetch and Preload

A smart gateway prefetches ahead of the cursor on sequential reads. Training workloads read shards in a predictable order (often via WebDataset or TFRecord). If the gateway notices the pattern, it can fetch shard N+1 from S3 while you're still reading N, so N+1 is a cache hit when you get to it.
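
The core overlap is simple to express. A minimal sketch, assuming fetch is any function that retrieves one shard's bytes:

from concurrent.futures import ThreadPoolExecutor

def read_with_prefetch(shards, fetch):
    """Yield shard contents in order, fetching shard N+1 while N is consumed."""
    if not shards:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch, shards[0])
        for nxt in shards[1:]:
            data = future.result()            # wait for shard N (a hit if prefetch won)
            future = pool.submit(fetch, nxt)  # start shard N+1 before handing N back
            yield data
        yield future.result()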

Training Pipes lets you go further and explicitly preload a path prefix before a job starts:

bucketfs mount create --bucket datasets \
  --region us-east-1 \
  --preload "imagenet/train-*"

By the time training starts, the hot set is already warm.
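
On the training side nothing changes: the mount is an ordinary directory. A sketch of the consumer, assuming the webdataset package, a hypothetical mount point at /mnt/datasets, and shards whose tar members use jpg/cls keys:

import glob

import webdataset as wds  # assumes the webdataset package is installed

# Hypothetical mount point; these shards are the ones preloaded above.
shards = sorted(glob.glob("/mnt/datasets/imagenet/train-*.tar"))
dataset = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "cls")

for image, label in dataset:
    ...  # every read lands on the gateway's warm NVMe cache, not S3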

Why Not Just Use a Bigger Local Disk?

You could imagine copying the entire dataset to each GPU node's local disk at job start. People do it. It fails for three reasons:

Scale: large datasets don't fit on local disk. Or they do, but eat into the space you need for activations and checkpoints.

Cold starts: spinning up a new job means waiting for the copy to finish. On a 10TB dataset this can exceed the job runtime (10 TB over a 10 Gbps link is roughly 2.2 hours of copying per node).

Write-back: checkpoints, logs, and outputs need to go back to durable storage. Local disk is ephemeral. The gateway handles write-through automatically.

Why Not Just Put Everything in EFS?

Managed NFS services like EFS and FSx give you POSIX but not caching semantics. They price by provisioned capacity, so you pay for the entire dataset — even the cold parts you rarely read. A caching gateway pays cold-storage rates for the full dataset and NVMe-cache rates only for the hot working set. Usually an order of magnitude cheaper at scale. See our EFS cost breakdown.
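
Back-of-envelope, with placeholder rates; the prices below are assumptions for illustration, so check your provider's current pricing:

# Placeholder $/GB-month rates for illustration only.
EFS_RATE = 0.30    # assumed managed-NFS rate
S3_RATE = 0.023    # assumed object-storage rate
NVME_RATE = 0.08   # assumed NVMe cache rate

dataset_gb = 100 * 1024   # a 100 TB dataset
hot_fraction = 0.10       # share of the dataset actually read per training cycle

managed_nfs = dataset_gb * EFS_RATE
gateway = dataset_gb * S3_RATE + dataset_gb * hot_fraction * NVME_RATE

print(f"managed NFS:     ${managed_nfs:,.0f}/month")  # ~$30,720
print(f"caching gateway: ${gateway:,.0f}/month")      # ~$3,174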

Anatomy of a Well-Tuned Gateway

If you were building this from scratch, you'd need to get all of these right:

  1. NVMe-backed cache with an eviction policy that favors recently-accessed large sequential files (a minimal sketch follows this list).
  2. Parallel prefetching so sequential reads don't serialize on single-object latency.
  3. Write-through or write-back semantics for uploads, configurable per mount.
  4. Consistency model that matches what training code expects (typically read-after-write for new keys, eventual for overwrites).
  5. Observability — cache hit ratio, tail latency, backend fetch rate, per-mount throughput.
  6. Secure transport — NFS over the open internet is a bad idea. WireGuard tunnels solve this.
  7. Graceful degradation — if the cache is full or the backend is slow, keep serving what you can.
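
Even item 1 alone hides real bookkeeping. A toy sketch assuming plain LRU; a production policy would also weight object size and sequential-access patterns:

from collections import OrderedDict

class CacheIndex:
    """Toy LRU bookkeeping keyed by object path."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()   # key -> size, least recently used first

    def touch(self, key, size):
        if key in self.entries:
            self.entries.move_to_end(key)   # recently read: protect from eviction
            return
        self.entries[key] = size
        self.used += size
        while self.used > self.capacity:
            _, evicted_size = self.entries.popitem(last=False)  # evict LRU entry
            self.used -= evicted_size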

Running one of these in-house is a full-time job for an infra team.

The Product Pitch, Briefly

This is what Training Pipes operates. You create a bucket (managed by us, or connect your own), choose a region, and we run a gateway there with all the above handled. Your training code sees an NFS mount. We handle the rest.

You can choose the cache size, preload specific paths, and see hit-ratio metrics in the dashboard. When a run finishes, the cache stays warm for the next run in the same region — so repeated training cycles get progressively cheaper and faster.

Deploy a regional gateway →