Sunday, April 26, 2026 · 4 min read

A Practical Guide to Mounting Cloud Storage for GPU Training

Training Pipes Team

You have GPUs. You have data in the cloud. You need the first to read the second without falling over. This is a practical walkthrough of how to make that happen in 2026, without the usual duct tape.

The Goal

At the end of this guide, your training script will do this:

from torch.utils.data import DataLoader

dataset = MyDataset("/mnt/training-data/imagenet-shards")  # your Dataset class
loader = DataLoader(dataset, num_workers=16, batch_size=512)

And the bytes will arrive fast enough to saturate your GPU, even if the "real" copy of the data lives in object storage hundreds of miles away.

The Three Bad Options

Before the good path, here's what to avoid.

Option A: Copy Everything to Local Disk

aws s3 cp --recursive s3://my-bucket/imagenet /mnt/local/imagenet

Works until: your dataset exceeds local disk, your job startup time exceeds your run time, or you have dozens of datasets you might train on.

Option B: s3fs / goofys

s3fs my-bucket /mnt/s3 -o iam_role=auto

Works until: your DataLoader hammers it with concurrent small reads. POSIX semantics are a lie. Tail latencies crater GPU utilization. See our deep dive on FUSE alternatives.

Option C: Managed NFS (EFS, FSx)

Real POSIX. Real performance. Real bill. You pay for provisioned capacity regardless of hot-working-set size, and the data still has to get there from wherever it originated.

The Good Path: Caching Gateway + NFS Mount

This is the pattern that works in production. We'll do it with Training Pipes, but the concepts apply to any equivalent caching layer.

Step 1: Create a Bucket

If you already have data in S3, GCS, or an S3-compatible store, you'll use a BYO connection (see step 4). Otherwise, start with a managed bucket:

npm install -g bucketfs
bucketfs auth login

bucketfs buckets create --name training-datasets

You now have a logical bucket. Under the hood, it's backed by durable object storage.

Step 2: Upload Your Data

Any S3-compatible client works, because managed buckets expose an S3 API:

bucketfs buckets credentials training-datasets
# -> access_key, secret_key, endpoint

aws s3 --endpoint-url $ENDPOINT cp --recursive ./imagenet-shards s3://training-datasets/imagenet-shards
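The aws CLI picks up credentials from the standard environment variables, so the quickest way to wire in the output of the credentials command looks like this (the values are placeholders for your own):

export AWS_ACCESS_KEY_ID="<access_key>"
export AWS_SECRET_ACCESS_KEY="<secret_key>"
export ENDPOINT="<endpoint>"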

Step 3: Create a Regional Mount

This is the important step. You're asking the platform to spin up a gateway in a specific region — ideally the same region as your GPU cluster.

bucketfs mount create \
  --bucket training-datasets \
  --region us-east-1 \
  --protocol nfs \
  --cache-size 500GB \
  --preload "imagenet-shards/**"

The --cache-size flag sets how much local NVMe the gateway dedicates to caching. The --preload flag warms the cache before your job starts so the first epoch isn't a cache-miss storm.

Step 4: (Optional) BYO Bucket

Already have data in S3? Connect it instead of re-uploading:

bucketfs connections create \
  --name my-existing-data \
  --provider aws-s3 \
  --bucket my-company-datasets \
  --region us-east-1

bucketfs mount create \
  --connection my-existing-data \
  --region us-east-1 \
  --protocol nfs

Same mount experience, your existing bucket as the backing store.
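One thing to check before the connection works: whatever IAM role or credentials you hand the platform needs read access to the bucket. A minimal policy sketch, using the bucket name from the example above (how you attach it depends on your AWS setup):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-company-datasets",
        "arn:aws:s3:::my-company-datasets/*"
      ]
    }
  ]
}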

Step 5: Mount on Your GPU Nodes

The mount create command gives you connection details. On each GPU node:

sudo mkdir -p /mnt/training-data
sudo mount -t nfs4 $NFS_SERVER:/training-datasets /mnt/training-data

(Training Pipes wraps the connection in WireGuard so you're not exposing NFS to the public internet.)
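If throughput looks low, standard Linux NFS client options are the first knob to turn. A sketch, assuming a kernel with nconnect support (5.3+); the exact values are worth benchmarking on your own nodes:

# Multiple TCP connections plus 1MB read/write sizes
sudo mount -t nfs4 -o nconnect=16,rsize=1048576,wsize=1048576,noatime \
  $NFS_SERVER:/training-datasets /mnt/training-data

# Confirm the mount took
df -h /mnt/training-data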

Step 6: Train

import webdataset as wds
from torch.utils.data import DataLoader

# Brace notation expands to shards 000000.tar through 001023.tar
dataset = (
    wds.WebDataset("/mnt/training-data/imagenet-shards/{000000..001023}.tar")
    .decode("torchrgb")           # JPEG bytes -> CHW float tensors
    .to_tuple("jpg", "cls")       # (image, label) pairs
    # insert your resize/augment transforms here so samples collate cleanly
)
loader = DataLoader(
    dataset,
    num_workers=16,
    batch_size=512,
    pin_memory=True,
    prefetch_factor=4,
)

for images, labels in loader:
    ...

That's it. From your script's perspective it's just a filesystem. From the network's perspective, hot reads hit local NVMe cache and cold reads fall through to the object store.

Tuning Tips

Match Gateway Region to GPU Region

This is the single biggest lever. A gateway in the same region as your GPUs is the difference between sub-millisecond reads and 40ms S3 round-trips.
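A quick sanity check from a GPU node, assuming the gateway address from mount create answers ICMP through the tunnel:

ping -c 5 $NFS_SERVER
# single-digit ms or better: you're in the right region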

Use Preload for Deterministic Datasets

If you know the files your job will touch (most supervised training does), preload them. This eliminates cold-cache thrashing on the first epoch.
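A crude way to confirm the preload took, using the shard layout from Step 6: a warm read should run at NVMe speed, not object-store speed.

time cat /mnt/training-data/imagenet-shards/000000.tar > /dev/null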

Size the Cache to the Working Set, Not the Dataset

If your dataset is 10TB but any given epoch only touches 2TB of shards, a 2.5TB cache is enough. The cache is a hot-set accelerator, not a full copy.
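The arithmetic is worth keeping next to your mount config. A sketch with the numbers from above, plus some headroom for metadata and eviction churn (the 1.25 factor is a rule of thumb, not a benchmark):

# dataset: 10 TB total, but one epoch touches 2 TB of shards
epoch_working_set_tb = 2.0
headroom = 1.25  # slack for metadata and eviction churn

cache_size_tb = epoch_working_set_tb * headroom
print(f"cache size: {cache_size_tb:.2f} TB")  # -> 2.50 TB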

Use WebDataset or TFRecord Sharding

Many small files still work, but large sequentially-read shards (WebDataset .tar or TFRecords) play much nicer with any caching layer.
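If your data is currently a pile of loose files, the webdataset library's ShardWriter will pack it. A minimal sketch; the raw/ layout and the label scheme are placeholders for your own pipeline:

import os
import webdataset as wds

# Pack loose JPEGs into sequential .tar shards of ~10k samples each
with wds.ShardWriter("imagenet-shards/%06d.tar", maxcount=10_000) as sink:
    for i, name in enumerate(sorted(os.listdir("raw/"))):
        with open(os.path.join("raw", name), "rb") as f:
            sink.write({
                "__key__": os.path.splitext(name)[0],
                "jpg": f.read(),      # raw JPEG bytes, decoded at read time
                "cls": i % 1000,      # placeholder label; use your manifest
            })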

Don't Fight the DataLoader

num_workers should be 2-4× your GPU count. prefetch_factor=4 gives the pipeline room to absorb latency spikes. pin_memory=True helps the PCIe transfer.
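Before a long run, time the loader by itself; if it can't outrun your model's samples-per-second, fix the input pipeline before touching the model. A sketch reusing loader from Step 6 (the 50,000-sample horizon is arbitrary):

import time

start, n = time.perf_counter(), 0
for images, labels in loader:
    n += images.shape[0]
    if n >= 50_000:
        break
print(f"{n / (time.perf_counter() - start):,.0f} samples/sec")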

What You Just Avoided

  • Building a caching layer yourself
  • Paying for 10TB of EFS when you only touch 2TB at a time
  • Debugging FUSE kernel panics at 3am
  • Copying data into ephemeral disk on every job start
  • Cross-region egress on every epoch

Ready to Try It?

Create an account, install the CLI, and you'll have a mounted bucket in under five minutes.