Sunday, April 26, 2026 · 4 min read
A Practical Guide to Mounting Cloud Storage for GPU Training
You have GPUs. You have data in the cloud. You need the first to read the second without falling over. This is a practical walkthrough of how to make that happen in 2026, without the usual duct tape.
The Goal
At the end of this guide, your training script will do this:
from torch.utils.data import DataLoader

dataset = MyDataset("/mnt/training-data/imagenet-shards")
loader = DataLoader(dataset, num_workers=16, batch_size=512)
And the bytes will arrive fast enough to saturate your GPU, even if the "real" copy of the data lives in object storage hundreds of miles away.
The Three Bad Options
Before the good path, here's what to avoid.
Option A: Copy Everything to Local Disk
aws s3 cp --recursive s3://my-bucket/imagenet /mnt/local/imagenet
Works until: your dataset exceeds local disk, your job startup time exceeds your run time, or you have dozens of datasets you might train on.
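Whether Option A is even viable is simple arithmetic: if pulling the dataset takes longer than the training run, you've lost before the first batch. A quick back-of-envelope sketch (all numbers are illustrative assumptions, not measurements):

```python
def copy_time_hours(dataset_tb: float, link_gbps: float) -> float:
    """Hours to pull the full dataset at a sustained network rate."""
    dataset_gbit = dataset_tb * 8 * 1000  # TB -> gigabits
    return dataset_gbit / link_gbps / 3600

# A 10 TB dataset over a sustained 10 Gbit/s link:
startup = copy_time_hours(10, 10)
print(f"copy takes {startup:.1f} h")  # over two hours before the first batch
```

If that startup number rivals your job's wall-clock time, Option A is already off the table.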
Option B: s3fs / goofys
s3fs my-bucket /mnt/s3 -o iam_role=auto
Works until: your DataLoader hammers it with concurrent small reads. POSIX semantics are a lie. Tail latencies crater GPU utilization. See our deep dive on FUSE alternatives.
Option C: Managed NFS (EFS, FSx)
Real POSIX. Real performance. Real bill. You pay for provisioned capacity regardless of hot-working-set size, and the data still has to get there from wherever it originated.
The Good Path: Caching Gateway + NFS Mount
This is the pattern that works in production. We'll do it with Training Pipes, but the concepts apply to any equivalent caching layer.
Step 1: Create a Bucket
If you already have data in S3, GCS, or an S3-compatible store, you'll use a BYO connection (see step 4). Otherwise, start with a managed bucket:
npm install -g bucketfs
bucketfs auth login
bucketfs buckets create --name training-datasets
You now have a logical bucket. Under the hood, it's backed by durable object storage.
Step 2: Upload Your Data
Any S3-compatible client works, because managed buckets expose an S3 API:
bucketfs buckets credentials training-datasets
# -> access_key, secret_key, endpoint
aws s3 --endpoint-url $ENDPOINT cp --recursive ./imagenet s3://training-datasets/imagenet
Step 3: Create a Regional Mount
This is the important step. You're asking the platform to spin up a gateway in a specific region — ideally the same region as your GPU cluster.
bucketfs mount create \
  --bucket training-datasets \
  --region us-east-1 \
  --protocol nfs \
  --cache-size 500GB \
  --preload "imagenet/**"
The --cache-size flag sets how much local NVMe the gateway dedicates to caching. The --preload flag warms the cache before your job starts, so the first epoch isn't a cache-miss storm.
Step 4: (Optional) BYO Bucket
Already have data in S3? Connect it instead of re-uploading:
bucketfs connections create \
  --name my-existing-data \
  --provider aws-s3 \
  --bucket my-company-datasets \
  --region us-east-1

bucketfs mount create \
  --connection my-existing-data \
  --region us-east-1 \
  --protocol nfs
Same mount experience, your existing bucket as the backing store.
Step 5: Mount on Your GPU Nodes
The mount create command gives you connection details. On each GPU node:
sudo mkdir -p /mnt/training-data
sudo mount -t nfs4 $NFS_SERVER:/training-datasets /mnt/training-data
(Training Pipes wraps the connection in WireGuard so you're not exposing NFS to the public internet.)
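Before launching a job, it's worth a sanity check that the path is really NFS-backed and not silently falling back to an empty local directory. A minimal, Linux-specific sketch that parses /proc/mounts (the helper name is ours, not part of any CLI):

```python
def is_nfs_mount(path: str, mounts_text: str) -> bool:
    """True if `path` appears in /proc/mounts-style text as an nfs/nfs4 mount.

    Each line of /proc/mounts is: device mountpoint fstype options dump pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == path and fields[2].startswith("nfs"):
            return True
    return False
```

Usage: `is_nfs_mount("/mnt/training-data", open("/proc/mounts").read())` as a pre-flight assertion in your launch script.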
Step 6: Train
import webdataset as wds
from torch.utils.data import DataLoader

dataset = wds.WebDataset("/mnt/training-data/imagenet-shards/{000000..001023}.tar")
loader = DataLoader(
    dataset,
    num_workers=16,
    batch_size=512,
    pin_memory=True,
    prefetch_factor=4,
)

for batch in loader:
    ...
That's it. From your script's perspective it's just a filesystem. From the network's perspective, hot reads hit local NVMe cache and cold reads fall through to the object store.
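Conceptually, the gateway is a read-through cache: hits are served locally, misses fall through to the object store and are kept for next time. A toy sketch of that behavior, as an LRU keyed by object with an item-count capacity standing in for --cache-size (all names here are illustrative, not the gateway's actual internals):

```python
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, fetch_from_origin, capacity: int):
        self.fetch = fetch_from_origin   # cold path: the object-store read
        self.capacity = capacity         # stand-in for local NVMe budget
        self.store = OrderedDict()       # hot path: cached objects, LRU order

    def read(self, key):
        if key in self.store:                # cache hit: no origin round-trip
            self.store.move_to_end(key)
            return self.store[key]
        data = self.fetch(key)               # cache miss: fall through to origin
        self.store[key] = data               # keep it warm for the next epoch
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least-recently-used object
        return data
```

The production version tracks bytes rather than object counts, but the access pattern your DataLoader sees is the same: first epoch pays origin latency on misses, later epochs mostly hit.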
Tuning Tips
Match Gateway Region to GPU Region
This is the single biggest lever. A gateway in the same region as your GPUs is the difference between sub-millisecond reads and 40ms S3 round-trips.
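Rough numbers make the point. Assuming 1,024 shard reads per epoch that prefetching doesn't hide, at 40 ms per cross-region read versus 0.5 ms per same-region cached read (both figures illustrative):

```python
READS_PER_EPOCH = 1024  # assumed: one read per shard, not hidden by prefetch

def epoch_read_latency_seconds(rtt_ms: float) -> float:
    """Total per-epoch time spent waiting on storage round-trips."""
    return READS_PER_EPOCH * rtt_ms / 1000

cross_region = epoch_read_latency_seconds(40.0)  # ~41 s of stall per epoch
same_region = epoch_read_latency_seconds(0.5)    # ~0.5 s per epoch
```

Multiply the difference by hundreds of epochs and the region mismatch alone can dominate your bill for idle GPU time.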
Use Preload for Deterministic Datasets
If you know the files your job will touch (most supervised training does), preload them. This eliminates cold-cache thrashing on the first epoch.
Size the Cache to the Working Set, Not the Dataset
If your dataset is 10TB but any given epoch only touches 2TB of shards, a 2.5TB cache is enough. The cache is a hot-set accelerator, not a full copy.
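A simple way to pick the number: sum the shards one epoch actually reads, then add headroom. The 1.25× headroom factor below is a working assumption, not a measured constant:

```python
import os

def working_set_bytes(shard_paths) -> int:
    """Total size of the shards a single epoch will actually read."""
    return sum(os.path.getsize(p) for p in shard_paths)

def recommended_cache_bytes(shard_paths, headroom: float = 1.25) -> int:
    """Working set plus slack for metadata and imperfect eviction."""
    return int(working_set_bytes(shard_paths) * headroom)
</antml>```

Run it over the shard list your sampler will touch, not the whole bucket; that's the entire point of sizing to the working set.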
Use WebDataset or TFRecord Sharding
Many small files still work, but large sequentially-read shards (WebDataset .tar or TFRecords) play much nicer with any caching layer.
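For illustration, here's shard packing sketched with only the Python standard library; in practice you'd reach for a purpose-built tool like webdataset's ShardWriter. The 512 MB target shard size is an assumption, not a requirement:

```python
import os
import tarfile

def write_shards(files, out_dir, max_shard_bytes=512 * 1024 * 1024):
    """Pack many small files into sequentially numbered .tar shards."""
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, shard_bytes, tar = 0, 0, None
    for path in files:
        size = os.path.getsize(path)
        # Start a new shard on the first file or when the current one is full.
        if tar is None or shard_bytes + size > max_shard_bytes:
            if tar is not None:
                tar.close()
            tar = tarfile.open(
                os.path.join(out_dir, f"{shard_idx:06d}.tar"), "w")
            shard_idx, shard_bytes = shard_idx + 1, 0
        tar.add(path, arcname=os.path.basename(path))
        shard_bytes += size
    if tar is not None:
        tar.close()
```

The resulting `000000.tar`, `000001.tar`, … naming matches the brace-expansion pattern WebDataset consumes in Step 6.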
Don't Fight the DataLoader
num_workers should be 2-4× your GPU count. prefetch_factor=4 gives the pipeline room to absorb latency spikes. pin_memory=True helps the PCIe transfer.
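Those rules of thumb, collected into one helper. The multipliers come straight from this guide; treat them as starting points to profile against your own pipeline, not universal constants:

```python
def suggest_loader_kwargs(gpu_count: int, workers_per_gpu: int = 4) -> dict:
    """DataLoader keyword args from this guide's rules of thumb."""
    return {
        "num_workers": gpu_count * workers_per_gpu,  # 2-4x GPU count
        "prefetch_factor": 4,  # slack to absorb storage latency spikes
        "pin_memory": True,    # faster host-to-device (PCIe) copies
    }
```

Usage: `DataLoader(dataset, batch_size=512, **suggest_loader_kwargs(8))` for an 8-GPU node.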
What You Just Avoided
- Building a caching layer yourself
- Paying for 10TB of EFS when you only touch 2TB at a time
- Debugging FUSE kernel panics at 3am
- Copying data into ephemeral disk on every job start
- Cross-region egress on every epoch
Ready to Try It?
Create an account, install the CLI, and you'll have a mounted bucket in under five minutes.