Sunday, April 19, 2026·4 min read
Why Object Storage Isn't Enough for Modern ML Training
If you've spent any time training models on cloud infrastructure, you've had this conversation: "the data lives in S3, but training needs a filesystem." What follows is usually a week of yak-shaving that has nothing to do with modeling.
Object storage was designed for web assets, backups, and analytics — not for feeding thousands of parallel reads into a GPU cluster. It's time to talk about why, and what actually works instead.
What Object Storage Gives You
S3 and its siblings (GCS, Azure Blob, R2, B2, Wasabi) solve a narrow set of problems incredibly well:
- Durability: 11 nines, redundant across zones
- Scale: effectively infinite capacity
- Cost: cents per GB per month for cold storage
- Simple API: PUT, GET, LIST, DELETE over HTTP
For storing model weights you rarely touch, datasets you want to archive, or artifacts from a pipeline run, object storage is the right answer. It's the default for a reason.
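That whole API fits in a few lines. A minimal sketch with boto3 (bucket and key names are made up for illustration):

import boto3

s3 = boto3.client("s3")

# PUT: upload an artifact
with open("weights.pt", "rb") as f:
    s3.put_object(Bucket="my-artifacts", Key="runs/123/weights.pt", Body=f)

# GET: read it back
obj = s3.get_object(Bucket="my-artifacts", Key="runs/123/weights.pt")
data = obj["Body"].read()

# LIST: enumerate keys under a prefix
resp = s3.list_objects_v2(Bucket="my-artifacts", Prefix="runs/123/")

# DELETE
s3.delete_object(Bucket="my-artifacts", Key="runs/123/weights.pt")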
Where It Breaks Down for Training
The trouble starts the moment your training code needs to actually read that data.
1. No Native Random Access
S3 lets you GET an object by key. Your PyTorch DataLoader expects to open() a file and seek within it. The gap between those two APIs is where people start duct-taping solutions. You end up with every worker fetching the entire shard before it can start reading, or issuing a range request per sample and paying HTTP overhead thousands of times per batch.
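Here is roughly what the range-request-per-sample workaround looks like; a minimal sketch assuming fixed-size records packed into one shard object (the class name and parameters are illustrative, not from any library):

import boto3
from torch.utils.data import Dataset

class RangeRequestDataset(Dataset):
    """One S3 range request per sample: every __getitem__ pays a full HTTP round trip."""

    def __init__(self, bucket, key, record_bytes, num_records):
        self.bucket, self.key = bucket, key
        self.record_bytes, self.num_records = record_bytes, num_records
        self.s3 = None  # created lazily so each DataLoader worker gets its own client

    def __len__(self):
        return self.num_records

    def __getitem__(self, idx):
        if self.s3 is None:
            self.s3 = boto3.client("s3")
        start = idx * self.record_bytes
        end = start + self.record_bytes - 1  # Range header is inclusive
        resp = self.s3.get_object(Bucket=self.bucket, Key=self.key,
                                  Range=f"bytes={start}-{end}")
        return resp["Body"].read()  # ~20-50ms to first byte, for every single sample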
2. Latency That Kills GPU Utilization
An H100 costs roughly $2-4/hour. If your data pipeline can't keep up, the GPU idles. S3 request latency is fine for one request (~20-50ms first byte), but your DataLoader is firing hundreds of concurrent requests per second. Every tail-latency spike is a stalled batch. Every stalled batch is money on fire.
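To put numbers on it, a back-of-envelope calculation (all figures illustrative, not measurements):

# Cost of GPUs stalling on the data pipeline (illustrative numbers)
gpu_per_hour = 3.00       # H100, roughly $2-4/hr
num_gpus = 64
stall_fraction = 0.20     # GPUs idle 20% of the time waiting on the DataLoader
hours_per_week = 24 * 7

wasted = gpu_per_hour * num_gpus * stall_fraction * hours_per_week
print(f"~${wasted:,.0f}/week of idle GPU time")  # ~$6,451/week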
3. No Filesystem Semantics
Your training framework expects:
- Directories you can list
- Files you can stat()
- mmap() for memory-mapped reads
- POSIX permissions
- Atomic rename
Object storage has none of these. Bucket "folders" are just key prefixes. There's no mmap. And there's no real rename: it's a copy followed by a delete, which breaks the atomic checkpoint-swap pattern training frameworks rely on.
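Concretely, here is the atomic checkpoint swap a POSIX filesystem makes possible; a minimal sketch with a made-up path, not code from any particular framework:

import os
import tempfile
import torch

def save_checkpoint_atomically(state, path="/mnt/training-data/ckpt.pt"):
    """Write to a temp file, then atomically swap it into place.
    os.replace() is atomic on POSIX; on an object store a 'rename' is
    copy-then-delete, so a crash midway can leave a torn checkpoint."""
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see the old or new file, never a partial one
    except BaseException:
        os.unlink(tmp_path)
        raise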
4. Egress Costs Multiply
Every epoch, the workers collectively re-read the full dataset. Multiply by epochs per run, runs per week, the number of engineers, and (if your training cluster is in a different region from your bucket) cross-region egress fees. Suddenly "cheap" storage isn't cheap at all.
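The same kind of back-of-envelope math applies here (again, illustrative figures):

# Cross-region egress for repeated epochs (illustrative numbers)
dataset_tb = 2            # dataset size in TB
epochs_per_run = 5
runs_per_week = 10        # across the whole team
egress_per_gb = 0.02      # typical cross-region rate, $/GB

gb_read = dataset_tb * 1024 * epochs_per_run * runs_per_week
print(f"~{gb_read / 1024:.0f} TB read, ~${gb_read * egress_per_gb:,.0f}/week in egress")
# ~100 TB read, ~$2,048/week in egress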
The Workarounds People Try
Every team goes through the same stages of grief:
Stage 1: Copy the data locally. Works for a 50GB dataset. Falls apart at a terabyte. Falls apart completely when you have a hundred datasets.
Stage 2: Mount S3 with FUSE (s3fs, goofys). The GitHub stars suggest it's a good idea. Running it in production teaches you otherwise. We wrote a whole post on this.
Stage 3: Adopt EFS or FSx. Now you have a POSIX filesystem, but also a bill that scales linearly with capacity whether you're reading the data or not. And the data still needs to get into EFS from wherever it came from.
Stage 4: Build a custom data loader. You now maintain a small distributed system. Congratulations.
What Actually Works: Caching Gateway + POSIX Protocol
The pattern that survives contact with real GPU clusters has three ingredients:
- Keep the durable copy in object storage. Cheap, safe, infinite.
- Put a caching gateway near your compute. Local NVMe absorbs the hot working set. Cache misses fall through to object storage transparently.
- Expose the data over a real filesystem protocol. NFS or SMB, so your training code doesn't care where the bytes live.
This is exactly what Training Pipes is. You create a bucket (managed on our infrastructure, or bring your own), mount it over NFSv4 near your GPUs, and your DataLoader sees a normal filesystem. The caching layer handles the access pattern; the object store handles the durability.
# Create a managed bucket
npx bucketfs buckets create --name datasets
# Mount it near your compute over NFS
npx bucketfs mount create --bucket datasets --region us-east-1 --protocol nfs
# Point your training script at it
export DATA_DIR=/mnt/training-data
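From the training script's side it's just a directory. A minimal sketch of what the DataLoader ends up looking like (the dataset class and file layout are made up for illustration):

import os
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset

class ShardDirectoryDataset(Dataset):
    """Reads samples from files under the NFS mount like any local directory."""

    def __init__(self, root=os.environ.get("DATA_DIR", "/mnt/training-data")):
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        return torch.load(self.files[idx])  # plain open()/read(); hot files come from the NVMe cache

loader = DataLoader(ShardDirectoryDataset(), batch_size=32, num_workers=8)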
No FUSE tricks. No EFS bills that scale with idle capacity. No custom data loader to maintain.
When Plain Object Storage Is Still the Right Call
Before we come off as religious about this: plain object storage is still the right default for a few workloads.
- Archival: datasets you might need someday but aren't actively training on
- Artifacts: model weights for deployed inference servers (pulled once)
- Small, infrequent access: sampling a few thousand rows for a dashboard
What doesn't belong in plain object storage is the hot path of training.
The Bottom Line
Object storage is a building block. Training needs a storage system — one that combines the economics of S3 with the semantics of a filesystem and the performance of local disk. That's the gap we built Training Pipes to close.