Sunday, April 19, 2026·4 min read

Why Object Storage Isn't Enough for Modern ML Training

Training Pipes Team

If you've spent any time training models on cloud infrastructure, you've had this conversation: "the data lives in S3, but training needs a filesystem." What follows is usually a week of yak-shaving that has nothing to do with modeling.

Object storage was designed for web assets, backups, and analytics — not for feeding thousands of parallel reads into a GPU cluster. It's time to talk about why, and what actually works instead.

What Object Storage Gives You

S3 and its siblings (GCS, Azure Blob, R2, B2, Wasabi) solve a narrow set of problems incredibly well:

  • Durability: 11 nines, redundant across zones
  • Scale: effectively infinite capacity
  • Cost: cents per GB per month for cold storage
  • Simple API: PUT, GET, LIST, DELETE over HTTP

For storing model weights you rarely touch, datasets you want to archive, or artifacts from a pipeline run, object storage is the right answer. It's the default for a reason.

Where It Breaks Down for Training

The trouble starts the moment your training code needs to actually read that data.

1. No Native Random Access

S3 lets you GET an object by key. Your PyTorch DataLoader expects to open() a file and seek within it. The gap between those two APIs is where people start duct-taping solutions. You end up with every worker fetching the entire shard before it can start reading, or issuing a range request per sample and paying HTTP overhead thousands of times per batch.
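The per-sample range-request pattern looks roughly like this. A sketch with the HTTP call stubbed out so it runs without a network; the record width, shard layout, and `fetch_range` signature are our assumptions, not a real S3 client:

```python
class RangeRequestDataset:
    """Fixed-width records fetched with one ranged GET per sample."""

    RECORD_SIZE = 128  # hypothetical record width

    def __init__(self, fetch_range, num_records):
        # fetch_range(start, length) stands in for something like
        # s3.get_object(Bucket=..., Key=..., Range=f"bytes={start}-{end}")
        self.fetch_range = fetch_range
        self.num_records = num_records

    def __len__(self):
        return self.num_records

    def __getitem__(self, idx):
        # In the real thing, every sample is a full HTTP round trip.
        return self.fetch_range(idx * self.RECORD_SIZE, self.RECORD_SIZE)

# Toy in-memory "shard" so the sketch is self-contained.
shard = bytes(i % 256 for i in range(128 * 100))  # 100 records
ds = RangeRequestDataset(lambda start, n: shard[start:start + n], 100)
print(len(ds), len(ds[42]))
```

Wrap this in a DataLoader with a few dozen workers and you get the thousands-of-requests-per-batch behavior described above.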

2. Latency That Kills GPU Utilization

An H100 costs roughly $2-4/hour. If your data pipeline can't keep up, the GPU idles. S3 latency is fine for a single request (~20-50ms to first byte), but a DataLoader fires hundreds of concurrent requests per second, so every tail-latency spike becomes a stalled batch. Every stalled batch is money on fire.
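The back-of-envelope math is worth doing. The stall fraction and run length below are hypothetical; the GPU price is the mid-range of the figure above:

```python
# Cost of a data-starved GPU: all inputs except the price are assumptions.
gpu_cost_per_hour = 3.00   # mid-range of the $2-4/hour H100 figure
stall_fraction = 0.25      # GPU idle 25% of the time waiting on data
run_hours = 7 * 24         # one week-long training run

wasted = gpu_cost_per_hour * stall_fraction * run_hours
print(f"${wasted:.2f} per GPU, per week, spent on idle time")
```

Under these assumptions that's $126 per GPU per week; on a 64-GPU cluster, over $8,000 a week of compute spent waiting on storage.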

3. No Filesystem Semantics

Your training framework expects:

  • Directories you can list
  • Files you can stat()
  • mmap for memory-mapped reads
  • POSIX permissions
  • Atomic rename

Object storage has none of these. Bucket "folders" are just key prefixes. There's no mmap. There's no rename — it's copy-then-delete, which breaks checkpointing.
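That last point matters more than it looks. Safe checkpointing leans on the write-then-rename idiom, which only works when rename is atomic. A sketch (the function name is ours, not any framework's API):

```python
import os
import tempfile

def save_checkpoint(path: str, data: bytes) -> None:
    """Write to a temp file, then rename over the target. POSIX rename
    is atomic, so a reader sees either the old checkpoint or the new
    one, never a half-written file. Copy-then-delete on object storage
    has a window where both keys (or neither) are visible."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure bytes hit disk first
        os.replace(tmp, path)      # the atomic step
    except BaseException:
        os.unlink(tmp)
        raise
```

Overwrite the same path twice and a concurrent reader always gets a complete checkpoint, which is exactly the guarantee a key-prefix "folder" can't provide.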

4. Egress Costs Multiply

Each epoch, every worker re-reads the dataset. Multiply by the number of runs in a week, the number of engineers, and (if your training cluster is in a different region from your bucket) cross-region egress fees. Suddenly "cheap" storage isn't cheap at all.
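The multiplication compounds quickly. Every number below is illustrative, including the egress rate, which is not a quote from any provider's price sheet:

```python
# Hypothetical weekly egress bill for cross-region training reads.
dataset_tb = 2.0
epochs_per_run = 3
runs_per_week = 10
cross_region_egress_per_gb = 0.02  # illustrative rate, not a real price

tb_read_per_week = dataset_tb * epochs_per_run * runs_per_week
weekly_egress = tb_read_per_week * 1000 * cross_region_egress_per_gb
print(f"{tb_read_per_week:.0f} TB read/week -> ${weekly_egress:.2f} in egress")
```

Sixty terabytes of re-reads a week turns a storage bill of a few dollars into an egress bill in the thousands per month.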

The Workarounds People Try

Every team goes through the same stages of grief:

Stage 1: Copy the data locally. Works for a 50GB dataset. Falls apart at a terabyte. Falls apart completely when you have a hundred datasets.

Stage 2: Mount S3 with FUSE (s3fs, goofys). The GitHub stars suggest it's a good idea. Running it in production teaches you otherwise. We wrote a whole post on this.

Stage 3: Adopt EFS or FSx. Now you have a POSIX filesystem, but also a bill that scales linearly with capacity whether you're reading the data or not. And the data still needs to get into EFS from wherever it came from.

Stage 4: Build a custom data loader. You now maintain a small distributed system. Congratulations.

What Actually Works: Caching Gateway + POSIX Protocol

The pattern that survives contact with real GPU clusters has three ingredients:

  1. Keep the durable copy in object storage. Cheap, safe, infinite.
  2. Put a caching gateway near your compute. Local NVMe absorbs the hot working set. Cache misses fall through to object storage transparently.
  3. Expose the data over a real filesystem protocol. NFS or SMB, so your training code doesn't care where the bytes live.
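The three ingredients reduce to a simple data path. A minimal read-through sketch: the class name and the dict standing in for NVMe are ours, not the Training Pipes implementation, and a real gateway adds eviction, prefetch, and concurrency control:

```python
class ReadThroughCache:
    """Serve the hot working set locally; fall through to the object
    store on a miss. The durable copy never leaves the bucket."""

    def __init__(self, fetch_from_object_store):
        self.fetch = fetch_from_object_store
        self.nvme = {}        # stand-in for the local NVMe cache
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self.nvme:  # hot path: local read
            self.hits += 1
            return self.nvme[key]
        self.misses += 1      # cold path: one trip to the bucket
        data = self.fetch(key)
        self.nvme[key] = data
        return data

bucket = {"shard-000": b"a" * 1024, "shard-001": b"b" * 1024}
cache = ReadThroughCache(bucket.__getitem__)
for _ in range(3):            # epoch loop re-reading the same shard
    cache.read("shard-000")
print(cache.hits, cache.misses)
```

Three epochs over the same shard cost one object-store fetch instead of three, which is the whole economic argument for the gateway.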

This is exactly what Training Pipes is. You create a bucket (managed on our infrastructure, or bring your own), mount it over NFSv4 near your GPUs, and your DataLoader sees a normal filesystem. The caching layer handles the access pattern; the object store handles the durability.

# Create a managed bucket
npx bucketfs buckets create --name datasets

# Mount it near your compute over NFS
npx bucketfs mount create --bucket datasets --region us-east-1 --protocol nfs

# Point your training script at it
export DATA_DIR=/mnt/training-data

No FUSE tricks. No EFS bills that scale with idle capacity. No custom data loader to maintain.

When Plain Object Storage Is Still the Right Call

Before we come off as religious about this: plain object storage is still the right default for a few workloads.

  • Archival: datasets you might need someday but aren't actively training on
  • Artifacts: model weights for deployed inference servers (pulled once)
  • Small, infrequent access: sampling a few thousand rows for a dashboard

What doesn't belong in plain object storage is the hot path of training.

The Bottom Line

Object storage is a building block. Training needs a storage system — one that combines the economics of S3 with the semantics of a filesystem and the performance of local disk. That's the gap we built Training Pipes to close.

Try it free →