Thursday, April 23, 2026 · 3 min read

NFS vs S3 for AI Training: When to Use Each

Training Pipes Team

"Should I use NFS or S3?" is one of those questions that looks like it has a right answer until you spend a few minutes with it. The honest truth: the two protocols are solving different problems, and the best ML infrastructure teams use both — usually at the same time.

Here's how to think about it.

What NFS and S3 Actually Are

NFS (Network File System) is a protocol for remote filesystem access. Your OS mounts it, your apps see regular files and directories, and every open() / read() / write() call talks to a server over the network. It's been around since 1984.

S3 (Simple Storage Service) is an API for object storage. You PUT and GET blobs by key. There is no file handle, no seek cursor, no directory tree (just key prefixes). AWS launched it in 2006 and every major cloud now has a compatible equivalent.

They aren't competitors. They're different abstractions.
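
The difference is easiest to see in code. Here's a minimal sketch in Python; the mount point, bucket, and key are made up for illustration. The NFS side is just the ordinary file API, while the S3 side is an HTTP request via boto3:

    import boto3

    # NFS: the mount point behaves like a local directory, so plain
    # file calls work. "/mnt/datasets" is a hypothetical NFS mount.
    with open("/mnt/datasets/shard-0001.tar", "rb") as f:
        f.seek(4096)           # random access via a file offset
        chunk = f.read(1024)   # each uncached read is an RPC to the server

    # S3: no file handles or seek cursors; you GET a whole object by key.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-training-data", Key="shards/shard-0001.tar")
    blob = obj["Body"].read()  # one HTTP response body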

When S3 Wins

S3 (and S3-compatible object storage) is the right choice when:

  • Durability and scale matter more than latency. Backups, archives, build artifacts, model registries.
  • Access is infrequent or streaming. One big read or write, not thousands of small ones (sketched after this list).
  • Cost per GB dominates. Cold-tier object storage is dramatically cheaper than any filesystem.
  • You need global distribution. S3 replicates across availability zones by default, and across regions with a replication rule.
  • Your consumers are other services, not training code. A Spark job, a data warehouse load, a static site.

In short: S3 is ideal for data at rest.
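
What "one big read" looks like in practice is streaming an object straight through rather than copying it to disk first. A sketch with boto3, assuming a hypothetical bucket and key:

    import boto3

    def process(chunk: bytes) -> None:
        ...  # hypothetical consumer: decompress, parse, forward, etc.

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-training-data", Key="archives/corpus-v3.tar")

    # One long sequential read in 8 MiB chunks: the access pattern
    # object storage is built for.
    body = obj["Body"]
    for chunk in iter(lambda: body.read(8 * 1024 * 1024), b""):
        process(chunk)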

When NFS Wins

NFS is the right choice when:

  • Code expects a filesystem. Most ML frameworks, scientific computing tools, and legacy applications do.
  • You need POSIX semantics. mmap, stat, atomic renames, locks.
  • Multiple nodes need to read the same data concurrently. A distributed training job with 8 nodes sharing a dataset.
  • Random access patterns. DataLoaders sampling from random offsets in large files (see the sketch after this list).
  • Low-latency small reads. Model configs, tokenizer files, small metadata files.

In short: NFS is ideal for data in use by compute.
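
Those last few bullets are exactly where object storage struggles. On an NFS mount, a DataLoader-style random sample is just a seek, and POSIX extras like mmap come for free. A sketch, with a hypothetical record file on a hypothetical mount:

    import mmap
    import os
    import random

    path = "/mnt/datasets/tokens.bin"  # hypothetical file on an NFS mount

    # mmap works because NFS presents a real POSIX file; the kernel
    # pages in only the regions we actually touch.
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for _ in range(1000):
                offset = random.randrange(0, size - 4096)
                record = mm[offset : offset + 4096]  # random 4 KiB sample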

The Real Pattern for ML Teams

If you look at how mature ML platforms are built, the answer isn't "pick one." It's:

  • Store the canonical, durable copy in S3 (or equivalent). Keep versioning, lifecycle policies, and archival there.
  • Serve training reads over NFS from a caching layer that sits between the GPUs and S3. The NFS layer handles the hot working set; S3 handles durability.

The problem is that building that caching middle layer yourself is miserable. You need a fleet of servers, cache eviction logic, consistency semantics, regional deployment, monitoring, failover. This is why Training Pipes exists.
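
For a sense of what that middle layer does, here's a toy read-through cache in Python. The paths are hypothetical, and everything hard about the real thing (eviction, locking, consistency, failover) is exactly what this sketch leaves out:

    import os
    import boto3

    CACHE_DIR = "/mnt/nvme/cache"  # hypothetical local NVMe cache
    s3 = boto3.client("s3")

    def read_through(bucket: str, key: str) -> bytes:
        local = os.path.join(CACHE_DIR, bucket, key)
        if not os.path.exists(local):
            # Cache miss: fault the object in from S3 once...
            os.makedirs(os.path.dirname(local), exist_ok=True)
            s3.download_file(bucket, key, local)
        # ...then every subsequent read is a local NVMe read.
        with open(local, "rb") as f:
            return f.read()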

Side-by-Side

Dimension          | S3                                              | NFS
-------------------|-------------------------------------------------|------------------------------
Protocol           | HTTP REST                                       | RPC over TCP
Access unit        | Object (key + blob)                             | File (handle + offset)
Semantics          | Strong read-after-write (AWS since 2020; S3-compatible stores vary) | POSIX
Concurrent readers | Effectively unlimited                           | Depends on server
Small-file cost    | High (per-request overhead)                     | Low (persistent connection)
Typical latency    | 20–100 ms to first byte                         | Sub-ms (LAN)
Typical use        | Storage of record                               | Live compute
Cost model         | Per GB + requests + egress                      | Provisioned capacity (per GB)

Common Mistakes

"We'll just use s3fs." FUSE-based S3 mounts give you a filesystem view of S3, but they don't give you NFS performance. See why FUSE isn't enough for production.

"We'll copy the data to local disk at job start." Works until your dataset exceeds local disk or your jobs take longer to start than to run.

"We'll use EFS for everything." EFS is a real NFS filesystem, but it's priced for persistent storage, not a cache. You'll pay for the whole dataset capacity even if only 5% is hot.

"We'll pick one and force it everywhere." Then every team works around you.

The Training Pipes Approach

We give you both without forcing you to build the middle layer.

  1. Your canonical data lives in object storage — either managed buckets we provide, or your existing S3 / GCS / R2.
  2. Our regional gateways cache the hot working set on local NVMe near your compute.
  3. You mount that gateway over NFSv4 (or SMB). Your training code sees a filesystem.
  4. The same bucket is also available via S3-compatible API for jobs that want it.

You don't choose NFS or S3. You get both, pointing at the same data.
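
In code, that duality looks like this. The gateway endpoint, mount point, and bucket below are placeholders:

    import boto3

    # Path 1: training code reads through the NFS mount.
    with open("/mnt/trainingpipes/my-bucket/configs/tokenizer.json", "rb") as f:
        via_nfs = f.read()

    # Path 2: a batch job reads the same object over the S3-compatible API.
    s3 = boto3.client("s3", endpoint_url="https://gateway.example.invalid")
    obj = s3.get_object(Bucket="my-bucket", Key="configs/tokenizer.json")
    via_s3 = obj["Body"].read()

    assert via_nfs == via_s3  # same bytes, two access paths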

Mount a bucket in five minutes →