Friday, June 5, 2026
Kubernetes Persistent Volumes for ML: A Storage Pattern Guide
Kubernetes has a lot of good ideas about storage — PersistentVolumes, PersistentVolumeClaims, StorageClasses, CSI drivers — and they were mostly designed for stateful applications like databases, not for ML training. Applying them naively to a training workload gives you a mess.
This post walks through the main PV patterns for ML workloads and shows which ones hold up.
The Three Storage Roles in an ML Pod
Before picking a PV type, identify what role each volume plays. Training pods typically need:
- Read-only dataset volume. Large, shared across pods, read from repeatedly.
- Shared working volume. Checkpoints, logs, intermediate files that outlive the pod but not the cluster.
- Ephemeral scratch. Activations, tmp files, anything you don't care about after the pod exits.
Different roles want different storage types. Using one PV type for everything is the first mistake.
Pattern 1: EBS / Local NVMe for Scratch
Role: ephemeral scratch.
volumes:
- name: scratch
  emptyDir:
    medium: Memory   # RAM-backed tmpfs; omit medium to use the node's local SSD/NVMe
For scratch, don't overthink it. emptyDir on local NVMe is fast and free. EBS-backed PVCs give you durability you don't need for scratch data.
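A minimal pod wiring that scratch volume into a training container looks roughly like this (image and paths are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo
spec:
  restartPolicy: Never
  containers:
  - name: train
    image: my-training-image:latest
    volumeMounts:
    - name: scratch
      mountPath: /scratch      # activations, tmp files; gone when the pod exits
  volumes:
  - name: scratch
    emptyDir: {}               # node-local disk; set medium: Memory for tmpfs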
Pattern 2: EBS for Single-Pod Stateful Storage
Role: per-pod state (rare in training, common in inference).
volumeClaimTemplates:
- metadata:
    name: state
  spec:
    accessModes: [ReadWriteOnce]
    storageClassName: gp3
    resources:
      requests:
        storage: 500Gi
Good for databases, model servers, things with one writer. Not good for datasets, because RWO limits the volume to a single node at a time.
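volumeClaimTemplates only exists on StatefulSets, so for context, here's roughly how the stanza above sits inside one (names and image are illustrative):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: model-server
spec:
  serviceName: model-server
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: server
        image: my-inference-image:latest
        volumeMounts:
        - name: state
          mountPath: /var/lib/state   # per-replica EBS volume
  volumeClaimTemplates:
  - metadata:
      name: state
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: gp3
      resources:
        requests:
          storage: 500Gi

Each replica gets its own 500Gi gp3 volume, which is exactly the one-writer model this pattern is built for.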
Pattern 3: EFS for Shared Datasets (The Usual Choice, Usually Wrong)
Role: shared dataset.
volumes:
- name: dataset
  persistentVolumeClaim:
    claimName: efs-dataset-pvc
This is what most teams reach for first. It's a real NFS filesystem. It supports ReadWriteMany. It "just works" with any pod.
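Behind that claim there's usually a StorageClass for the EFS CSI driver doing dynamic provisioning; a representative sketch, assuming the standard efs.csi.aws.com driver (file system ID is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap              # one EFS access point per PVC
  fileSystemId: fs-0123456789abcdef0    # illustrative
  directoryPerms: "700"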
What you don't see until the bill comes:
- EFS is priced on every GB you store, not just the hot working set
- No local caching layer: every read goes back over the network to the EFS mount targets
- Throughput on Elastic Throughput mode is priced per GB read — training that reads the same data 100x costs 100x
We ran the numbers in EFS vs Training Pipes.
Pattern 4: FUSE-Based S3 CSI Drivers
Role: shared dataset, cheap.
volumes:
- name: dataset
  persistentVolumeClaim:
    claimName: s3-dataset-pvc # backed by Mountpoint-for-S3 CSI
AWS's CSI driver for Mountpoint-for-S3 (and similar for GCS, Azure) gives you an S3 bucket as a PV.
Pros: cheap storage. Works across clusters.
Cons: it's still FUSE. Limited POSIX semantics. Per-pod cache (so N pods refetch the same data N times). Deep dive in why FUSE isn't enough.
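For reference, the PV behind that claim is typically statically provisioned; a rough sketch assuming the Mountpoint-for-S3 CSI driver (bucket, region, and size are illustrative, and the exact attributes depend on driver version):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-dataset-pv
spec:
  capacity:
    storage: 1200Gi            # required by the API, not enforced by the driver
  accessModes:
  - ReadWriteMany
  mountOptions:
  - region us-east-1
  csi:
    driver: s3.csi.aws.com
    volumeHandle: s3-dataset-pv
    volumeAttributes:
      bucketName: my-dataset-bucket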
Pattern 5: NFS CSI Driver Pointing at a Caching Gateway
Role: shared dataset, efficient.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 10Ti
  accessModes:
  - ReadWriteMany
  nfs:
    server: gateway-us-east-1.trainingpipes.com
    path: /my-bucket
  mountOptions:
  - vers=4.1
  - hard
This is the pattern that holds up. A single NFS server (the gateway) provides a shared, cached view of object storage. Every pod in every training job mounts the same PVC and gets consistent, fast reads.
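Pods bind to that PV through a matching claim; a minimal sketch (names mirror the PV above):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""          # skip dynamic provisioning, bind to the static PV
  volumeName: training-data-pv
  resources:
    requests:
      storage: 10Ti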
Compared to EFS:
- Cheaper (object-storage pricing + cache)
- Faster (local NVMe cache near the cluster)
- Works across clouds (canonical data can live anywhere S3-compatible)
Compared to FUSE:
- Real NFS (no FUSE privileged pods)
- Shared cache across all pods
- Standard Kubernetes NFS plumbing
Putting It Together: a Full Pod Spec
apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: train
        image: my-training-image:latest
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - name: dataset
          mountPath: /mnt/datasets
          readOnly: true
        - name: checkpoints
          mountPath: /mnt/checkpoints
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: tp-datasets-ro
      - name: checkpoints
        persistentVolumeClaim:
          claimName: tp-checkpoints-rw
      - name: scratch
        emptyDir:
          medium: Memory
Two mounts via the same Training Pipes gateway (one RO for datasets, one RW for checkpoints) plus an in-memory scratch. No per-pod FUSE daemons, no EFS capacity overprovisioning, no privileged containers.
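The two claims themselves are just Pattern 5-style NFS PVCs; an illustrative sketch (the backing PVs would need to advertise the matching access modes):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tp-datasets-ro
spec:
  accessModes: [ReadOnlyMany]
  storageClassName: ""
  volumeName: tp-datasets-pv      # static NFS PV, as in Pattern 5
  resources:
    requests:
      storage: 10Ti
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tp-checkpoints-rw
spec:
  accessModes: [ReadWriteMany]
  storageClassName: ""
  volumeName: tp-checkpoints-pv
  resources:
    requests:
      storage: 1Ti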
Network Plumbing
One subtlety: the NFS server needs to be reachable from pods. Options:
- Public NFS over WireGuard — Training Pipes default. Mount target includes a WireGuard tunnel, so NFS never traverses the open internet in the clear.
- PrivateLink / VPC peering — for enterprise deployments where you want NFS over a private network path. Available on higher tiers.
- Sidecar mount helpers — for clusters that don't allow direct mounts, a sidecar handles the mount and exposes it to the main container.
Operational Gotchas
StorageClass Binding Mode
Set volumeBindingMode: WaitForFirstConsumer on StorageClasses for zonal storage (EBS, local disk) so the volume is provisioned in the same zone as the pod that uses it. NFS-backed PVs (like a Training Pipes mount) aren't zonal and don't need it.
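For example, a deferred-binding gp3 class looks like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer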
Access Modes
- ReadWriteMany (RWX): dataset mounts, checkpoint mounts
- ReadWriteOnce (RWO): single-pod scratch that needs to survive restarts
- ReadOnlyMany (ROX): enforced read-only shared mounts
CSI Volume Expansion
Gateway-backed NFS PVs don't need expansion — the capacity is whatever the backing bucket holds. EBS PVs need explicit expansion: allowVolumeExpansion on the StorageClass plus a larger request on the PVC, and depending on driver and filesystem versions the resize may not take effect until the pod restarts.
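The expansion itself is just a bigger request on the claim, assuming the StorageClass allows it; an illustrative sketch reusing the StatefulSet claim name from Pattern 2:

# Requires allowVolumeExpansion: true on the gp3 StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: state-model-server-0      # illustrative StatefulSet-generated claim
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3
  resources:
    requests:
      storage: 1Ti                # raised from 500Gi; the CSI driver grows the volume and filesystem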
Stuck Mounts on Pod Eviction
NFS mounts with hard can hang on network blips. Use soft,timeo=600,retrans=2 for aggressive workloads, but be aware it can cause I/O errors instead of blocking. Training Pipes mounts default to hard with reasonable timeouts.
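In PV terms, that's just a different mountOptions block on the Pattern 5 volume; a sketch:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv-soft
spec:
  capacity:
    storage: 10Ti
  accessModes:
  - ReadWriteMany
  nfs:
    server: gateway-us-east-1.trainingpipes.com
    path: /my-bucket
  mountOptions:
  - vers=4.1
  - soft                 # fail I/O instead of hanging on a prolonged network blip
  - timeo=600
  - retrans=2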
The Short Version
For ML in Kubernetes:
- emptyDir on local NVMe for scratch
- NFS CSI pointing at a caching gateway (Training Pipes) for datasets and checkpoints
- EBS for single-pod state that isn't training-related
- Avoid EFS for high-read-multiplier workloads
- Avoid FUSE CSI drivers for anything bigger than a hobby project