Friday, June 5, 2026

Kubernetes Persistent Volumes for ML: A Storage Pattern Guide

Training Pipes Team
[Image: Kubernetes cluster architecture diagram]

Kubernetes has a lot of good ideas about storage — PersistentVolumes, PersistentVolumeClaims, StorageClasses, CSI drivers — but they were mostly designed for stateful applications like databases, not for ML training. Applying them naively to a training workload gives you a mess.

This post walks through the main PV patterns for ML workloads and shows which ones hold up.

The Three Storage Roles in an ML Pod

Before picking a PV type, identify what role each volume plays. Training pods typically need:

  1. Read-only dataset volume. Large, shared across pods, read from repeatedly.
  2. Shared working volume. Checkpoints, logs, intermediate files that outlive the pod but not the cluster.
  3. Ephemeral scratch. Activations, tmp files, anything you don't care about after the pod exits.

Different roles want different storage types. Using one PV type for everything is the first mistake.

Pattern 1: emptyDir on Local NVMe for Scratch

Role: ephemeral scratch.

volumes:
  - name: scratch
    emptyDir:
      medium: Memory  # tmpfs (RAM-backed); omit "medium" to use the node's local disk

For scratch, don't overthink it. emptyDir, RAM-backed or on the node's local NVMe, is fast and effectively free. EBS-backed PVCs give you durability you don't need for scratch data.
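If you want scratch on the node's local disk instead of RAM (tmpfs counts against the container's memory limit), drop the medium field; a minimal sketch, with an illustrative sizeLimit:

volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 500Gi  # pod is evicted if scratch grows past this; omit for no cap

This assumes the kubelet's data directory actually sits on NVMe; on instance-store nodes that usually takes some node-level setup.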

Pattern 2: EBS for Single-Pod Stateful Storage

Role: per-pod state (rare in training, common in inference).

# volumeClaimTemplates is a StatefulSet field; each replica gets its own PVC
volumeClaimTemplates:
  - metadata:
      name: state
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: gp3
      resources:
        requests:
          storage: 500Gi

Good for databases, model servers, things with one writer. Not good for datasets, because RWO means only one node can attach the volume at a time.

Pattern 3: EFS for Shared Datasets (The Usual Choice, Usually Wrong)

Role: shared dataset, convenient.

volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: efs-dataset-pvc

This is what most teams reach for first. It's a real NFS filesystem. It supports ReadWriteMany. It "just works" with any pod.

What you don't see until the bill comes:

  • EFS is priced per GB stored, so you pay for the full dataset, not just the hot working set
  • No local caching layer: every read goes over the network to the EFS mount targets
  • Elastic Throughput mode is billed per GB read, so training that reads the same data 100x pays the read charge 100x
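A rough back-of-the-envelope, assuming Elastic Throughput reads cost on the order of $0.03/GB (us-east-1 list pricing at the time of writing): a 10 TB dataset read 100 times over a training run is about 1 PB of reads, or roughly $30,000 in throughput charges before you've paid to store a single byte.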

We ran the numbers in EFS vs Training Pipes.

Pattern 4: FUSE-Based S3 CSI Drivers

Role: shared dataset, cheap.

volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: s3-dataset-pvc   # backed by Mountpoint-for-S3 CSI

AWS's CSI driver for Mountpoint-for-S3 (and similar for GCS, Azure) gives you an S3 bucket as a PV.
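For reference, static provisioning with the Mountpoint CSI driver looks roughly like the following sketch; the bucket name is a placeholder, and you should check the driver's docs for current attribute names:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-dataset-pv
spec:
  capacity:
    storage: 1Ti                 # required by the API; Mountpoint ignores it
  accessModes:
    - ReadWriteMany
  csi:
    driver: s3.csi.aws.com       # Mountpoint for Amazon S3 CSI driver
    volumeHandle: s3-dataset-pv  # any cluster-unique ID
    volumeAttributes:
      bucketName: my-training-bucket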

Pros: cheap storage. Works across clusters.

Cons: it's still FUSE. Limited POSIX semantics. Per-pod cache (so N pods refetch the same data N times). Deep dive in why FUSE isn't enough.

Pattern 5: NFS CSI Driver Pointing at a Caching Gateway

Role: shared dataset, efficient.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: training-data-pv
spec:
  capacity:
    storage: 10Ti
  accessModes:
    - ReadWriteMany
  nfs:  # in-tree NFS plugin; a csi-driver-nfs PV works the same way
    server: gateway-us-east-1.trainingpipes.com
    path: /my-bucket
  mountOptions:
    - vers=4.1
    - hard

This is the pattern that holds up. A single NFS server (the gateway) provides a shared, cached view of object storage. Every pod in every training job mounts the same PVC and gets consistent, fast reads.
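Pods bind to that PV through a matching claim; a minimal sketch using the names above:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # empty string disables dynamic provisioning
  volumeName: training-data-pv    # bind statically to the PV above
  resources:
    requests:
      storage: 10Ti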

Compared to EFS:

  • Cheaper (object-storage pricing + cache)
  • Faster (local NVMe cache near the cluster)
  • Works across clouds (canonical data can live anywhere S3-compatible)

Compared to FUSE:

  • Real NFS (no FUSE privileged pods)
  • Shared cache across all pods
  • Standard Kubernetes NFS plumbing

Putting It Together: A Full Job Spec

apiVersion: batch/v1
kind: Job
metadata:
  name: train-resnet
spec:
  template:
    spec:
      containers:
        - name: train
          image: my-training-image:latest
          resources:
            limits:
              nvidia.com/gpu: 8
          volumeMounts:
            - name: dataset
              mountPath: /mnt/datasets
              readOnly: true
            - name: checkpoints
              mountPath: /mnt/checkpoints
            - name: scratch
              mountPath: /scratch
      restartPolicy: Never  # required for Jobs (Never or OnFailure)
      volumes:
        - name: dataset
          persistentVolumeClaim:
            claimName: tp-datasets-ro
        - name: checkpoints
          persistentVolumeClaim:
            claimName: tp-checkpoints-rw
        - name: scratch
          emptyDir:
            medium: Memory

Two mounts via the same Training Pipes gateway (one RO for datasets, one RW for checkpoints) plus an in-memory scratch. No per-pod FUSE daemons, no EFS capacity overprovisioning, no privileged containers.

Network Plumbing

One subtlety: the NFS server needs to be reachable from pods. Options:

  • Public NFS over WireGuard — Training Pipes default. The mount runs through a WireGuard tunnel, so NFS traffic never crosses the open internet in the clear.
  • PrivateLink / VPC peering — for enterprise deployments where you want NFS over a private network path. Available on higher tiers.
  • Sidecar mount helpers — for clusters that don't allow direct mounts, a sidecar handles the mount and exposes it to the main container.

Operational Gotchas

StorageClass Binding Mode

Use volumeBindingMode: WaitForFirstConsumer on the StorageClass for any storage that's zonal (EBS, local disk), so the volume is provisioned in the AZ the pod actually lands in; see the sketch below. NFS-backed PVs (like a Training Pipes mount) are region-wide and don't need it.
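For the EBS case, a gp3 StorageClass sketch (the class name is illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer  # delay binding until a pod schedules
allowVolumeExpansion: true               # needed for the expansion note below
parameters:
  type: gp3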

Access Modes

  • ReadWriteMany (RWX): dataset mounts, checkpoint mounts
  • ReadWriteOnce (RWO): single-pod scratch that needs to survive restarts
  • ReadOnlyMany (ROX): enforced read-only shared mounts

CSI Volume Expansion

Gateway-backed NFS PVs don't need expansion; the declared capacity is nominal, and the real limit is whatever the backing bucket holds. EBS PVs need explicit expansion: set allowVolumeExpansion: true on the StorageClass (as in the sketch above) and bump the PVC's storage request. Recent CSI drivers resize the filesystem online; older setups needed a pod restart.

Stuck Mounts on Pod Eviction

NFS mounts with hard retry forever, so a network blip can leave I/O blocked. Use soft,timeo=600,retrans=2 for aggressive workloads, but be aware that soft surfaces I/O errors to the application instead of blocking. Training Pipes mounts default to hard with reasonable timeouts.
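In PV terms that's just a different mountOptions list:

mountOptions:
  - vers=4.1
  - soft        # surface I/O errors instead of blocking forever
  - timeo=600   # 60s (timeo is in tenths of a second)
  - retrans=2   # retries before giving up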

The Short Version

For ML in Kubernetes:

  • emptyDir on local NVMe for scratch
  • NFS CSI pointing at a caching gateway (Training Pipes) for datasets and checkpoints
  • EBS for single-pod state that isn't training-related
  • Avoid EFS for high-read-multiplier workloads
  • Avoid FUSE CSI drivers for anything bigger than a hobby project

Try an NFS-backed PVC for your cluster →