Sunday, May 24, 2026·5 min read
Checkpointing Large Models: A Storage Guide for ML Engineers
Checkpointing a large model is where storage systems get embarrassed. Training a 70B-parameter model produces a 500GB+ checkpoint every hour. If that write blocks the training loop, your GPUs sit idle. If it fails, you lose a run. If you save a copy every step to be safe, your storage bill explodes.
This post is a practical guide to designing a checkpointing pipeline that's fast, durable, and affordable.
The Checkpoint Budget
Start with the constraints:
- Write size: 50GB–1TB per checkpoint, depending on model size and optimizer state.
- Cadence: typically every 30 minutes to every few hours. Shorter cadences mean less lost work on failure, but more storage pressure.
- Durability: must survive node failure. A checkpoint that exists only on the node that wrote it is not a checkpoint.
- Read cadence: rare. You resume from a checkpoint if something breaks; most checkpoints are never read.
- Retention tail: you keep many checkpoints, sometimes dozens, and occasionally promote one to "this is the model we ship."
Why Checkpoint Writes Are Hard
1. They're Bursty
A training loop produces a steady stream of data reads. A checkpoint dumps hundreds of gigabytes in one short burst, then nothing for an hour. Your storage needs to handle the burst, not the average.
2. They Need to Be Atomic
A half-written checkpoint is worse than no checkpoint. If a node dies mid-write, the partial file shouldn't poison the restart. POSIX atomic rename is the usual answer: write to a temp name, fsync, rename to the final name. This requires real filesystem semantics — not all object-backed filesystems give you that.
3. They Block Training (If You're Not Careful)
A naive torch.save(state, path) blocks the training step until the bytes are flushed. A 500GB write at 1GB/s is more than eight minutes of stalled GPUs. That's expensive.
4. They Multiply
Saving every hour for a week = 168 checkpoints. At 500GB each = 84TB. Do this for several concurrent runs and your checkpoint storage becomes the dominant storage cost.
Architecture Patterns
Pattern 1: Local Disk + Background Upload
Write to local NVMe (fast), then upload to durable storage in the background.
Pros: training resumes immediately. Local NVMe is very fast.
Cons: window of vulnerability between local write and durable copy. Node death during that window = lost checkpoint.
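A minimal sketch of this pattern, assuming boto3 and a torch state dict; the bucket, key, and local path are placeholders:

```python
import threading
import boto3   # assumes AWS credentials are already configured
import torch

s3 = boto3.client("s3")

def save_then_upload(state, local_path, bucket, key):
    # Fast local write: training can continue as soon as this returns.
    torch.save(state, local_path)
    # Durable copy happens in the background; the checkpoint is
    # vulnerable until this upload finishes.
    t = threading.Thread(
        target=s3.upload_file, args=(local_path, bucket, key), daemon=True
    )
    t.start()
    return t  # join() later if you need to confirm durability
```

Track the returned thread (or a future): if you delete local checkpoints to free NVMe space, do it only after the corresponding upload has completed.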
Pattern 2: Direct to Object Storage (S3 API)
Write the checkpoint via multipart upload directly to S3.
Pros: durable immediately once upload completes. Cheap.
Cons: upload bandwidth becomes the bottleneck. Writing 500GB to S3 over a single TCP connection at 100 MB/s takes over 80 minutes. Parallel multipart uploads help, but require careful tuning.
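If you go this route, the knobs that matter are part size and concurrency. A hedged sketch using boto3's transfer manager; the bucket, key, and tuning values are placeholders to adjust for your NIC and node count:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Parallel multipart upload: bigger parts and more concurrency help on
# fat pipes, but too much concurrency starves other traffic on the node.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # use multipart above 64 MB
    multipart_chunksize=256 * 1024 * 1024,  # 256 MB parts
    max_concurrency=16,                     # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("step_1000.pt", "my-ckpt-bucket", "run42/step_1000.pt",
               Config=config)
```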
Pattern 3: NFS to a Caching Gateway
Write to a mounted NFS filesystem. The gateway handles:
- Write-back caching: writes land on local NVMe at full speed, training resumes immediately.
- Async durability: bytes drain to object storage in the background.
- Atomic rename: POSIX rename semantics work correctly.
- Asynchronous multipart upload: you don't manage the S3 API directly.
Pros: fastest write experience. Training barely notices the checkpoint. Durable within seconds.
Cons: briefly vulnerable during the drain window (measured in seconds, not minutes).
Pattern 4: Sharded Checkpoints (Distributed Writes)
With large distributed training (FSDP, DeepSpeed, Megatron), the model state is already sharded across nodes. Each node writes its own shard in parallel. This turns one 500GB write into eight 62GB concurrent writes, and total wall time drops accordingly.
Most teams combine Pattern 4 with one of Patterns 1-3 for the actual write target.
A Specific Recipe That Works
Here's a checkpoint pipeline we've seen work at scale:
1. Shard the Write Across Nodes
Use FSDP or DeepSpeed's checkpointing to give each rank its own piece of the state.
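As one illustration, FSDP can hand each rank a state dict containing only its own shard instead of gathering everything onto rank 0; the exact API has shifted across PyTorch releases, so treat this as a sketch and check your version:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

def local_shard_state(model):
    # Each rank gets only its own shard, so every node can write its
    # piece of the checkpoint in parallel.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        return model.state_dict()
```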
2. Mount a Training Pipes NFS Gateway on Every Node
```bash
sudo mount -t nfs4 $gateway:/checkpoints /mnt/ckpt
```
All nodes share the same namespace. Rank 0 writes its shard to /mnt/ckpt/step_1000/rank_0, rank 1 to rank_1, etc.
3. Write with Async Drain
```python
import os
import torch

def save_shard(state, path):
    tmp = path + ".tmp"          # write under a temporary name first
    torch.save(state, tmp)
    with open(tmp, "rb") as f:   # flush the bytes before the rename
        os.fsync(f.fileno())
    os.rename(tmp, path)         # atomic rename: readers never see a partial file
```
The gateway buffers the write to NVMe and acknowledges quickly. Your training loop moves on. Bytes drain to object storage in the background.
4. Barrier Before Continuing
```python
torch.distributed.barrier()
```
Once all ranks confirm their write, the checkpoint is logically complete. The gateway continues draining to durable storage asynchronously.
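Putting steps 2–4 together, a per-rank driver might look like the following, reusing the save_shard helper from step 3 (the step number and mount point are placeholders):

```python
import os
import torch.distributed as dist

def checkpoint_step(state, step, root="/mnt/ckpt"):
    rank = dist.get_rank()
    ckpt_dir = os.path.join(root, f"step_{step}")
    os.makedirs(ckpt_dir, exist_ok=True)

    # Each rank writes only its own shard (temp file + fsync + rename).
    save_shard(state, os.path.join(ckpt_dir, f"rank_{rank}"))

    # Logically complete once every rank has renamed its shard into place;
    # the gateway keeps draining to object storage in the background.
    dist.barrier()
```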
5. Lifecycle Policies on the Object Store
Don't keep every checkpoint forever. Configure your bucket to move older checkpoints to a cheaper storage class after N days, and to delete after M days. Keep only the "promote this one" checkpoints indefinitely.
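If you manage the bucket yourself, the rule is a few lines of boto3; the bucket name, prefix, and day counts below are placeholders, not recommendations:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ckpt-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            # Colder storage class after 7 days, gone after 30.
            "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 30},
        }]
    },
)
```

Promoted checkpoints need to live under a different prefix (or carry a tag the rule excludes) so they survive the cleanup.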
Cost Math
For a 70B model with a 350GB checkpoint every hour, running a 7-day training job:
| Approach | Storage | Requests | Egress | Total/run |
|---|---|---|---|---|
| Direct to S3 Standard, keep all | $1,380 | $20 | $0 (in-region) | ~$1,400 |
| Local NVMe + async S3, lifecycle to Glacier after 7d | $350 | $15 | $0 | ~$365 |
| Training Pipes gateway, lifecycle policy | $280 | included | $0 | ~$280 |
The savings come from two places: cheaper storage on the cold tail (lifecycle rules), and no per-request overhead for gateway-managed writes.
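The arithmetic behind rows like these is easy to redo for your own run; the rate below is an illustrative S3 Standard price, not a quote:

```python
ckpt_gb = 350            # checkpoint size, GB
per_day = 24             # one checkpoint per hour
days = 7                 # length of the run
price_gb_month = 0.023   # illustrative S3 Standard $/GB-month

total_gb = ckpt_gb * per_day * days    # 58,800 GB on disk at the end of the run
monthly = total_gb * price_gb_month    # ~$1,350/month if you keep it all
print(f"{total_gb} GB, ~${monthly:,.0f}/month")
```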
Things That Will Bite You
torch.save in the Main Loop
If you call torch.save synchronously, the training step blocks. Use background threads or framework-level checkpointing (FSDP, DeepSpeed) that handles this for you.
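A minimal background-thread version, assuming you snapshot the tensors to CPU first so the next training step doesn't mutate them mid-write (the CPU copy is still synchronous, but far cheaper than the disk write):

```python
import threading
import torch

def async_save(state_dict, path):
    # Snapshot to CPU so the background write doesn't race with the
    # next training step updating the live parameters.
    snapshot = {k: v.detach().cpu().clone() if torch.is_tensor(v) else v
                for k, v in state_dict.items()}
    t = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    t.start()
    return t
```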
Too-Aggressive Cadence
Checkpointing every 5 minutes feels safe; it costs a fortune. Checkpoint often enough to bound failure loss, no more.
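If you want a number rather than a gut feel, the classic Young–Daly rule of thumb from the HPC checkpointing literature puts the interval that balances lost work against checkpoint overhead at roughly sqrt(2 × checkpoint cost × MTBF):

```python
import math

def checkpoint_interval_s(ckpt_cost_s, mtbf_s):
    # Young–Daly approximation: minimizes expected lost work plus
    # checkpoint overhead for a given cost and failure rate.
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# Example: a 60 s checkpoint stall and one failure a day suggests
# checkpointing roughly every 54 minutes.
print(checkpoint_interval_s(60, 24 * 3600) / 60)
```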
Non-Atomic Writes
Avoid writing directly to the final filename. Always temp-file + rename. This requires your filesystem to have real rename semantics — FUSE S3 mounts don't.
No Checkpoint Retention Policy
Without a lifecycle rule, you'll find 6TB of two-year-old checkpoints nobody remembers. Automate cleanup.
Checkpointing to a Different Region Than You're Training In
Cross-region writes are slow and expensive. Keep your checkpoint target in the same region as your cluster; replicate later if you need DR.
Why This Is a Training Pipes Sweet Spot
Checkpointing is where object storage's lack of POSIX hurts most (rename!) and where caching matters least for reads but most for writes. A caching NFS gateway gives you:
- Real atomic rename
- Fast local-NVMe write acknowledgment
- Asynchronous durability to cheap object storage
- Lifecycle rules on the backing bucket
- The same mount your training data is on