Sunday, May 24, 2026·5 min read
Checkpointing Large Models: A Storage Guide for ML Engineers
Checkpointing a large model is where storage systems get embarrassed. Training a 70B-parameter model produces a 500GB+ checkpoint every hour. If that write blocks the training loop, your GPUs sit idle. If it fails, you lose a run. If you save a copy every step to be safe, your storage bill explodes.
This post is a practical guide to designing a checkpointing pipeline that's fast, durable, and affordable.
The Checkpoint Budget
Start with the constraints:
- Write size: 50GB–1TB per checkpoint, depending on model size and optimizer state.
- Cadence: typically every 30 minutes to every few hours. Shorter cadences mean less lost work on failure, but more storage pressure.
- Durability: must survive node failure. A checkpoint that exists only on the node that wrote it is not a checkpoint.
- Read cadence: rare. You resume from a checkpoint if something breaks; most checkpoints are never read.
- Retention tail: you keep many checkpoints, sometimes dozens, and occasionally promote one to "this is the model we ship."
Why Checkpoint Writes Are Hard
1. They're Bursty
A training loop produces a steady stream of data reads. A checkpoint dumps hundreds of gigabytes in one short burst, then nothing for an hour. Your storage needs to handle the burst, not the average.
2. They Need to Be Atomic
A half-written checkpoint is worse than no checkpoint. If a node dies mid-write, the partial file shouldn't poison the restart. POSIX atomic rename is the usual answer: write to a temp name, fsync, rename to the final name. This requires real filesystem semantics — not all object-backed filesystems give you that.
3. They Block Training (If You're Not Careful)
A naive torch.save(state, path) blocks the training step until the bytes are flushed. A 500GB write at 1GB/s is more than eight minutes of stalled GPUs. That's expensive.
4. They Multiply
Saving every hour for a week = 168 checkpoints. At 500GB each = 84TB. Do this for several concurrent runs and your checkpoint storage becomes the dominant storage cost.
Architecture Patterns
Pattern 1: Local Disk + Background Upload
Write to local NVMe (fast), then upload to durable storage in the background.
Pros: training resumes immediately. Local NVMe is very fast.
Cons: window of vulnerability between local write and durable copy. Node death during that window = lost checkpoint.
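A minimal sketch of this pattern, assuming boto3 and a torch state dict; the bucket, key, and local path are placeholders:

```python
import threading
import boto3   # assumes AWS credentials are already configured
import torch

s3 = boto3.client("s3")

def save_then_upload(state, local_path, bucket, key):
    # Fast local write: training can continue as soon as this returns.
    torch.save(state, local_path)
    # Durable copy happens in the background; the checkpoint is
    # vulnerable until this upload finishes.
    t = threading.Thread(
        target=s3.upload_file, args=(local_path, bucket, key), daemon=True
    )
    t.start()
    return t  # join() later if you need to confirm durability
```

Track the returned thread (or a future): if you delete local checkpoints to free NVMe space, do it only after the corresponding upload has completed.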
Pattern 2: Direct to Object Storage (S3 API)
Write the checkpoint via multipart upload directly to S3.
Pros: durable immediately once upload completes. Cheap.
Cons: upload bandwidth becomes the bottleneck. Writing 500GB to S3 over a single TCP connection at 100 MB/s takes over 80 minutes. Parallel multipart uploads help, but require careful tuning.
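If you go this route, the knobs that matter are part size and concurrency. A hedged sketch using boto3's transfer manager; the bucket, key, and tuning values are placeholders to adjust for your NIC and node count:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Parallel multipart upload: bigger parts and more concurrency help on
# fat pipes, but too much concurrency starves other traffic on the node.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # use multipart above 64 MB
    multipart_chunksize=256 * 1024 * 1024,  # 256 MB parts
    max_concurrency=16,                     # parallel part uploads
)

s3 = boto3.client("s3")
s3.upload_file("step_1000.pt", "my-ckpt-bucket", "run42/step_1000.pt",
               Config=config)
```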
Pattern 3: NFS to a Caching Gateway
Write to a mounted NFS filesystem. The gateway handles:
- Write-back caching: writes land on local NVMe at full speed, training resumes immediately.
- Async durability: bytes drain to object storage in the background.
- Atomic rename: POSIX rename semantics work correctly.
- Asynchronous multipart upload: you don't manage the S3 API directly.
Pros: fastest write experience. Training barely notices the checkpoint. Durable within seconds.
Cons: briefly vulnerable during the drain window (measured in seconds, not minutes).
Pattern 4: Sharded Checkpoints (Distributed Writes)
With large distributed training (FSDP, DeepSpeed, Megatron), the model state is already sharded across nodes. Each node writes its own shard in parallel. This turns one 500GB write into eight 62GB concurrent writes, and total wall time drops accordingly.
Most teams combine Pattern 4 with one of Patterns 1-3 for the actual write target.
A Specific Recipe That Works
Here's a checkpoint pipeline we've seen work at scale:
1. Shard the Write Across Nodes
Use FSDP or DeepSpeed's checkpointing to give each rank its own piece of the state.
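As one illustration, FSDP can hand each rank a state dict containing only its own shard instead of gathering everything onto rank 0; the exact API has shifted across PyTorch releases, so treat this as a sketch and check your version:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

def local_shard_state(model):
    # Each rank gets only its own shard, so every node can write its
    # piece of the checkpoint in parallel.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        return model.state_dict()
```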
2. Mount a Training Pipes NFS Gateway on Every Node
```bash
sudo mount -t nfs4 $gateway:/checkpoints /mnt/ckpt
```
All nodes share the same namespace. Rank 0 writes its shard to /mnt/ckpt/step_1000/rank_0, rank 1 to rank_1, etc.
3. Write with Async Drain
```python
import os
import torch

def save_shard(state, path):
    tmp = path + ".tmp"          # write under a temporary name first
    torch.save(state, tmp)
    with open(tmp, "rb") as f:   # flush the bytes before the rename
        os.fsync(f.fileno())
    os.rename(tmp, path)         # atomic rename: readers never see a partial file
```
The gateway buffers the write to NVMe and acknowledges quickly. Your training loop moves on. Bytes drain to object storage in the background.
4. Barrier Before Continuing
```python
torch.distributed.barrier()
```
Once all ranks confirm their write, the checkpoint is logically complete. The gateway continues draining to durable storage asynchronously.
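Putting steps 2–4 together, a per-rank driver might look like the following, reusing the save_shard helper from step 3 (the step number and mount point are placeholders):

```python
import os
import torch.distributed as dist

def checkpoint_step(state, step, root="/mnt/ckpt"):
    rank = dist.get_rank()
    ckpt_dir = os.path.join(root, f"step_{step}")
    os.makedirs(ckpt_dir, exist_ok=True)

    # Each rank writes only its own shard (temp file + fsync + rename).
    save_shard(state, os.path.join(ckpt_dir, f"rank_{rank}"))

    # Logically complete once every rank has renamed its shard into place;
    # the gateway keeps draining to object storage in the background.
    dist.barrier()
```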
5. Lifecycle Policies on the Object Store
Don't keep every checkpoint forever. Configure your bucket to move older checkpoints to a cheaper storage class after N days, and to delete after M days. Keep only the "promote this one" checkpoints indefinitely.
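If you manage the bucket yourself, the rule is a few lines of boto3; the bucket name, prefix, and day counts below are placeholders, not recommendations:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ckpt-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-checkpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "checkpoints/"},
            # Colder storage class after 7 days, gone after 30.
            "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 30},
        }]
    },
)
```

Promoted checkpoints need to live under a different prefix (or carry a tag the rule excludes) so they survive the cleanup.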
Cost Math
For a 70B model with a 350GB checkpoint every hour, running a 7-day training job:
| Approach | Storage | Requests | Egress | Total/run |
|---|---|---|---|---|
| Direct to S3 Standard, keep all | $1,380 | $20 | $0 (in-region) | ~$1,400 |
| Local NVMe + async S3, lifecycle to Glacier after 7d | $350 | $15 | $0 | ~$365 |
| Training Pipes gateway, lifecycle policy | $280 | included | $0 | ~$280 |
The savings come from two places: cheaper storage on the cold tail (lifecycle rules), and no per-request overhead for gateway-managed writes.
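The arithmetic behind rows like these is easy to redo for your own run; the rate below is an illustrative S3 Standard price, not a quote:

```python
ckpt_gb = 350            # checkpoint size, GB
per_day = 24             # one checkpoint per hour
days = 7                 # length of the run
price_gb_month = 0.023   # illustrative S3 Standard $/GB-month

total_gb = ckpt_gb * per_day * days    # 58,800 GB on disk at the end of the run
monthly = total_gb * price_gb_month    # ~$1,350/month if you keep it all
print(f"{total_gb} GB, ~${monthly:,.0f}/month")
```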
Things That Will Bite You
torch.save in the Main Loop
If you call torch.save synchronously, the training step blocks. Use background threads or framework-level checkpointing (FSDP, DeepSpeed) that handles this for you.
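A minimal background-thread version, assuming you snapshot the tensors to CPU first so the next training step doesn't mutate them mid-write (the CPU copy is still synchronous, but far cheaper than the disk write):

```python
import threading
import torch

def async_save(state_dict, path):
    # Snapshot to CPU so the background write doesn't race with the
    # next training step updating the live parameters.
    snapshot = {k: v.detach().cpu().clone() if torch.is_tensor(v) else v
                for k, v in state_dict.items()}
    t = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    t.start()
    return t
```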
Too-Aggressive Cadence
Checkpointing every 5 minutes feels safe; it costs a fortune. Checkpoint often enough to bound failure loss, no more.
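If you want a number rather than a gut feel, the classic Young–Daly rule of thumb from the HPC checkpointing literature puts the interval that balances lost work against checkpoint overhead at roughly sqrt(2 × checkpoint cost × MTBF):

```python
import math

def checkpoint_interval_s(ckpt_cost_s, mtbf_s):
    # Young–Daly approximation: minimizes expected lost work plus
    # checkpoint overhead for a given cost and failure rate.
    return math.sqrt(2 * ckpt_cost_s * mtbf_s)

# Example: a 60 s checkpoint stall and one failure a day suggests
# checkpointing roughly every 54 minutes.
print(checkpoint_interval_s(60, 24 * 3600) / 60)
```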
Non-Atomic Writes
Avoid writing directly to the final filename. Always temp-file + rename. This requires your filesystem to have real rename semantics — FUSE S3 mounts don't.
No Checkpoint Retention Policy
Without a lifecycle rule, you'll find 6TB of two-year-old checkpoints nobody remembers. Automate cleanup.
Checkpointing to a Different Region Than You're Training In
Cross-region writes are slow and expensive. Keep your checkpoint target in the same region as your cluster; replicate later if you need DR.
Why This Is a Training Pipes Sweet Spot
Checkpointing is where object storage's lack of POSIX hurts most (rename!) and where caching matters least for reads but most for writes. A caching NFS gateway gives you:
- Real atomic rename
- Fast local-NVMe write acknowledgment
- Asynchronous durability to cheap object storage
- Lifecycle rules on the backing bucket
- The same mount your training data is on