Checkpoint Engine

Overview

Checkpoint Engine is an unified abstract layer to synchronize weights between various training backends and inference backends. It provides three unified APIs:

send_weights: get named tensors from generator and send them in streaming manner.
receive_weights: return a tensor generator that yield named tensors in streaming manner.
get_weights: return a tensor generator that yield named tensors in streaming manner, used for each inference instance update weight independently from local cache (e.g share memory, disk).

Supported Backends

	Comm Library	Topology	Hardware	Performance	Elastic	Use case
naive	torch.distributed	all_gather	NVIDIA/AMD/Ascend	Very High	NA	On-policy training - Trainer/rollout colocated
nccl	NCCL	all_gather+broadcast	NVIDIA GPU & NCCL	Very High	Low: rebuild nccl group	Off-policy training - Trainer/rollout disaggregated - Fixed clusters
hccl	HCCL	all_gather+broadcast	Ascend NPU & HCCL	High	Low: rebuild hccl group	Off-policy training - Trainer/rollout disaggregated - Fixed clusters
nixl	NIXL	all_gather+ring p2p	Various transport backends (D2D, H2H, H2D, etc) - UCX - UCCL - Mooncacke	Medium/High	High: dynamic adjust ring topology	Off-policy training - Trainer/rollout disaggregated - Elastic rollout - Rollout fault tolerance - Heterogeneous hardware rollout
kimi_ckpt_engine	MOONCAKE+NCCL/HCCL	p2p+broadcast	NVIDIA/Ascend	High	Low: rebuild communication group	Off-policy training - Trainer/rollout disaggregated - Save checkpoint each time
mooncake	Mooncake Transfer Engine	all_gather+ring p2p	NVIDIA/Ascend	High	High: dynamic adjust ring topology	Off-policy training - Trainer/rollout disaggregated - Fixed clusters

kimi_ckpt_engine detail:

In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout creates a sub communication group that includes all the cards for the rollout. Then, using Mooncake transfer engine, these weights are transmitted via P2P to a specific worker in the rollout, followed by a broadcast to all other rollout workers.

This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via pip install 'checkpoint-engine[p2p]' and that your version is 0.4.0 or higher.

In addition, during the installation of checkpoint-engine[p2p], the transfer engine will be installed. However, This library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct

Note: Important Configuration for Ascend Devices If you are using CANN version >= 8.5.0 on Ascend devices, you must set the following environment variable to enable intra-node ROCE:

export HCCL_INTRA_ROCE_ENABLE=1

Benchmark

benchmark setup

model: Qwen/Qwen3-30B-A3B-Base
trainer: fsdp world_size=2 (since Ascend 910C has 64GB of HBM, we set world_size=4)
rollout: num_rollout=30 (only receive weight without cuda ipc to vllm/sglang)

pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py

benchmark result

hardware	backend	time cost (s)	Bandwidth(GB/s)
4*8 H100, ConnectX-7 400 Gbps (InfiniBand)	NCCL	~7	8.25
4*8 H100, ConnectX-7 400 Gbps (InfiniBand)	NIXL	~7	8.25
2*16 Ascend 910C, inner suppernode	HCCL	~11	5.3
2*16 Ascend 910C, inner suppernode	kimi_ckpt_engine	offload: 7 update: 3.5	16.5
2*8 H100, ConnectX-7 400 Gbps (InfiniBand)	mooncake	5.93	9.44