Checkpoint Engine
Overview
Checkpoint Engine is an unified abstract layer to synchronize weights between various training backends and inference backends. It provides three unified APIs:
- send_weights: get named tensors from generator and send them in streaming manner.
- receive_weights: return a tensor generator that yield named tensors in streaming manner.
- get_weights: return a tensor generator that yield named tensors in streaming manner, used for each inference instance update weight independently from local cache (e.g share memory, disk).
Supported Backends
| Comm Library | Topology | Hardware | Performance | Elastic | Use case | |
|---|---|---|---|---|---|---|
| naive | torch.distributed | all_gather | NVIDIA/AMD/Ascend | Very High | NA | On-policy training - Trainer/rollout colocated |
| nccl | NCCL | all_gather+broadcast | NVIDIA GPU & NCCL | Very High | Low: rebuild nccl group | Off-policy training - Trainer/rollout disaggregated - Fixed clusters |
| hccl | HCCL | all_gather+broadcast | Ascend NPU & HCCL | High | Low: rebuild hccl group | Off-policy training - Trainer/rollout disaggregated - Fixed clusters |
| nixl | NIXL | all_gather+ring p2p | Various transport backends (D2D, H2H, H2D, etc) - UCX - UCCL - Mooncacke |
Medium/High | High: dynamic adjust ring topology | Off-policy training - Trainer/rollout disaggregated - Elastic rollout - Rollout fault tolerance - Heterogeneous hardware rollout |
| kimi_ckpt_engine | MOONCAKE+NCCL/HCCL | p2p+broadcast | NVIDIA/Ascend | High | Low: rebuild communication group | Off-policy training - Trainer/rollout disaggregated - Save checkpoint each time |
| mooncake | Mooncake Transfer Engine | all_gather+ring p2p | NVIDIA/Ascend | High | High: dynamic adjust ring topology | Off-policy training - Trainer/rollout disaggregated - Fixed clusters |
kimi_ckpt_engine detail:
In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout creates a sub communication group that includes all the cards for the rollout. Then, using Mooncake transfer engine, these weights are transmitted via P2P to a specific worker in the rollout, followed by a broadcast to all other rollout workers.
This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via pip install 'checkpoint-engine[p2p]' and that your version is 0.4.0 or higher.
In addition, during the installation of checkpoint-engine[p2p], the transfer engine will be installed. However, This library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct
Note: Important Configuration for Ascend Devices If you are using CANN version >= 8.5.0 on Ascend devices, you must set the following environment variable to enable intra-node ROCE:
export HCCL_INTRA_ROCE_ENABLE=1
Benchmark
- benchmark setup
- model: Qwen/Qwen3-30B-A3B-Base
- trainer: fsdp world_size=2 (since Ascend 910C has 64GB of HBM, we set world_size=4)
- rollout: num_rollout=30 (only receive weight without cuda ipc to vllm/sglang)
pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py
- benchmark result
| hardware | backend | time cost (s) | Bandwidth(GB/s) |
|---|---|---|---|
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NCCL | ~7 | 8.25 |
| 4*8 H100, ConnectX-7 400 Gbps (InfiniBand) | NIXL | ~7 | 8.25 |
| 2*16 Ascend 910C, inner suppernode | HCCL | ~11 | 5.3 |
| 2*16 Ascend 910C, inner suppernode | kimi_ckpt_engine | offload: 7 update: 3.5 | 16.5 |
| 2*8 H100, ConnectX-7 400 Gbps (InfiniBand) | mooncake | 5.93 | 9.44 |
