LeTue09's picture
initial clean commit
1faccd4

Checkpoint Engine

Overview

Checkpoint Engine is an unified abstract layer to synchronize weights between various training backends and inference backends. It provides three unified APIs:

  • send_weights: get named tensors from generator and send them in streaming manner.
  • receive_weights: return a tensor generator that yield named tensors in streaming manner.
  • get_weights: return a tensor generator that yield named tensors in streaming manner, used for each inference instance update weight independently from local cache (e.g share memory, disk).

checkpoint-engine

Supported Backends

Comm Library Topology Hardware Performance Elastic Use case
naive torch.distributed all_gather NVIDIA/AMD/Ascend Very High NA On-policy training
- Trainer/rollout colocated
nccl NCCL all_gather+broadcast NVIDIA GPU & NCCL Very High Low: rebuild nccl group Off-policy training
- Trainer/rollout disaggregated
- Fixed clusters
hccl HCCL all_gather+broadcast Ascend NPU & HCCL High Low: rebuild hccl group Off-policy training
- Trainer/rollout disaggregated
- Fixed clusters
nixl NIXL all_gather+ring p2p Various transport backends (D2D, H2H, H2D, etc)
- UCX
- UCCL
- Mooncacke
Medium/High High: dynamic adjust ring topology Off-policy training
- Trainer/rollout disaggregated
- Elastic rollout
- Rollout fault tolerance
- Heterogeneous hardware rollout
kimi_ckpt_engine MOONCAKE+NCCL/HCCL p2p+broadcast NVIDIA/Ascend High Low: rebuild communication group Off-policy training
- Trainer/rollout disaggregated
- Save checkpoint each time
mooncake Mooncake Transfer Engine all_gather+ring p2p NVIDIA/Ascend High High: dynamic adjust ring topology Off-policy training
- Trainer/rollout disaggregated
- Fixed clusters
kimi_ckpt_engine detail:

In the kimi_ckpt_engine workflow, the trainer first offloads the weights to the CPU, and the rollout creates a sub communication group that includes all the cards for the rollout. Then, using Mooncake transfer engine, these weights are transmitted via P2P to a specific worker in the rollout, followed by a broadcast to all other rollout workers.

kimi-ckpt-engine

This mode requires the P2P feature of checkpoint_engine. Please ensure you have installed it via pip install 'checkpoint-engine[p2p]' and that your version is 0.4.0 or higher.

In addition, during the installation of checkpoint-engine[p2p], the transfer engine will be installed. However, This library has no prebuilt packages for Ascend devices and must be compiled from source. For detailed compilation instructions, see: transfer-engine: ascend direct

Note: Important Configuration for Ascend Devices If you are using CANN version >= 8.5.0 on Ascend devices, you must set the following environment variable to enable intra-node ROCE:

export HCCL_INTRA_ROCE_ENABLE=1

Benchmark

  1. benchmark setup
  • model: Qwen/Qwen3-30B-A3B-Base
  • trainer: fsdp world_size=2 (since Ascend 910C has 64GB of HBM, we set world_size=4)
  • rollout: num_rollout=30 (only receive weight without cuda ipc to vllm/sglang)
pytest tests/checkpoint_engine/test_correctness_on_gpu.py
pytest tests/checkpoint_engine/test_correctness_on_npu.py
pytest tests/checkpoint_engine/test_special_server_adapter.py
  1. benchmark result
hardware backend time cost (s) Bandwidth(GB/s)
4*8 H100, ConnectX-7 400 Gbps (InfiniBand) NCCL ~7 8.25
4*8 H100, ConnectX-7 400 Gbps (InfiniBand) NIXL ~7 8.25
2*16 Ascend 910C, inner suppernode HCCL ~11 5.3
2*16 Ascend 910C, inner suppernode kimi_ckpt_engine offload: 7 update: 3.5 16.5
2*8 H100, ConnectX-7 400 Gbps (InfiniBand) mooncake 5.93 9.44