# Checkpoint Engine Integration
The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.
## Overview
The checkpoint engine integration allows SGLang to:
- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks like CUDA graph capture
- Support both single-node and multi-node deployments
## Installation

First, install the checkpoint engine package:

```shell
pip install 'checkpoint-engine[p2p]'
```
## Architecture

The system consists of two main components:

- SGLang Server: Runs with the `--wait-for-initial-weights` flag to wait for weights before becoming ready
- Checkpoint Engine Workers: Separate processes (managed by `torchrun`) that load and distribute model weights
The checkpoint engine uses a parameter server architecture with support for:
- Broadcast mode: Weights are broadcast from loading processes to inference processes
- P2P mode: Direct peer-to-peer weight transfer between processes
- All mode: Combination of both broadcast and P2P methods
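All of the examples below use broadcast mode. Assuming the same entrypoint accepts the other values of `--update-method` listed above, P2P transfer could be selected the same way (a sketch, not verified on a live cluster):

```shell
# Hypothetical sketch: request P2P weight transfer instead of broadcast.
# Assumes the same entrypoint and flags used throughout this document.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```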
## Usage Examples
### Single Node Setup
**Terminal 1 - Launch SGLang Server:**

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```
**Terminal 2 - Run Checkpoint Engine:**

Using the sglang entrypoint:

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
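Once the checkpoint engine finishes and the server becomes ready, a quick smoke test confirms that real weights (not the dummy placeholders) are being served. This is a sketch: the port depends on your launch flags, and 30000 is assumed here as SGLang's usual default:

```shell
# Smoke test against the running server; adjust host/port to your deployment.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 8}}'
```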
### Multi-Node Setup (2 Nodes)
**Node 0:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
**Node 1:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
### Multi-Node Setup with Tensor Parallelism (TP=16)
**Node 0:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
**Node 1:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
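In the TP=16 layout above, `--inference-parallel-size` matches the total tensor-parallel world size, i.e. the node count times the per-node worker count. A minimal check of that arithmetic:

```shell
# Total inference parallel size = nodes x processes per node
# (2 nodes x 8 workers each = 16, matching --inference-parallel-size 16 above).
NNODES=2
NPROC_PER_NODE=8
INFERENCE_PARALLEL_SIZE=$((NNODES * NPROC_PER_NODE))
echo "$INFERENCE_PARALLEL_SIZE"
```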
## Configuration Options
### SGLang Server Options
- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
- `--wait-for-initial-weights`: Wait for the checkpoint engine to provide weights before becoming ready
- `--host`: Host address for multi-node setups
- `--dist-init-addr`: Distributed initialization address for tensor parallelism
### Checkpoint Engine Options
- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to the model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File to save checkpoint metadata
- `--load-metas-file`: File to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for weights
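The metadata flags suggest a two-phase flow: one invocation registers a checkpoint and saves its metadata, and a later invocation reuses that file. The sketch below is an assumption based only on the flag names listed above, not a verified workflow:

```shell
# Hypothetical two-phase use of the metadata flags (unverified sketch).
# Phase 1: load the checkpoint and record its metadata.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --save-metas-file /tmp/ckpt-metas.json

# Phase 2: later, reuse the saved metadata instead of re-reading the checkpoint layout.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --load-metas-file /tmp/ckpt-metas.json \
    --inference-parallel-size 8
```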
## Performance Benefits
The checkpoint engine provides significant time savings in two main aspects:
- Multi-node loading: Each node loads only a portion of the weights from disk, effectively increasing aggregate disk bandwidth; more participating nodes yield greater acceleration. Preliminary tests show a roughly 20-second speedup when loading DeepSeek-R1 on H20-3e with two nodes.
- Single-process optimization: The dummy load format lets the disk-to-CPU transfer overlap with CUDA graph capture and other initialization tasks, providing additional time savings.
## Troubleshooting
- Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor logs for connection errors between the SGLang server and the checkpoint engine
- Use the `--sleep-time` parameter to add delays if needed for debugging
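When diagnosing connection errors, a useful first step is confirming that the server endpoint the checkpoint engine talks to is reachable. A sketch, assuming the default endpoint listed above and SGLang's standard `/health` route:

```shell
# Probe the endpoint the checkpoint engine will connect to (adjust to your setup).
curl -sf http://localhost:19730/health && echo "server reachable"
```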