# Checkpoint Engine Integration
The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.
## Overview
The checkpoint engine integration allows SGLang to:
- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks like CUDA graph capture
- Support both single-node and multi-node deployments
## Installation

First, install the checkpoint engine package:

```shell
pip install 'checkpoint-engine[p2p]'
```
## Architecture

The system consists of two main components:

- SGLang Server: Runs with the `--wait-for-initial-weights` flag to wait for weights before becoming ready
- Checkpoint Engine Workers: Separate processes (managed by `torchrun`) that load and distribute model weights
The checkpoint engine uses a parameter server architecture with support for:
- Broadcast mode: Weights are broadcast from loading processes to inference processes
- P2P mode: Direct peer-to-peer weight transfer between processes
- All mode: Combination of both broadcast and P2P methods
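All of the examples below use broadcast mode. Assuming the same entrypoint accepts the other values of `--update-method` listed above, P2P transfer could be selected the same way (a sketch, not verified on a live cluster):

```shell
# Hypothetical sketch: request P2P weight transfer instead of broadcast.
# Assumes the same entrypoint and flags used throughout this document.
python -m sglang.srt.checkpoint_engine.update \
    --update-method p2p \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```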
## Usage Examples
### Single Node Setup
**Terminal 1 - Launch SGLang Server:**

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights
```
**Terminal 2 - Run Checkpoint Engine:**

Using the sglang entrypoint:

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
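Once the checkpoint engine finishes and the server becomes ready, a quick smoke test confirms that real weights (not the dummy placeholders) are being served. This is a sketch: the port depends on your launch flags, and 30000 is assumed here as SGLang's usual default:

```shell
# Smoke test against the running server; adjust host/port to your deployment.
curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "The capital of France is", "sampling_params": {"max_new_tokens": 8}}'
```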
### Multi-Node Setup (2 Nodes)
**Node 0:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
**Node 1:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP]
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8
```
### Multi-Node Setup with Tensor Parallelism (TP=16)
**Node 0:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 0
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 0 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
**Node 1:**
Launch SGLang server:

```shell
python -m sglang.launch_server \
    --model-path Qwen/Qwen3-8B \
    --tp 8 \
    --load-format dummy \
    --wait-for-initial-weights \
    --host [IP] \
    --dist-init-addr [IP]:9120 \
    --nnodes 2 \
    --node-rank 1
```
Run checkpoint engine, using the sglang entrypoint (recommended):

```shell
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
Or using torchrun directly:

```shell
torchrun --nproc-per-node 8 \
    --nnodes 2 \
    --node-rank 1 \
    --master-addr [IP] \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 16
```
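In the TP=16 layout above, `--inference-parallel-size` matches the total tensor-parallel world size, i.e. the node count times the per-node worker count. A minimal check of that arithmetic:

```shell
# Total inference parallel size = nodes x processes per node
# (2 nodes x 8 workers each = 16, matching --inference-parallel-size 16 above).
NNODES=2
NPROC_PER_NODE=8
INFERENCE_PARALLEL_SIZE=$((NNODES * NPROC_PER_NODE))
echo "$INFERENCE_PARALLEL_SIZE"
```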
## Configuration Options
### SGLang Server Options
- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
- `--wait-for-initial-weights`: Wait for the checkpoint engine to provide weights before becoming ready
- `--host`: Host address for multi-node setups
- `--dist-init-addr`: Distributed initialization address for tensor parallelism
### Checkpoint Engine Options
- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to the model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File to save checkpoint metadata
- `--load-metas-file`: File to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for weights
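The metadata flags suggest a two-phase flow: one invocation registers a checkpoint and saves its metadata, and a later invocation reuses that file. The sketch below is an assumption based only on the flag names listed above, not a verified workflow:

```shell
# Hypothetical two-phase use of the metadata flags (unverified sketch).
# Phase 1: load the checkpoint and record its metadata.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size 8 \
    --save-metas-file /tmp/ckpt-metas.json

# Phase 2: later, reuse the saved metadata instead of re-reading the checkpoint layout.
python -m sglang.srt.checkpoint_engine.update \
    --update-method broadcast \
    --load-metas-file /tmp/ckpt-metas.json \
    --inference-parallel-size 8
```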
## Performance Benefits
The checkpoint engine provides significant time savings in two main aspects:
- Multi-node loading: Each node loads only a portion of the weights from disk, effectively increasing aggregate disk bandwidth; more participating nodes yield greater acceleration. Preliminary tests show a roughly 20-second speedup when loading DeepSeek-R1 on H20-3e with two nodes.
- Single-process optimization: The dummy load format lets the disk-to-CPU transfer overlap with CUDA graph capture and other initialization tasks, providing additional time savings.
## Troubleshooting
- Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor logs for connection errors between the SGLang server and the checkpoint engine
- Use the `--sleep-time` parameter to add delays if needed for debugging
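When diagnosing connection errors, a useful first step is confirming that the server endpoint the checkpoint engine talks to is reachable. A sketch, assuming the default endpoint listed above and SGLang's standard `/health` route:

```shell
# Probe the endpoint the checkpoint engine will connect to (adjust to your setup).
curl -sf http://localhost:19730/health && echo "server reachable"
```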