| # Checkpoint Engine Integration |
|
|
| The SGLang checkpoint engine integration loads model weights through a distributed checkpoint loading system. It reduces model loading time, especially for large models and multi-node setups, by parallelizing weight loading across multiple processes and nodes. |
|
|
| ## Overview |
|
|
| The checkpoint engine integration allows SGLang to: |
| - Load model weights in parallel using multiple processes |
| - Distribute weight loading across multiple nodes to increase effective disk bandwidth |
| - Overlap weight loading with other initialization tasks such as CUDA graph capture |
| - Support both single-node and multi-node deployments |
|
|
| ## Installation |
|
|
| First, install the checkpoint engine package: |
|
|
| ```bash |
| pip install 'checkpoint-engine[p2p]' |
| ``` |
|
|
| ## Architecture |
|
|
| The system consists of two main components: |
|
|
| 1. **SGLang Server**: Runs with the `--wait-for-initial-weights` flag, so it waits for weights before becoming ready |
| 2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights |
|
|
| The checkpoint engine uses a parameter server architecture and supports three weight update methods, selected with `--update-method`: |
| - **Broadcast mode**: Weights are broadcast from loading processes to inference processes |
| - **P2P mode**: Direct peer-to-peer weight transfer between processes |
| - **All mode**: Combination of both broadcast and P2P methods |
|
|
| ## Usage Examples |
|
|
| ### Single Node Setup |
|
|
| **Terminal 1 - Launch SGLang Server:** |
| ```bash |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tp 8 \ |
| --load-format dummy \ |
| --wait-for-initial-weights |
| ``` |
|
|
| **Terminal 2 - Run Checkpoint Engine:** |
|
|
| Using sglang entrypoint: |
| ```bash |
| python -m sglang.srt.checkpoint_engine.update \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
|
|
| Using torchrun directly: |
| ```bash |
| torchrun --nproc-per-node 8 \ |
| examples/checkpoint_engine/update.py \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
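
Because the server started with `--wait-for-initial-weights` only becomes ready once weights arrive, a script that sends requests afterwards may want to wait for it. The following is a minimal sketch, not part of SGLang: `wait_for_http` is a hypothetical helper, and the URL in the usage comment (port `30000` and the `/health` path) is an assumption that should be checked against your server configuration.

```bash
# Hypothetical helper: poll a URL until it returns an HTTP success status
# or the timeout (in seconds) expires.
# Usage: wait_for_http URL [TIMEOUT_SECONDS]
wait_for_http() {
  local url=$1 timeout=${2:-600} waited=0
  until curl --silent --fail --output /dev/null --max-time 2 "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${timeout}s waiting for $url" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Example (assumed URL, adjust host, port, and path to your deployment):
#   wait_for_http http://localhost:30000/health 600
```

The helper returns a non-zero status on timeout, so it can gate later steps with `wait_for_http ... && next_command`.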
|
|
| ### Multi-Node Setup (2 Nodes) |
|
|
| **Node 0:** |
|
|
| Launch SGLang server: |
| ```bash |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tp 8 \ |
| --load-format dummy \ |
| --wait-for-initial-weights \ |
| --host [IP] |
| ``` |
|
|
| Run checkpoint engine: |
|
|
| Using sglang entrypoint (recommended): |
| ```bash |
| python -m sglang.srt.checkpoint_engine.update \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
|
|
| Using torchrun directly: |
| ```bash |
| torchrun --nproc-per-node 8 \ |
| --nnodes 2 \ |
| --node-rank 0 \ |
| --master-addr [IP] \ |
| --master-port 29500 \ |
| examples/checkpoint_engine/update.py \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
|
|
| **Node 1:** |
|
|
| Launch SGLang server: |
| ```bash |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tp 8 \ |
| --load-format dummy \ |
| --wait-for-initial-weights \ |
| --host [IP] |
| ``` |
|
|
| Run checkpoint engine: |
|
|
| Using sglang entrypoint (recommended): |
| ```bash |
| python -m sglang.srt.checkpoint_engine.update \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
|
|
| Using torchrun directly: |
| ```bash |
| torchrun --nproc-per-node 8 \ |
| --nnodes 2 \ |
| --node-rank 1 \ |
| --master-addr [IP] \ |
| --master-port 29500 \ |
| examples/checkpoint_engine/update.py \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 8 |
| ``` |
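
The torchrun commands for the two nodes differ only in `--node-rank`. As a minimal sketch, the hypothetical helper below (its name and argument order are illustrative and not part of SGLang or checkpoint-engine) prints the command for a given rank so that one script can be reused on every node. It uses only the flags shown above; `10.0.0.1` is a placeholder address.

```bash
# Hypothetical helper: print the torchrun command for one node.
# Usage: print_update_cmd NODE_RANK MASTER_ADDR [NNODES] [INFERENCE_PARALLEL_SIZE]
print_update_cmd() {
  local node_rank=$1 master_addr=$2 nnodes=${3:-2} parallel_size=${4:-8}
  printf '%s ' \
    torchrun --nproc-per-node 8 \
    --nnodes "$nnodes" \
    --node-rank "$node_rank" \
    --master-addr "$master_addr" \
    --master-port 29500 \
    examples/checkpoint_engine/update.py \
    --update-method broadcast \
    --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
    --inference-parallel-size "$parallel_size"
  printf '\n'
}

# Print the command for node 1; replace the placeholder with the master node's IP.
print_update_cmd 1 10.0.0.1
```

Printing the command rather than running it makes it easy to inspect before launching it on each node.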
|
|
| ### Multi-Node Setup with Tensor Parallelism (TP=16) |
|
|
| **Node 0:** |
|
|
| Launch SGLang server: |
| ```bash |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tp 16 \ |
| --load-format dummy \ |
| --wait-for-initial-weights \ |
| --host [IP] \ |
| --dist-init-addr [IP]:9120 \ |
| --nnodes 2 \ |
| --node-rank 0 |
| ``` |
|
|
| Run checkpoint engine: |
|
|
| Using sglang entrypoint (recommended): |
| ```bash |
| python -m sglang.srt.checkpoint_engine.update \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 16 |
| ``` |
|
|
| Using torchrun directly: |
| ```bash |
| torchrun --nproc-per-node 8 \ |
| --nnodes 2 \ |
| --node-rank 0 \ |
| --master-addr [IP] \ |
| --master-port 29500 \ |
| examples/checkpoint_engine/update.py \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 16 |
| ``` |
|
|
| **Node 1:** |
|
|
| Launch SGLang server: |
| ```bash |
| python -m sglang.launch_server \ |
| --model-path Qwen/Qwen3-8B \ |
| --tp 16 \ |
| --load-format dummy \ |
| --wait-for-initial-weights \ |
| --host [IP] \ |
| --dist-init-addr [IP]:9120 \ |
| --nnodes 2 \ |
| --node-rank 1 |
| ``` |
|
|
| Run checkpoint engine: |
|
|
| Using sglang entrypoint (recommended): |
| ```bash |
| python -m sglang.srt.checkpoint_engine.update \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 16 |
| ``` |
|
|
| Using torchrun directly: |
| ```bash |
| torchrun --nproc-per-node 8 \ |
| --nnodes 2 \ |
| --node-rank 1 \ |
| --master-addr [IP] \ |
| --master-port 29500 \ |
| examples/checkpoint_engine/update.py \ |
| --update-method broadcast \ |
| --checkpoint-path /path/to/Qwen/Qwen3-8B/ \ |
| --inference-parallel-size 16 |
| ``` |
|
|
| ## Configuration Options |
|
|
| ### SGLang Server Options |
|
|
| - `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks) |
| - `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready |
| - `--host`: Host address for multi-node setups |
| - `--dist-init-addr`: Distributed initialization address for tensor parallelism |
|
|
| ### Checkpoint Engine Options |
|
|
| - `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`) |
| - `--checkpoint-path`: Path to model checkpoint directory |
| - `--inference-parallel-size`: Number of inference parallel processes |
| - `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`) |
| - `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`) |
| - `--save-metas-file`: File to save checkpoint metadata |
| - `--load-metas-file`: File to load checkpoint metadata from |
| - `--uds`: Unix domain socket path for communication |
| - `--weight-version`: Version identifier for weights |
|
|
| ## Performance Benefits |
|
|
| The checkpoint engine saves loading time in two main ways: |
|
|
| 1. **Multi-node Loading**: Each node loads only a portion of the weights from disk, which increases the effective disk bandwidth. Adding more nodes increases the speedup. Preliminary tests show a 20-second reduction in loading time for DeepSeek-R1 on H20-3e with two nodes. |
|
|
| 2. **Single Process Optimization**: Using the dummy load format allows the disk-to-CPU transfer to overlap with CUDA graph capture and other initialization tasks, which saves additional time. |
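
As a rough illustration of the multi-node effect, the sketch below assumes that loading is purely disk-bound and that the weights are split evenly across nodes, so the estimated time is the checkpoint size divided by the aggregate disk bandwidth. The function name and the numbers (a 640 GB checkpoint, 2 GB/s per node) are hypothetical and are not measurements.

```bash
# Idealized estimate: seconds = size_gb / (per_node_bandwidth_gb_per_s * nodes).
# Ignores network transfer, CPU-to-GPU copies, and overlap with other startup work.
estimate_load_seconds() {
  local size_gb=$1 bandwidth_gb_per_s=$2 nodes=$3
  echo $(( size_gb / (bandwidth_gb_per_s * nodes) ))
}

estimate_load_seconds 640 2 1   # one node
estimate_load_seconds 640 2 2   # two nodes: half the single-node estimate
```

Real gains are likely to be smaller than this bound, for example when all nodes read from the same shared storage.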
|
|
| ## Troubleshooting |
|
|
| - Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'` |
| - Verify network connectivity between nodes in multi-node setups |
| - Check that the checkpoint path contains valid model files |
| - Monitor logs for connection errors between SGLang server and checkpoint engine |
| - Use the `--sleep-time` parameter to add delays if needed for debugging |
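
Two of the checks above can be scripted before launching anything. This is a minimal sketch with hypothetical helper names. It assumes bash (for `/dev/tcp`), the coreutils `timeout` command, and, in the usage comments, that the checkpoint is stored as `*.safetensors` files.

```bash
# Hypothetical helper: succeed if DIR contains at least one file matching PATTERN.
has_files() {
  local dir=$1 pattern=$2
  [ -d "$dir" ] && [ -n "$(find "$dir" -maxdepth 1 -name "$pattern" -print -quit)" ]
}

# Hypothetical helper: succeed if a TCP connection to HOST:PORT opens within 3 seconds.
can_connect() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example usage (file pattern is an assumption; [IP] is the master node address):
#   has_files /path/to/Qwen/Qwen3-8B/ '*.safetensors' || echo "no weight files found"
#   can_connect [IP] 29500 || echo "cannot reach master port"
```

`can_connect` only succeeds while something is listening on the port, so it is most useful once the process on the other node has started.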
|
|
| ## References |
|
|
| - [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine) |
|
|