# Checkpoint Engine Integration
The SGLang checkpoint engine integration loads model weights through a distributed checkpoint loading system. It significantly reduces model loading time, especially for large models and multi-node deployments, by parallelizing weight loading across multiple processes and nodes.
## Overview
The checkpoint engine integration allows SGLang to:
- Load model weights in parallel using multiple processes
- Distribute weight loading across multiple nodes to increase effective disk bandwidth
- Overlap weight loading with other initialization tasks like CUDA graph capture
- Support both single-node and multi-node deployments
## Installation
First, install the checkpoint engine package:
```bash
pip install 'checkpoint-engine[p2p]'
```
## Architecture
The system consists of two main components:
1. **SGLang Server**: Runs with `--wait-for-initial-weights` flag to wait for weights before becoming ready
2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights
The checkpoint engine uses a parameter server architecture with support for:
- **Broadcast mode**: Weights are broadcast from loading processes to inference processes
- **P2P mode**: Direct peer-to-peer weight transfer between processes
- **All mode**: Combination of both broadcast and P2P methods
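The three modes can be thought of as selecting which transfer phases run. The sketch below is conceptual only (the real checkpoint-engine API differs); it simply illustrates the mapping from `--update-method` to transfer phases:

```python
# Conceptual sketch only: function and phase names are illustrative,
# not part of the checkpoint-engine API.

def transfer_phases(update_method: str) -> list[str]:
    """Return the weight-transfer phases implied by --update-method.

    'broadcast' pushes weights from loader ranks to all inference ranks;
    'p2p' transfers weights directly between peers; 'all' runs both.
    """
    phases = {
        "broadcast": ["broadcast"],
        "p2p": ["p2p"],
        "all": ["broadcast", "p2p"],
    }
    if update_method not in phases:
        raise ValueError(f"unknown update method: {update_method}")
    return phases[update_method]
```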
## Usage Examples
### Single Node Setup
**Terminal 1 - Launch SGLang Server:**
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights
```
**Terminal 2 - Run Checkpoint Engine:**
Using sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
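Because the server is launched with `--wait-for-initial-weights`, it only starts serving once the checkpoint engine has delivered the weights. A minimal readiness-polling sketch for automation scripts; the `is_ready` callable is a stand-in for a real check (for example, an HTTP GET against the server's health endpoint, whose port depends on your deployment):

```python
import time

def wait_until_ready(is_ready, timeout_s: float = 600.0,
                     interval_s: float = 2.0) -> bool:
    """Poll a readiness probe until it returns True or the timeout expires.

    `is_ready` stands in for a real probe, e.g. an HTTP request to the
    SGLang server's health endpoint.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval_s)
    return False

# Example with a stub probe that becomes ready on the third attempt:
attempts = iter([False, False, True])
ready = wait_until_ready(lambda: next(attempts),
                         timeout_s=5.0, interval_s=0.0)
```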
### Multi-Node Setup (2 Nodes)
**Node 0:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP]
```
Run checkpoint engine:
Using sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 0 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
**Node 1:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 8 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP]
```
Run checkpoint engine:
Using sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 1 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 8
```
### Multi-Node Setup with Tensor Parallelism (TP=16)
**Node 0:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 16 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP] \
--dist-init-addr [IP]:9120 \
--nnodes 2 \
--node-rank 0
```
Run checkpoint engine:
Using sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 0 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
**Node 1:**
Launch SGLang server:
```bash
python -m sglang.launch_server \
--model-path Qwen/Qwen3-8B \
--tp 16 \
--load-format dummy \
--wait-for-initial-weights \
--host [IP] \
--dist-init-addr [IP]:9120 \
--nnodes 2 \
--node-rank 1
```
Run checkpoint engine:
Using sglang entrypoint (recommended):
```bash
python -m sglang.srt.checkpoint_engine.update \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
Using torchrun directly:
```bash
torchrun --nproc-per-node 8 \
--nnodes 2 \
--node-rank 1 \
--master-addr [IP] \
--master-port 29500 \
examples/checkpoint_engine/update.py \
--update-method broadcast \
--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
--inference-parallel-size 16
```
## Configuration Options
### SGLang Server Options
- `--load-format dummy`: Initialize the model with dummy weights so the real weights can be supplied later by the checkpoint engine; this also lets weight loading overlap with other startup work such as CUDA graph capture
- `--wait-for-initial-weights`: Keep the server in a waiting state until the checkpoint engine has delivered the initial weights
- `--host`: Host address the server binds to (needed in multi-node setups)
- `--dist-init-addr`: Address used to initialize the distributed group for multi-node tensor parallelism
### Checkpoint Engine Options
- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
- `--checkpoint-path`: Path to model checkpoint directory
- `--inference-parallel-size`: Number of inference parallel processes
- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
- `--save-metas-file`: File to save checkpoint metadata
- `--load-metas-file`: File to load checkpoint metadata from
- `--uds`: Unix domain socket path for communication
- `--weight-version`: Version identifier for weights
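When scripting launches across many nodes, it can help to generate the per-node `torchrun` invocation programmatically. A small helper mirroring the multi-node examples above (addresses, paths, and port defaults are placeholders, not project defaults):

```python
def torchrun_cmd(node_rank: int, nnodes: int, master_addr: str,
                 checkpoint_path: str, parallel_size: int,
                 nproc_per_node: int = 8, master_port: int = 29500) -> list[str]:
    """Build the torchrun argv for one checkpoint-engine node.

    Mirrors the multi-node examples in this document; values such as the
    master port are illustrative and should match your deployment.
    """
    return [
        "torchrun",
        "--nproc-per-node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--master-addr", master_addr,
        "--master-port", str(master_port),
        "examples/checkpoint_engine/update.py",
        "--update-method", "broadcast",
        "--checkpoint-path", checkpoint_path,
        "--inference-parallel-size", str(parallel_size),
    ]
```

Joining the list with spaces yields the same command shown in the TP=16 example for the corresponding node rank.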
## Performance Benefits
The checkpoint engine saves time in two main ways:
1. **Multi-node loading**: Each node reads only a portion of the weights from disk, so effective disk bandwidth scales with the number of participating nodes. In preliminary tests, loading DeepSeek-R1 on H20-3e with two nodes was about 20 seconds faster.
2. **Overlapped initialization**: With `--load-format dummy`, the disk-to-CPU transfer overlaps with CUDA graph capture and other startup work, saving additional time.
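The sharding idea behind the first point can be sketched as a round-robin assignment of checkpoint files to nodes; each node then reads only its own slice from disk, so reads proceed in parallel (file names below are illustrative):

```python
def shard_files(files: list[str], node_rank: int, nnodes: int) -> list[str]:
    """Round-robin assignment of checkpoint files to one node.

    Each node reads only its own shard from disk, so with N nodes the
    aggregate read bandwidth is roughly N times a single node's.
    """
    return [f for i, f in enumerate(files) if i % nnodes == node_rank]

# Illustrative checkpoint shards; real names depend on the model.
files = [f"model-{i:05d}.safetensors" for i in range(8)]
# With 2 nodes, each node reads 4 of the 8 shards.
node0 = shard_files(files, node_rank=0, nnodes=2)
node1 = shard_files(files, node_rank=1, nnodes=2)
```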
## Troubleshooting
- Ensure checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
- Verify network connectivity between nodes in multi-node setups
- Check that the checkpoint path contains valid model files
- Monitor logs for connection errors between SGLang server and checkpoint engine
- Use `--sleep-time` parameter to add delays if needed for debugging
## References
- [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)