	# Checkpoint Engine Integration

	The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.

	## Overview

	The checkpoint engine integration allows SGLang to:
	- Load model weights in parallel using multiple processes
	- Distribute weight loading across multiple nodes to increase effective disk bandwidth
	- Overlap weight loading with other initialization tasks like CUDA graph capture
	- Support both single-node and multi-node deployments

	## Installation

	First, install the checkpoint engine package:

	```bash
	pip install 'checkpoint-engine[p2p]'
	```

	## Architecture

	The system consists of two main components:

	1. SGLang Server: Runs with the `--wait-for-initial-weights` flag so that it waits for weights before becoming ready
	2. Checkpoint Engine Workers: Separate processes (managed by torchrun) that load and distribute model weights

	The checkpoint engine uses a parameter server architecture with support for:
	- Broadcast mode: Weights are broadcast from loading processes to inference processes
	- P2P mode: Direct peer-to-peer weight transfer between processes
	- All mode: Combination of both broadcast and P2P methods

	## Usage Examples

	### Single Node Setup

	Terminal 1 - Launch SGLang Server:
	```bash
	python -m sglang.launch_server \
	--model-path Qwen/Qwen3-8B \
	--tp 8 \
	--load-format dummy \
	--wait-for-initial-weights
	```

	Terminal 2 - Run Checkpoint Engine:

	Using sglang entrypoint (recommended):
	```bash
	python -m sglang.srt.checkpoint_engine.update \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	Using torchrun directly:
	```bash
	torchrun --nproc-per-node 8 \
	examples/checkpoint_engine/update.py \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	### Multi-Node Setup (2 Nodes)

	Node 0:

	Launch SGLang server:
	```bash
	python -m sglang.launch_server \
	--model-path Qwen/Qwen3-8B \
	--tp 8 \
	--load-format dummy \
	--wait-for-initial-weights \
	--host [IP]
	```

	Run checkpoint engine:

	Using sglang entrypoint (recommended):
	```bash
	python -m sglang.srt.checkpoint_engine.update \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	Using torchrun directly:
	```bash
	torchrun --nproc-per-node 8 \
	--nnodes 2 \
	--node-rank 0 \
	--master-addr [IP] \
	--master-port 29500 \
	examples/checkpoint_engine/update.py \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	Node 1:

	Launch SGLang server:
	```bash
	python -m sglang.launch_server \
	--model-path Qwen/Qwen3-8B \
	--tp 8 \
	--load-format dummy \
	--wait-for-initial-weights \
	--host [IP]
	```

	Run checkpoint engine:

	Using sglang entrypoint (recommended):
	```bash
	python -m sglang.srt.checkpoint_engine.update \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	Using torchrun directly:
	```bash
	torchrun --nproc-per-node 8 \
	--nnodes 2 \
	--node-rank 1 \
	--master-addr [IP] \
	--master-port 29500 \
	examples/checkpoint_engine/update.py \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 8
	```

	### Multi-Node Setup with Tensor Parallelism (TP=16)

	Node 0:

	Launch SGLang server:
	```bash
	python -m sglang.launch_server \
	--model-path Qwen/Qwen3-8B \
	--tp 16 \
	--load-format dummy \
	--wait-for-initial-weights \
	--host [IP] \
	--dist-init-addr [IP]:9120 \
	--nnodes 2 \
	--node-rank 0
	```

	Run checkpoint engine:

	Using sglang entrypoint (recommended):
	```bash
	python -m sglang.srt.checkpoint_engine.update \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 16
	```

	Using torchrun directly:
	```bash
	torchrun --nproc-per-node 8 \
	--nnodes 2 \
	--node-rank 0 \
	--master-addr [IP] \
	--master-port 29500 \
	examples/checkpoint_engine/update.py \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 16
	```

	Node 1:

	Launch SGLang server:
	```bash
	python -m sglang.launch_server \
	--model-path Qwen/Qwen3-8B \
	--tp 16 \
	--load-format dummy \
	--wait-for-initial-weights \
	--host [IP] \
	--dist-init-addr [IP]:9120 \
	--nnodes 2 \
	--node-rank 1
	```

	Run checkpoint engine:

	Using sglang entrypoint (recommended):
	```bash
	python -m sglang.srt.checkpoint_engine.update \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 16
	```

	Using torchrun directly:
	```bash
	torchrun --nproc-per-node 8 \
	--nnodes 2 \
	--node-rank 1 \
	--master-addr [IP] \
	--master-port 29500 \
	examples/checkpoint_engine/update.py \
	--update-method broadcast \
	--checkpoint-path /path/to/Qwen/Qwen3-8B/ \
	--inference-parallel-size 16
	```
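The per-node `torchrun` commands above differ only in `--node-rank`, so they can be generated from one template to avoid copy-paste drift between nodes. The following is a minimal sketch and not part of SGLang: the function name `ce_torchrun_cmd` and its argument order are hypothetical, and it only prints the command (a dry run) using the flags already shown in this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the torchrun command for one node (dry run only).
# Usage: ce_torchrun_cmd NODE_RANK NNODES MASTER_ADDR INFERENCE_PARALLEL_SIZE CHECKPOINT_PATH
ce_torchrun_cmd() {
  local node_rank="$1" nnodes="$2" master_addr="$3" ips="$4" ckpt="$5"
  # echo joins its arguments with single spaces, giving one command line.
  echo "torchrun --nproc-per-node 8" \
    "--nnodes ${nnodes}" \
    "--node-rank ${node_rank}" \
    "--master-addr ${master_addr}" \
    "--master-port 29500" \
    "examples/checkpoint_engine/update.py" \
    "--update-method broadcast" \
    "--checkpoint-path ${ckpt}" \
    "--inference-parallel-size ${ips}"
}

# Print the commands for both nodes of the TP=16 example.
for rank in 0 1; do
  ce_torchrun_cmd "$rank" 2 "[IP]" 16 /path/to/Qwen/Qwen3-8B/
done
```

Replace `[IP]` with the address of node 0 before running the printed command on each node.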

	## Configuration Options

	### SGLang Server Options

	- `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
	- `--wait-for-initial-weights`: Wait for checkpoint engine to provide weights before becoming ready
	- `--host`: Host address for multi-node setups
	- `--dist-init-addr`: Distributed initialization address for tensor parallelism

	### Checkpoint Engine Options

	- `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
	- `--checkpoint-path`: Path to model checkpoint directory
	- `--inference-parallel-size`: Number of inference parallel processes
	- `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
	- `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
	- `--save-metas-file`: File to save checkpoint metadata
	- `--load-metas-file`: File to load checkpoint metadata from
	- `--uds`: Unix domain socket path for communication
	- `--weight-version`: Version identifier for weights
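The `--endpoint` value has to name the address the SGLang server actually listens on. The sketch below keeps both sides on the same port through one variable. `--port` is the SGLang server's port flag; the variable names and the port number 30000 are illustrative choices, not requirements. The commands are assembled and printed rather than executed, so the output can be copied into the two terminals.

```bash
#!/usr/bin/env bash
# Illustrative values: choose any free port; both commands must agree on it.
PORT=30000
CKPT=/path/to/Qwen/Qwen3-8B/

# Server side: listen on ${PORT} and wait for weights.
server_cmd="python -m sglang.launch_server --model-path Qwen/Qwen3-8B --tp 8 --load-format dummy --wait-for-initial-weights --port ${PORT}"

# Checkpoint engine side: point --endpoint at the same port.
update_cmd="python -m sglang.srt.checkpoint_engine.update --update-method broadcast --checkpoint-path ${CKPT} --inference-parallel-size 8 --endpoint http://localhost:${PORT}"

echo "Terminal 1: ${server_cmd}"
echo "Terminal 2: ${update_cmd}"
```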

	## Performance Benefits

	The checkpoint engine saves loading time in two main ways:

	1. Multi-node Loading: Each node loads only a portion of the weights from disk, which increases the effective disk bandwidth. The more nodes participate, the greater the speedup. In preliminary tests, loading DeepSeek-R1 on H20-3e with two nodes was about 20 seconds faster.

	2. Single Process Optimization: Using `--load-format dummy` lets the disk-to-CPU transfer overlap with CUDA graph capture and other initialization tasks, which saves additional time.
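As a rough illustration of the multi-node effect, assume the checkpoint is split evenly across nodes and every node reads from its own disk at the same rate. Both assumptions, the helper name `per_node_read_seconds`, and the numbers below are hypothetical; they are not measurements from SGLang.

```bash
#!/usr/bin/env bash
# Hypothetical estimate: seconds each node spends reading its share from disk.
# Usage: per_node_read_seconds CHECKPOINT_GB NODES DISK_GB_PER_SEC
per_node_read_seconds() {
  local total_gb="$1" nodes="$2" gb_per_sec="$3"
  # Integer arithmetic: each node reads total/nodes GB at gb_per_sec.
  echo $(( total_gb / nodes / gb_per_sec ))
}

# Example with made-up numbers: a 600 GB checkpoint read at 5 GB/s per node.
echo "1 node:  $(per_node_read_seconds 600 1 5) s"
echo "2 nodes: $(per_node_read_seconds 600 2 5) s"
```

With these made-up inputs the estimate halves from 120 s to 60 s when a second node is added. Real gains depend on the storage and network setup.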

	## Troubleshooting

	- Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
	- Verify network connectivity between nodes in multi-node setups
	- Check that the checkpoint path contains valid model files
	- Monitor logs for connection errors between SGLang server and checkpoint engine
	- Use the `--sleep-time` parameter of the checkpoint engine script to add delays when debugging
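Two of the checks above can be scripted before launching anything. The sketch below is a hypothetical pre-flight helper, not an SGLang tool: the function names are made up, and it assumes the checkpoint directory stores its weights as `*.safetensors` files.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight checks; each function returns non-zero on failure.

# Succeeds if pip reports the checkpoint-engine package as installed.
check_package_installed() {
  python -m pip show checkpoint-engine >/dev/null 2>&1
}

# Assumes weights are stored as *.safetensors files in the directory.
check_checkpoint_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir" >&2
    return 1
  fi
  # ls exits non-zero when the glob matches nothing.
  if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
    echo "no .safetensors files in: $dir" >&2
    return 1
  fi
  echo "checkpoint directory looks usable: $dir"
}

# Example usage (uncomment and adjust the path):
# check_package_installed || echo "run: pip install 'checkpoint-engine[p2p]'"
# check_checkpoint_dir /path/to/Qwen/Qwen3-8B/
```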

	## References

	- [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)