| # Multi-Node Deployment |
|
|
| ## Llama 3.1 405B |
|
|
| **Run 405B (fp16) on Two Nodes** |
|
|
| ```bash |
# Replace 172.16.4.52:20000 with the IP address of the first node and an open port.

# Run on the first node (node rank 0):
python3 -m sglang.launch_server \
 --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
 --tp 16 \
 --dist-init-addr 172.16.4.52:20000 \
 --nnodes 2 \
 --node-rank 0

# Run on the second node (node rank 1):
python3 -m sglang.launch_server \
 --model-path meta-llama/Meta-Llama-3.1-405B-Instruct \
 --tp 16 \
 --dist-init-addr 172.16.4.52:20000 \
 --nnodes 2 \
 --node-rank 1
| ``` |
|
|
Note that Llama 3.1 405B (fp8) can also be launched on a single node.
|
|
| ```bash |
| python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8 |
| ``` |
|
|
| ## DeepSeek V3/R1 |
|
|
| Please refer to [DeepSeek documents for reference](https://docs.sglang.io/basic_usage/deepseek.html#running-examples-on-multi-node). |
|
|
| ## Multi-Node Inference on SLURM |
|
|
This example shows how to serve an SGLang server across multiple nodes with SLURM. Submit the following job script to your SLURM cluster.
|
|
| ``` |
| #!/bin/bash -l |
| |
| #SBATCH -o SLURM_Logs/%x_%j_master.out |
| #SBATCH -e SLURM_Logs/%x_%j_master.err |
| #SBATCH -D ./ |
| #SBATCH -J Llama-405B-Online-Inference-TP16-SGL |
| |
| #SBATCH --nodes=2 |
| #SBATCH --ntasks=2 |
| #SBATCH --ntasks-per-node=1 # Ensure 1 task per node |
| #SBATCH --cpus-per-task=18 |
| #SBATCH --mem=224GB |
| #SBATCH --partition="lmsys.org" |
| #SBATCH --gres=gpu:8 |
| #SBATCH --time=12:00:00 |
| |
| echo "[INFO] Activating environment on node $SLURM_PROCID" |
| if ! source ENV_FOLDER/bin/activate; then |
| echo "[ERROR] Failed to activate environment" >&2 |
| exit 1 |
| fi |
| |
| # Define parameters |
| model=MODEL_PATH |
| tp_size=16 |
| |
| echo "[INFO] Running inference" |
| echo "[INFO] Model: $model" |
| echo "[INFO] TP Size: $tp_size" |
| |
| # Set NCCL initialization address using the hostname of the head node |
| HEAD_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n 1) |
| NCCL_INIT_ADDR="${HEAD_NODE}:8000" |
| echo "[INFO] NCCL_INIT_ADDR: $NCCL_INIT_ADDR" |
| |
| # Launch the model server on each node using SLURM |
| srun --ntasks=2 --nodes=2 --output="SLURM_Logs/%x_%j_node$SLURM_NODEID.out" \ |
| --error="SLURM_Logs/%x_%j_node$SLURM_NODEID.err" \ |
| python3 -m sglang.launch_server \ |
| --model-path "$model" \ |
| --grammar-backend "xgrammar" \ |
| --tp "$tp_size" \ |
| --dist-init-addr "$NCCL_INIT_ADDR" \ |
| --nnodes 2 \ |
| --node-rank "$SLURM_NODEID" & |
| |
# Wait for the SGLang HTTP server (default port 30000) to accept connections
| while ! nc -z "$HEAD_NODE" 30000; do |
| sleep 1 |
| echo "[INFO] Waiting for $HEAD_NODE:30000 to accept connections" |
| done |
| |
| echo "[INFO] $HEAD_NODE:30000 is ready to accept connections" |
| |
| # Keep the script running until the SLURM job times out |
| wait |
| ``` |
|
|
Then, you can test the server by sending requests, following the [OpenAI-compatible API documentation](https://docs.sglang.io/basic_usage/openai_api_completions.html).
|
|
Thanks to [aflah02](https://github.com/aflah02) for providing this example, based on his [blog post](https://aflah02.substack.com/p/multi-node-llm-inference-with-sglang).
|
|