# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
SGLang is enabled and optimized for CPUs equipped with Intel® Advanced Matrix Extensions (Intel® AMX) instructions,
i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
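
You can quickly verify that your CPU exposes AMX before proceeding. This optional check is not part of the setup itself, and the exact flag names may vary with the kernel version:

```bash
# AMX-capable CPUs report flags such as amx_tile, amx_bf16 and amx_int8
grep -o 'amx[^ ]*' /proc/cpuinfo | sort -u
```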

## Optimized Model List

A number of popular LLMs have been optimized to run efficiently on CPU,
including notable open-source models such as the Llama series, the Qwen series,
and the DeepSeek series (e.g., DeepSeek-R1 and DeepSeek-V3.1-Terminus).

| Model Name | BF16 | W8A8_INT8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| DeepSeek-V3.1-Terminus | | [IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8](https://huggingface.co/IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8) | [deepseek-ai/DeepSeek-V3.1-Terminus](https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above
have been verified on 6th Gen Intel® Xeon® P-core platforms.

## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:latest -f xeon.Dockerfile .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:latest /bin/bash
```

### Install From Source

If you prefer to install SGLang in a bare metal environment,
the setup process is as follows:

Please install the required packages and libraries beforehand if
they are not already present on your system.
You can refer to the Ubuntu-based installation commands in
[the Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/xeon.Dockerfile#L11)
for guidance.
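
As a rough sketch, the prerequisites on an Ubuntu system look like the following; the package list here is illustrative and may be incomplete, and the Dockerfile linked above remains the authoritative reference:

```bash
# Illustrative prerequisite packages on Ubuntu; see the Dockerfile for the exact list
sudo apt-get update
sudo apt-get install -y build-essential git curl wget numactl libnuma-dev google-perftools
```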

1. Install the `uv` package manager, then create and activate a virtual environment:

```bash
# '/opt' is used as the example uv environment folder; change it as needed
cd /opt
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --python 3.12
source .venv/bin/activate
```

2. Create a config file to specify the installation index (a.k.a. `index-url`) for `torch`-related packages:

```bash
vim .venv/uv.toml
```

Press `a` to enter insert mode in `vim`, then paste the following content into the created file:
```toml
[[index]]
name = "torch"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "torchvision"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "torchaudio"
url = "https://download.pytorch.org/whl/cpu"

[[index]]
name = "triton"
url = "https://download.pytorch.org/whl/cpu"
```

Save the file (in `vim`, press `Esc` to exit insert mode, then type `:x` and press `Enter`),
and set it as the default `uv` config:

```bash
export UV_CONFIG_FILE=/opt/.venv/uv.toml
```

3. Clone the `sglang` source code and build the packages:

```bash
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Use the dedicated CPU toml file
cd python
cp pyproject_cpu.toml pyproject.toml
# Install SGLang dependent libs, and build the SGLang main package
uv pip install --upgrade pip setuptools
uv pip install .

# Build the CPU backend kernels
cd ../sgl-kernel
cp pyproject_cpu.toml pyproject.toml
uv pip install .
```
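
As an optional sanity check (not part of the official steps), you can confirm that the CPU wheels were pulled from the index configured earlier; a torch build installed from the CPU index typically reports a version with a `+cpu` suffix:

```bash
python -c "import torch; print(torch.__version__)"    # expected to end with '+cpu'
python -c "import sglang; print(sglang.__version__)"
```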

4. Set the required environment variables:

```bash
export SGLANG_USE_CPU_ENGINE=1

# Set 'LD_LIBRARY_PATH' and 'LD_PRELOAD' to ensure the libs can be loaded by sglang processes
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
export LD_PRELOAD=${LD_PRELOAD}:/opt/.venv/lib/libiomp5.so:${LD_LIBRARY_PATH}/libtcmalloc.so.4:${LD_LIBRARY_PATH}/libtbbmalloc.so.2
```

Notes:

- The environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.

- If you encounter code compilation issues during the `sgl-kernel` build process,
check your `gcc` and `g++` versions and upgrade them if they are outdated.
It is recommended to use `gcc-13` and `g++-13`, as they have been verified
in the official Docker container.

- The system library path is typically one of the following directories:
`~/.local/lib/`, `/usr/local/lib/`, `/usr/local/lib64/`, `/usr/lib/`, `/usr/lib64/`
and `/usr/lib/x86_64-linux-gnu/`. The example commands above use `/usr/lib/x86_64-linux-gnu`.
Please adjust the path according to your server configuration
(see the lookup sketch after these notes).

- It is recommended to add the following to your `~/.bashrc` file to
avoid setting these variables every time you open a new terminal:

  ```bash
  source /opt/.venv/bin/activate
  export SGLANG_USE_CPU_ENGINE=1
  export LD_LIBRARY_PATH=<YOUR-SYSTEM-LIBRARY-FOLDER>
  export LD_PRELOAD=<YOUR-LIBS-PATHS>
  ```
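
A quick way to locate these libraries on your system is sketched below; package names and install locations vary by distribution, so adjust as needed:

```bash
# libiomp5.so typically ships with the Python environment (here the /opt/.venv example),
# while tcmalloc/tbbmalloc come from system packages
find /opt/.venv -name "libiomp5.so" 2>/dev/null
ldconfig -p | grep -E "libtcmalloc|libtbbmalloc"
```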

## Launching the Serving Engine

An example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` specifies tensor parallelism with 6 ranks (TP6),
i.e., 6 TP ranks will be used during execution.
On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC).
You can usually get the SNC information (how many are available) from the operating system,
e.g., with the `lscpu` command (see the sketch after these notes).

If the specified TP size `n` is smaller than the total SNC count,
the system will automatically utilize the first `n` SNCs.
Note that `n` cannot exceed the total SNC count; doing so will result in an error.

To specify the cores to be used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server,
which has 43-43-42 cores on the 3 SNCs of a socket, set:

```bash
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```

Please be aware that with `SGLANG_CPU_OMP_THREADS_BIND` set,
the amount of memory available to each rank may not be determined in advance.
You may need to set a proper `--max-total-tokens` value to avoid out-of-memory errors.

3. To optimize decoding with `torch.compile`, add the flag `--enable-torch-compile`.
To specify the maximum batch size when using `torch.compile`, set the flag `--torch-compile-max-bs`.
For example, `--enable-torch-compile --torch-compile-max-bs 4` enables `torch.compile`
with a maximum batch size of 4. Currently the maximum applicable batch size
for optimizing with `torch.compile` is 16.

4. A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
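
As mentioned in note 2, a quick way to check how many SNCs (NUMA nodes) the operating system exposes is sketched below; the exact output depends on the platform and on whether SNC is enabled in the BIOS:

```bash
# Each NUMA node reported here corresponds to one TP rank on CPU
lscpu | grep -i "numa node"
numactl --hardware | grep -i "available"
```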

## Benchmarking with Requests

You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal. An example command would be:

```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed parameter descriptions are available via the command:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formatted using
[the OpenAI Completions API](https://docs.sglang.io/basic_usage/openai_api_completions.html)
and sent via the command line (e.g., using `curl`) or through your own scripts.
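
For instance, a minimal `curl` request to the OpenAI-compatible completions endpoint could look like the following, assuming the server runs on the default port 30000 and serves the BF16 Llama-3.2-3B model from the examples below; adjust the host, port, and model to match your deployment:

```bash
curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 32,
    "temperature": 0
  }'
```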

## Example Usage Commands

Large Language Models can range from fewer than 1 billion to several hundred billion parameters.
Dense models larger than 20B are expected to run on flagship 6th Gen Intel® Xeon® processors
with dual sockets and a total of 6 sub-NUMA clusters. Dense models of approximately 10B parameters or fewer,
or MoE (Mixture of Experts) models with fewer than 10B activated parameters, can run on more common
4th generation or newer Intel® Xeon® processors, or utilize a single socket of the flagship 6th Gen Intel® Xeon® processors.

### Example: Running DeepSeek-V3.1-Terminus

An example command to launch the W8A8_INT8 DeepSeek-V3.1-Terminus service on a Xeon® 6980P server:

```bash
python -m sglang.launch_server \
    --model IntervitensInc/DeepSeek-V3.1-Terminus-Channel-int8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 4 \
    --tp 6
```

Similarly, an example command to launch the FP8 DeepSeek-V3.1-Terminus service would be:

```bash
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-V3.1-Terminus \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 4 \
    --tp 6
```

Note: Please set `--torch-compile-max-bs` to the maximum desired batch size for your deployment,
which can be up to 16. The value `4` in the examples is illustrative.

### Example: Running Llama-3.2-3B

An example command to launch the Llama-3.2-3B service with BF16 precision:

```bash
python -m sglang.launch_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 16 \
    --tp 2
```

An example command to launch the W8A8_INT8 version of Llama-3.2-3B:

```bash
python -m sglang.launch_server \
    --model RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --enable-torch-compile \
    --torch-compile-max-bs 16 \
    --tp 2
```

Note: The `--torch-compile-max-bs` and `--tp` settings are examples that should be adjusted for your setup.
For instance, use `--tp 3` to utilize one socket with 3 sub-NUMA clusters on an Intel® Xeon® 6980P server.

Once the server has been launched, you can test it using the `bench_serving` command, or create
your own commands or scripts following [the benchmarking example](#benchmarking-with-requests).