
	# XPU

	This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on Intel GPUs. For more context on Intel GPU support within the PyTorch ecosystem, see [Getting Started on Intel GPU](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html).

	Specifically, SGLang is optimized for [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/242616/intel-arc-pro-b-series-graphics.html) and [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/ark/products/series/240391/intel-arc-b-series-graphics.html).

	## Optimized Model List

	The following LLMs have been optimized for Intel GPUs, and more are on the way:

	| Model Name | BF16 |
	|:---:|:---:|
	| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |
	| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |
	| Qwen2.5-1.5B | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |

	Note: The model identifiers listed in the table above
	have been verified on [Intel® Arc™ B580 Graphics](https://www.intel.com/content/www/us/en/products/sku/241598/intel-arc-b580-graphics/specifications.html).

	## Installation

	### Install From Source

	Currently, SGLang on XPU can only be installed from source. Before building, follow ["Getting Started on Intel GPU"](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) to install the XPU dependencies.

	```bash
	# Create and activate a conda environment
	conda create -n sgl-xpu python=3.12 -y
	conda activate sgl-xpu

	# Set PyTorch XPU as primary pip install channel to avoid installing the larger CUDA-enabled version and prevent potential runtime issues.
	pip3 install torch==2.9.0+xpu torchao torchvision torchaudio pytorch-triton-xpu==3.5.0 --index-url https://download.pytorch.org/whl/xpu
	pip3 install xgrammar --no-deps # xgrammar will introduce CUDA-enabled triton which might conflict with XPU

	# Clone the SGLang code
	git clone https://github.com/sgl-project/sglang.git
	cd sglang
	git checkout <YOUR-DESIRED-VERSION>

	# Use dedicated toml file
	cd python
	cp pyproject_xpu.toml pyproject.toml
	# Install SGLang dependent libs, and build SGLang main package
	pip install --upgrade pip setuptools
	pip install -v .
	```
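Before building further on the environment, it can help to confirm that the installed PyTorch build actually sees an Intel GPU. The following is a minimal sketch, not part of the official instructions: `check_xpu` is a hypothetical helper name, and it only relies on the `torch.xpu` module that ships with XPU-enabled PyTorch builds.

```bash
# Hypothetical helper (not part of SGLang): report whether PyTorch can see an XPU device.
# Returns non-zero when torch is missing or no XPU device is visible.
check_xpu() {
    python3 - <<'EOF'
import sys

try:
    import torch
except ImportError:
    print("torch is not installed in this environment")
    sys.exit(1)

# Older or CPU/CUDA-only builds may lack a usable torch.xpu module.
xpu = getattr(torch, "xpu", None)
if xpu is None or not xpu.is_available():
    print(f"torch {torch.__version__}: no XPU device visible")
    sys.exit(1)

print(f"torch {torch.__version__}: {xpu.device_count()} XPU device(s), "
      f"first is {xpu.get_device_name(0)}")
EOF
}
```

Run `check_xpu` inside the activated `sgl-xpu` environment. If it reports that no XPU device is visible, the driver or PyTorch XPU wheel is likely the place to look before debugging SGLang itself.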

	### Install Using Docker

	The Docker image for XPU is under active development. Please stay tuned.

	## Launch of the Serving Engine

	Example command to launch SGLang serving:

	```bash
	# --tp 2                        : run across multiple GPUs (here, two)
	# --attention-backend intel_xpu : use the Intel-optimized XPU attention backend
	# --page-size <PAGE_SIZE>       : the intel_xpu attention backend supports 32, 64, or 128
	python -m sglang.launch_server \
	    --model <MODEL_ID_OR_PATH> \
	    --trust-remote-code \
	    --disable-overlap-schedule \
	    --device xpu \
	    --host 0.0.0.0 \
	    --tp 2 \
	    --attention-backend intel_xpu \
	    --page-size <PAGE_SIZE>
	```
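Because the `intel_xpu` attention backend only accepts page sizes of 32, 64, or 128, a wrapper script may want to reject other values before starting the server. The sketch below shows one way to do that; `validate_page_size` is a hypothetical helper name and is not provided by SGLang.

```bash
# Hypothetical helper: succeed only for page sizes the intel_xpu backend supports.
validate_page_size() {
    case "$1" in
        32|64|128)
            return 0
            ;;
        *)
            echo "unsupported page size: '$1' (expected 32, 64, or 128)" >&2
            return 1
            ;;
    esac
}
```

In a launch script this could guard the server start, for example `validate_page_size "$PAGE_SIZE" && python -m sglang.launch_server ... --page-size "$PAGE_SIZE"`, where `PAGE_SIZE` is a variable you define yourself.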

	## Benchmarking with Requests

	With the server running, you can benchmark performance with the `bench_serving` script.
	Run the command in another terminal.

	```bash
	python -m sglang.bench_serving \
	--dataset-name random \
	--random-input-len 1024 \
	--random-output-len 1024 \
	--num-prompts 1 \
	--request-rate inf \
	--random-range-ratio 1.0
	```
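The benchmark needs a server that has finished loading the model, so scripted runs may want to wait for readiness first. The sketch below polls a health endpoint until it responds. It rests on assumptions not stated in this document: `wait_for_server` is a hypothetical helper, the `/health` route and port 30000 reflect SGLang's usual defaults, and both should be adjusted if your launch settings differ.

```bash
# Hypothetical helper: poll <base_url>/health once per second until it answers.
# Usage: wait_for_server <base_url> [max_attempts]
wait_for_server() {
    base_url="$1"
    attempts="${2:-60}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f makes curl fail on HTTP errors; --max-time bounds each probe.
        if curl -sf --max-time 2 "$base_url/health" >/dev/null 2>&1; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep 1
        fi
        i=$((i + 1))
    done
    echo "server at $base_url did not become ready after $attempts attempts" >&2
    return 1
}
```

For example, `wait_for_server http://localhost:30000 120 && python -m sglang.bench_serving --dataset-name random ...` would start the benchmark only once the server answers.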

	Detailed explanations of the parameters are available via:

	```bash
	python -m sglang.bench_serving -h
	```

	Additionally, requests can be formed with the
	[OpenAI Completions API](https://docs.sglang.io/basic_usage/openai_api_completions.html)
	and sent from the command line (e.g. using `curl`) or from your own script.
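As an illustration of the `curl` route, the sketch below builds a minimal Completions request and posts it to a running server. Several details are assumptions rather than facts from this document: the helper names, the `SGLANG_BASE_URL` variable, and the default base URL `http://localhost:30000` (adjust it to match your `--host` and `--port` settings). The payload builder does no JSON escaping, so keep the prompt free of double quotes and backslashes.

```bash
# Hypothetical helper: build a minimal Completions request body.
# Usage: completion_payload <model> <prompt> <max_tokens>
completion_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d, "temperature": 0}' \
        "$1" "$2" "$3"
}

# Hypothetical helper: POST the payload to the OpenAI-compatible completions route.
# SGLANG_BASE_URL is an illustrative variable, not one SGLang itself reads.
send_completion() {
    base_url="${SGLANG_BASE_URL:-http://localhost:30000}"
    curl -sS "$base_url/v1/completions" \
        -H "Content-Type: application/json" \
        -d "$(completion_payload "$1" "$2" "$3")"
}
```

With the server up, `send_completion <MODEL_ID_OR_PATH> "The capital of France is" 16` should return a JSON response containing the generated text.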