Transformers documentation


SGLang

SGLang is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.

Set model_impl="transformers" to load a model with the Transformers modeling backend.

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0]["text"])

Pass --model-impl transformers to the sglang.launch_server command for online serving.

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --host 0.0.0.0 \
  --port 30000
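Once the server is up, you can send it requests. SGLang exposes a native /generate endpoint (as well as OpenAI-compatible routes); a minimal query, assuming the host and port above, looks like this:

```shell
# Query the running SGLang server's native /generate endpoint
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 20}
  }'
```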

Setting model_impl="transformers" tells SGLang to skip its native model matching and use the TransformersModel backend instead. PretrainedConfig.from_pretrained() loads the config and AutoModel.from_config() resolves the model class.
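The selection logic described above can be sketched roughly as follows; the registry and function names here are illustrative stand-ins, not SGLang's actual internals.

```python
# Illustrative sketch of backend selection, not SGLang's actual code.
NATIVE_MODELS = {"LlamaForCausalLM"}  # hypothetical registry of SGLang-native models


def resolve_backend(architecture: str, model_impl: str = "auto") -> str:
    """Pick a native implementation unless the user forces the Transformers backend."""
    if model_impl == "transformers":
        return "TransformersModel"  # skip native matching entirely
    if architecture in NATIVE_MODELS:
        return architecture  # use SGLang's native model
    return "TransformersModel"  # fall back to the Transformers backend


print(resolve_backend("LlamaForCausalLM"))                  # native match
print(resolve_backend("LlamaForCausalLM", "transformers"))  # forced fallback
print(resolve_backend("SomeBrandNewModel"))                 # automatic fallback
```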

During loading, _attn_implementation is set to "sglang", which routes attention calls through SGLang: RadixAttention kernels replace the standard attention layers, and SGLang's parallel linear classes replace the linear layers to support tensor parallelism. As a result, the model benefits from all of SGLang's optimizations.
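The routing works by looking up an attention implementation by name at call time. The sketch below shows that dispatch pattern with a locally defined registry; the registry and the placeholder functions are illustrative, not the real kernels.

```python
# Illustrative sketch of name-based attention dispatch; the registry here is
# defined locally and only mimics how a configured _attn_implementation string
# selects which attention function actually runs.
ATTENTION_IMPLEMENTATIONS = {}


def register_attention(name):
    def wrap(fn):
        ATTENTION_IMPLEMENTATIONS[name] = fn
        return fn
    return wrap


@register_attention("eager")
def eager_attention(q, k, v):
    return "eager attention output"  # placeholder for the default attention math


@register_attention("sglang")
def sglang_attention(q, k, v):
    return "RadixAttention output"  # placeholder for SGLang's RadixAttention kernel


def attention_forward(attn_impl, q, k, v):
    # The model resolves the configured implementation by name at runtime.
    return ATTENTION_IMPLEMENTATIONS[attn_impl](q, k, v)


print(attention_forward("sglang", None, None, None))
```

Because dispatch happens through the string key, swapping "eager" for "sglang" changes which kernel runs without touching the model code itself.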

Compatible models require _supports_attention_backend=True so SGLang can control attention execution. See the Building a compatible model backend for inference guide for details.

Finally, the load_weights function populates the instantiated model with the checkpoint weights.
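A simplified sketch of what such a weight-loading step does, using plain dictionaries in place of real tensors; the helper below is illustrative and omits the name remapping, sharding, and weight fusion a real implementation performs.

```python
def load_weights(model_params: dict, checkpoint: list) -> set:
    """Copy checkpoint tensors into matching model parameters by name.

    Illustrative stand-in for an engine's load_weights: real implementations
    also remap parameter names, shard tensors for tensor parallelism, and
    fuse weights where kernels expect packed layouts.
    """
    loaded = set()
    for name, tensor in checkpoint:
        if name in model_params:
            model_params[name] = tensor  # in a real model: param.data.copy_(tensor)
            loaded.add(name)
    return loaded


params = {"embed_tokens.weight": None, "lm_head.weight": None}
loaded = load_weights(params, [("embed_tokens.weight", [1, 2, 3])])
print(sorted(loaded))  # → ['embed_tokens.weight']
```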
