Transformers documentation


SGLang

SGLang is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.

Set model_impl="transformers" to load a model with the Transformers modeling backend.

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0]["text"])

Pass --model-impl transformers to the sglang.launch_server command for online serving.

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --host 0.0.0.0 \
  --port 30000
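Once the server is up, you can send it requests. SGLang exposes a native /generate endpoint (as well as OpenAI-compatible routes); a minimal query, assuming the host and port above, looks like this:

```shell
# Query the running SGLang server's native /generate endpoint
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 20}
  }'
```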

Setting model_impl="transformers" tells SGLang to skip its native model matching and use the TransformersModel backend instead. PretrainedConfig.from_pretrained() loads the config and AutoModel.from_config() resolves the model class.
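The selection logic described above can be sketched roughly as follows; the registry and function names here are illustrative stand-ins, not SGLang's actual internals.

```python
# Illustrative sketch of backend selection, not SGLang's actual code.
NATIVE_MODELS = {"LlamaForCausalLM"}  # hypothetical registry of SGLang-native models


def resolve_backend(architecture: str, model_impl: str = "auto") -> str:
    """Pick a native implementation unless the user forces the Transformers backend."""
    if model_impl == "transformers":
        return "TransformersModel"  # skip native matching entirely
    if architecture in NATIVE_MODELS:
        return architecture  # use SGLang's native model
    return "TransformersModel"  # fall back to the Transformers backend


print(resolve_backend("LlamaForCausalLM"))                  # native match
print(resolve_backend("LlamaForCausalLM", "transformers"))  # forced fallback
print(resolve_backend("SomeBrandNewModel"))                 # automatic fallback
```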

During loading, _attn_implementation is set to "sglang", which routes attention calls through SGLang: RadixAttention kernels replace the standard attention layers, and SGLang's parallel linear classes replace the linear layers to support tensor parallelism. As a result, the model benefits from all of SGLang's optimizations.
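The routing works by looking up an attention implementation by name at call time. The sketch below shows that dispatch pattern with a locally defined registry; the registry and the placeholder functions are illustrative, not the real kernels.

```python
# Illustrative sketch of name-based attention dispatch; the registry here is
# defined locally and only mimics how a configured _attn_implementation string
# selects which attention function actually runs.
ATTENTION_IMPLEMENTATIONS = {}


def register_attention(name):
    def wrap(fn):
        ATTENTION_IMPLEMENTATIONS[name] = fn
        return fn
    return wrap


@register_attention("eager")
def eager_attention(q, k, v):
    return "eager attention output"  # placeholder for the default attention math


@register_attention("sglang")
def sglang_attention(q, k, v):
    return "RadixAttention output"  # placeholder for SGLang's RadixAttention kernel


def attention_forward(attn_impl, q, k, v):
    # The model resolves the configured implementation by name at runtime.
    return ATTENTION_IMPLEMENTATIONS[attn_impl](q, k, v)


print(attention_forward("sglang", None, None, None))
```

Because dispatch happens through the string key, swapping "eager" for "sglang" changes which kernel runs without touching the model code itself.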

Compatible models require _supports_attention_backend=True so SGLang can control attention execution. See the Building a compatible model backend for inference guide for details.

Finally, the load_weights function populates the instantiated model with the checkpoint weights.
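A simplified sketch of what such a weight-loading step does, using plain dictionaries in place of real tensors; the helper below is illustrative and omits the name remapping, sharding, and weight fusion a real implementation performs.

```python
def load_weights(model_params: dict, checkpoint: list) -> set:
    """Copy checkpoint tensors into matching model parameters by name.

    Illustrative stand-in for an engine's load_weights: real implementations
    also remap parameter names, shard tensors for tensor parallelism, and
    fuse weights where kernels expect packed layouts.
    """
    loaded = set()
    for name, tensor in checkpoint:
        if name in model_params:
            model_params[name] = tensor  # in a real model: param.data.copy_(tensor)
            loaded.add(name)
    return loaded


params = {"embed_tokens.weight": None, "lm_head.weight": None}
loaded = load_weights(params, [("embed_tokens.weight", [1, 2, 3])])
print(sorted(loaded))  # → ['embed_tokens.weight']
```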
