# Llama 4 Usage


[Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLMs with industry-leading performance.


SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).


Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).


## Launch Llama 4 with SGLang


To serve Llama 4 models on 8xH100/H200 GPUs:


```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000
```
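
Once the server is running, it exposes an OpenAI-compatible API (on port 30000 by default, since `--port` is not set above). As a quick sanity check, you can send a chat completion request with `curl`; this is a minimal sketch, and the prompt and sampling settings are placeholders:

```bash
# Query the OpenAI-compatible chat completions endpoint (default port 30000).
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Give a one-sentence introduction of Llama 4."}],
    "max_tokens": 64
  }'
```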


### Configuration Tips


- **OOM Mitigation**: Adjust `--context-length` to avoid GPU out-of-memory errors. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, `--context-length` does not need to be set on 8\*H200. When the hybrid KV cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model.

- **Attention Backend Auto-Selection**: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware. You typically don't need to specify `--attention-backend` manually:
  - **Blackwell GPUs (B200/GB200)**: `trtllm_mha`
  - **Hopper GPUs (H100/H200)**: `fa3`
  - **AMD GPUs**: `aiter`
  - **Intel XPU**: `intel_xpu`
  - **Other platforms**: `triton` (fallback)

  To override the auto-selection, explicitly specify `--attention-backend` with one of the supported backends: `fa3`, `aiter`, `triton`, `trtllm_mha`, or `intel_xpu`.

- **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
- **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
- **Enable Hybrid-KVCache**: Set `--swa-full-tokens-ratio` to adjust the ratio of SWA-layer KV tokens to full-layer KV tokens (for Llama 4, the SWA layers are its local attention layers). Default: 0.8, range: 0-1. A launch example combining these flags is shown below.

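Putting these tips together, a combined launch command for the Scout model might look like the following sketch (tune `--context-length` for your hardware using the OOM guidance above):

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tp 8 \
  --context-length 1000000 \
  --chat-template llama-4 \
  --enable-multimodal
```

With `--enable-multimodal`, image inputs can typically be sent through the chat completions API using the OpenAI-style `image_url` content part; the image URL below is only a placeholder:

```bash
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 64
  }'
```
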
### EAGLE Speculative Decoding
**Description**: SGLang supports Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding).

**Usage**:
Add the arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk`, and `--speculative-num-draft-tokens` to enable this feature. For example:
```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --trust-remote-code \
  --tp 8 \
  --context-length 1000000
```

- **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* only recognizes conversations in chat format.

## Benchmarking Results


### Accuracy Test with `lm_eval`

The accuracy of both Llama 4 Scout and Llama 4 Maverick served with SGLang matches the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).

Benchmark results on the MMLU Pro dataset with 8\*H100:

|                    | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
|--------------------|--------------------------------|------------------------------------|
| Official Benchmark | 74.3                           | 80.5                               |
| SGLang             | 75.2                           | 80.7                               |

Commands:

```bash
# Llama-4-Scout-17B-16E-Instruct model
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536
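# Run lm_eval in a separate terminal once the server is ready.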
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0

# Llama-4-Maverick-17B-128E-Instruct
python -m sglang.launch_server \
  --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
  --port 30000 \
  --tp 8 \
  --mem-fraction-static 0.8 \
  --context-length 65536
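# Run lm_eval in a separate terminal once the server is ready.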
lm_eval --model local-chat-completions \
  --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 \
  --tasks mmlu_pro \
  --batch_size 128 \
  --apply_chat_template \
  --num_fewshot 0
```

Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).