FasterDFlash
/

Hanrui

Model card Files Files and versions

Hanrui / sglang /docs /basic_usage /qwen3_5.md

Lekr0's picture

Add files using upload-large-folder tool

a227c91 verified 28 days ago

|

history blame contribute delete

2.82 kB

	# Qwen 3.5 Usage

	Qwen 3.5 is Alibaba's latest generation LLM featuring a hybrid attention architecture, advanced MoE with shared experts, and native multimodal capabilities.

	Key architecture features:
	- Hybrid Attention: Gated Delta Networks (linear, O(n) complexity) combined with full attention every 4th layer for high associative recall
	- MoE with Shared Experts: Top-8 active out of 64 routed experts plus a dedicated shared expert for universal features
	- Multimodal: DeepStack Vision Transformer with Conv3d for native image and video understanding

	## Launch Qwen 3.5 with SGLang

	### Dense Model

	To serve `Qwen/Qwen3.5-397B-A17B` on 8 GPUs:

	```bash
	python3 -m sglang.launch_server \
	--model-path Qwen/Qwen3.5-397B-A17B \
	--tp 8 \
	--trust-remote-code
	```

	### AMD GPU (MI300X / MI325X / MI35X)

	On AMD Instinct GPUs, use the `triton` attention backend. Both the full attention layers and the Gated Delta Net (linear attention) layers use Triton-based kernels on ROCm:

	```bash
	SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
	--model-path Qwen/Qwen3.5-397B-A17B \
	--tp 8 \
	--attention-backend triton \
	--trust-remote-code
	```

	```{tip}
	Set `SGLANG_USE_AITER=1` to enable AMD's optimized aiter kernels for MoE and GEMM operations.
	```

	### Configuration Tips

	- `--attention-backend`: Use `triton` on AMD GPUs for Qwen 3.5. The hybrid attention architecture (Gated Delta Networks + full attention) works best with the Triton backend on ROCm. The linear attention (GDN) layers always use Triton kernels internally via the `GDNAttnBackend`.
	- `--watchdog-timeout`: Increase to `1200` or higher for this large model, as weight loading takes significant time.
	- `--model-loader-extra-config '{"enable_multithread_load": true}'`: Enables parallel weight loading for faster startup.

	### Reasoning and Tool Calling

	Qwen 3.5 supports reasoning and tool calling via the Qwen3 parsers:

	```bash
	python3 -m sglang.launch_server \
	--model-path Qwen/Qwen3.5-397B-A17B \
	--tp 8 \
	--trust-remote-code \
	--reasoning-parser qwen3 \
	--tool-call-parser qwen3_coder
	```

	## Accuracy Evaluation

	You can evaluate the model accuracy using `lm-eval`:

	```bash
	pip install lm-eval[api]

	lm_eval --model local-completions \
	--model_args '{"base_url": "http://localhost:8000/v1/completions", "model": "Qwen/Qwen3.5-397B-A17B", "num_concurrent": 256, "max_retries": 10, "max_gen_toks": 2048}' \
	--tasks gsm8k \
	--batch_size auto \
	--num_fewshot 5 \
	--trust_remote_code
	```

	## Additional Resources

	- [AMD Day 0 Support for Qwen 3.5 on AMD Instinct GPUs](https://www.amd.com/en/developer/resources/technical-articles/2026/day-0-support-for-qwen-3-5-on-amd-instinct-gpus.html)
	- [HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3.5-397B-A17B)