Instructions to use Qwen/Qwen3-Coder-480B-A35B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3-Coder-480B-A35B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
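
The same OpenAI-compatible request can be made from Python. This is a minimal sketch using only the standard library. The helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and the default base URL assumes the server started above is listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "Qwen/Qwen3-Coder-480B-A35B-Instruct"


def build_chat_request(prompt, base_url="http://localhost:8000/v1", model=MODEL_ID):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` returns the reply text. Passing `base_url="http://localhost:30000/v1"` should point the same helper at a server listening on port 30000.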

Use Docker

docker model run hf.co/Qwen/Qwen3-Coder-480B-A35B-Instruct

SGLang

How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3-Coder-480B-A35B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3-Coder-480B-A35B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
```

Any chance of a smaller coding model in the 30-70b range?

by smcleod - opened Jul 23, 2025

Discussion

smcleod

Jul 23, 2025

Howdy fine Qwen folks, congratulations on the release!

I'm wondering if you have any plans to release a Qwen 3 coder model in a size range that could be run locally / on home servers?

Something in the 30-70b range would be great, perhaps even a (40b-70b)-A(4-8)b MoE?

paulorodriguesjr

Jul 23, 2025

30B A3B Coder or A6B would be amazing!

solarkyle

Jul 23, 2025

> Howdy fine Qwen folks, congratulations on the release!
>
> I'm wondering if you have any plans to release a Qwen 3 coder model in a size range that could be run locally / on home servers?
>
> Something in the 30-70b range would be great, perhaps even a (40b-70b)-A(4-8)b MoE?

This would be perfect!

mirekphd

Jul 24, 2025 • edited Jul 24, 2025

> Something in the 30-70b range would be great, perhaps even a (40b-70b)-A(4-8)b MoE

Let's not forget the enormous context window of this new full-size Coder model (256k tokens native, 1M extended). I think anything larger than perhaps 50B (or even less?) would have problems fitting into a single A100 or H100 card. Qwen2.5 72B in 8-bit quants definitely could not fit into a single 80GB A100 card with its 131K context: 8k tokens was the maximum that fit, so you had to offload 30 of the 80 model layers to the CPU to fit the 131K window, which slowed inference by an order of magnitude.

I also think that using MoE with smaller models is not a good idea: expert size matters, so you are better off with a dense variant until a single expert in a MoE version can match the size of the dense model.
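
For a rough sense of the single-card fit discussed in this post, here is a back-of-the-envelope sketch of weight memory alone. It assumes the weights take exactly `bits_per_weight` bits each and treats the card as a nominal 80 GB. Real quantised files carry extra scale data, and usable GPU memory is somewhat lower, so treat the numbers as estimates only.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight footprint in decimal GB (ignores quantisation scales)."""
    return n_params * bits_per_weight / 8 / 1e9


CARD_GB = 80  # nominal capacity of one A100/H100 80GB card (assumption)

# 72B parameters at 8 bits per weight
weights = weight_gb(72e9, 8)
headroom = CARD_GB - weights
print(f"72B @ 8-bit: {weights:.0f} GB weights, {headroom:.0f} GB left for KV cache and activations")

# The 30-70B range requested in this thread, at 4 and 8 bits per weight
for params in (30e9, 50e9, 70e9):
    for bits in (4, 8):
        print(f"{params / 1e9:.0f}B @ {bits}-bit: {weight_gb(params, bits):.0f} GB")
```

Under these assumptions a 72B model at 8 bits leaves only about 8 GB of an 80 GB card for everything else, which is in line with only a short context fitting without offloading.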

smcleod

Jul 25, 2025 • edited Jul 25, 2025

> Let's not forget the enormous context window of this new full-size Coder model .. think anything larger than perhaps 50B (or even less?) would have problems fitting into a single A100

With K/V cache quantisation this is less of an issue. You can now run very large context windows in llama.cpp with the KV cache set to Q8_0, with practically no quality drop and at a fraction of the overhead of full fp16/32.

You can easily run ~30B models on two RTX 3090s (48GB) with a 256k context.
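
To put rough numbers on the KV cache point, here is a sketch of the usual size estimate: two tensors (K and V) per layer, each holding `n_kv_heads * head_dim` values per token. The architecture figures below (80 layers, 8 KV heads, head dimension 128) are assumptions for a Qwen2.5-72B-style model and should be checked against the model's `config.json`. The Q8_0 cost assumes llama.cpp's block layout of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 values.

```python
GIB = 1024 ** 3

F16_BYTES = 2.0       # bytes per cached value at fp16
Q8_0_BYTES = 34 / 32  # bytes per cached value at Q8_0 (32 int8 values + fp16 scale)


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value):
    """Estimated KV cache size: K and V tensors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed Qwen2.5-72B-style attention layout (verify against config.json)
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)
context = 131072  # the 131K window mentioned earlier in the thread

f16_gib = kv_cache_bytes(**cfg, n_tokens=context, bytes_per_value=F16_BYTES) / GIB
q8_gib = kv_cache_bytes(**cfg, n_tokens=context, bytes_per_value=Q8_0_BYTES) / GIB

print(f"fp16 KV cache: {f16_gib:.2f} GiB")  # 40.00 GiB
print(f"Q8_0 KV cache: {q8_gib:.2f} GiB")   # 21.25 GiB
```

Under these assumptions the cache for a full 131K window drops from about 40 GiB to about 21 GiB, which is why cache quantisation changes what fits on a given card. Models with fewer layers or fewer KV heads scale down proportionally.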
