Instructions to use Qwen/Qwen3-Coder-480B-A35B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Coder-480B-A35B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Inference
- HuggingChat
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3-Coder-480B-A35B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
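The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are hypothetical, and the response shape in the commented usage assumes the server returns the usual OpenAI-style chat completion body.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Port 8000 matches the vLLM example; the SGLang example listens on port 30000.
request = build_chat_request(
    "http://localhost:8000",
    "Qwen/Qwen3-Coder-480B-A35B-Instruct",
    "What is the capital of France?",
)
# reply = send(request)  # requires a running server
# print(reply["choices"][0]["message"]["content"])  # assumes an OpenAI-style response
```

Building the request separately from sending it makes the payload easy to inspect before a server is available.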
- SGLang
How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3-Coder-480B-A35B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3-Coder-480B-A35B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use Qwen/Qwen3-Coder-480B-A35B-Instruct with Docker Model Runner:
docker model run hf.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
🚀 Evaluation Best Practice!
The new Qwen3 models have arrived! These include the coding model Qwen/Qwen3-Coder-480B-A35B-Instruct and the general-purpose model Qwen/Qwen3-235B-A22B-Instruct-2507. Let's quickly assess the performance of these two models using the EvalScope model evaluation framework.
Installing Dependencies
First, install the EvalScope model evaluation framework:
pip install 'evalscope[app]' -U
pip install bfcl-eval # Install bfcl evaluation dependencies
Evaluating Qwen3-Coder Model’s Tool Calling Abilities
To evaluate the model, we need to access its capabilities through an OpenAI-compatible inference service. Here, we use the API interface provided by DashScope. Note that EvalScope also supports inference evaluation using transformers; for details, refer to the documentation.
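The configurations in this guide read the DashScope key with `os.getenv('DASHSCOPE_API_KEY')`, which returns `None` when the variable is unset, so a missing key would likely surface only as failed requests during the run. A small pre-flight check can fail earlier with a clearer message. This is a sketch; `require_env` is a hypothetical helper, not part of EvalScope.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty")
    return value


# Usage, before building the TaskConfig:
# api_key = require_env("DASHSCOPE_API_KEY")
```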
Below is the BFCL-v3 benchmark test to evaluate the Coder model’s tool calling abilities. Configuration details are as follows:
import os
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen3-coder-plus',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='service',  # Use API model service
    datasets=['bfcl_v3'],
    eval_batch_size=10,
    dataset_args={
        'bfcl_v3': {
            'extra_params': {
                # The model refuses to use dots ('.') in function names; set this option to automatically convert dots to underscores during evaluation.
                'underscore_to_dot': True,
                # Whether the model is a function calling model; if true, function-calling-related configs are enabled, otherwise prompt-based bypass is used.
                'is_fc_model': True,
            }
        }
    },
    generation_config={
        'temperature': 0.7,
        'top_p': 0.8,
        'top_k': 20,
        'repetition_penalty': 1.05,
        'max_tokens': 65536,  # Set max generation length
        'parallel_tool_calls': True,  # Enable parallel function calls
    },
    # limit=50,  # Limit number of evaluations for quick testing; remove for full evaluation
    ignore_errors=True,  # Ignore errors; some test cases may be refused by the model
)
run_task(task_cfg=task_cfg)
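The `underscore_to_dot` comment in the configuration refers to function names that contain dots. The idea can be sketched as follows, assuming (as is common for function-calling APIs) that names may only contain letters, digits, underscores and hyphens. This is an illustration only, not the BFCL or EvalScope implementation; `sanitize_names` and the example function names are hypothetical.

```python
import re

# Assumption: the API accepts only letters, digits, underscores and hyphens in names.
_ALLOWED = re.compile(r"^[A-Za-z0-9_-]+$")


def sanitize_names(names):
    """Return a dict mapping API-safe names back to the original (possibly dotted) names."""
    safe_to_original = {}
    for original in names:
        safe = original.replace(".", "_")
        if not _ALLOWED.match(safe):
            raise ValueError(f"{original!r} is still not API-safe after conversion")
        if safe in safe_to_original and safe_to_original[safe] != original:
            raise ValueError(f"{original!r} collides with {safe_to_original[safe]!r}")
        safe_to_original[safe] = original
    return safe_to_original


mapping = sanitize_names(["math.factorial", "spotify.play", "get_weather"])
# A name returned in a tool call is looked up to recover the original name.
restored = mapping["math_factorial"]
```

An explicit lookup table is used for the reverse direction because blindly turning underscores back into dots would corrupt names such as `get_weather`.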
Output results:
The model has strong overall tool calling abilities (0.7199 overall), but there is still significant room for improvement in multi-turn tool calls (0.24 to 0.43) and parallel tool calls (0.375 to 0.56).
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| Model            | Dataset   | Metric          | Subset                  |   Num |   Score | Cat.0        |
+==================+===========+=================+=========================+=======+=========+==============+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_simple             |   257 |  0.8171 | AST_LIVE     |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_multiple           |  1039 |  0.8085 | AST_LIVE     |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_parallel           |    16 |   0.375 | AST_LIVE     |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_parallel_multiple  |    24 |  0.4167 | AST_LIVE     |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | simple                  |   400 |   0.955 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | multiple                |   200 |   0.945 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | parallel                |   200 |    0.55 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | parallel_multiple       |   200 |    0.56 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | java                    |   100 |    0.64 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | javascript              |    50 |    0.82 | AST_NON_LIVE |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | multi_turn_base         |   200 |    0.43 | MULTI_TURN   |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | multi_turn_miss_func    |   200 |    0.24 | MULTI_TURN   |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | multi_turn_miss_param   |   200 |   0.305 | MULTI_TURN   |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | multi_turn_long_context |   200 |   0.385 | MULTI_TURN   |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | irrelevance             |   240 |  0.8458 | RELEVANCE    |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_relevance          |    17 |  0.6471 | RELEVANCE    |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | live_irrelevance        |   881 |  0.8343 | RELEVANCE    |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
| qwen3-coder-plus | bfcl_v3   | AverageAccuracy | OVERALL                 |  4424 |  0.7199 | -            |
+------------------+-----------+-----------------+-------------------------+-------+---------+--------------+
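The OVERALL row is consistent with a sample-weighted mean of the subset scores (an inference from the numbers in the table, not a documented guarantee). The sketch below recomputes it and also derives per-category averages from the Cat.0 column; those per-category figures are not part of the EvalScope output shown, and `weighted_scores` is a hypothetical helper.

```python
from collections import defaultdict

# (subset, num, score, category), copied from the BFCL-v3 results table.
ROWS = [
    ("live_simple", 257, 0.8171, "AST_LIVE"),
    ("live_multiple", 1039, 0.8085, "AST_LIVE"),
    ("live_parallel", 16, 0.375, "AST_LIVE"),
    ("live_parallel_multiple", 24, 0.4167, "AST_LIVE"),
    ("simple", 400, 0.955, "AST_NON_LIVE"),
    ("multiple", 200, 0.945, "AST_NON_LIVE"),
    ("parallel", 200, 0.55, "AST_NON_LIVE"),
    ("parallel_multiple", 200, 0.56, "AST_NON_LIVE"),
    ("java", 100, 0.64, "AST_NON_LIVE"),
    ("javascript", 50, 0.82, "AST_NON_LIVE"),
    ("multi_turn_base", 200, 0.43, "MULTI_TURN"),
    ("multi_turn_miss_func", 200, 0.24, "MULTI_TURN"),
    ("multi_turn_miss_param", 200, 0.305, "MULTI_TURN"),
    ("multi_turn_long_context", 200, 0.385, "MULTI_TURN"),
    ("irrelevance", 240, 0.8458, "RELEVANCE"),
    ("live_relevance", 17, 0.6471, "RELEVANCE"),
    ("live_irrelevance", 881, 0.8343, "RELEVANCE"),
]


def weighted_scores(rows):
    """Return the sample-weighted mean score per category, plus an OVERALL entry."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of num * score, sum of num]
    for _subset, num, score, category in rows:
        for key in (category, "OVERALL"):
            totals[key][0] += num * score
            totals[key][1] += num
    return {key: weighted / count for key, (weighted, count) in totals.items()}


scores = weighted_scores(ROWS)
for name, value in scores.items():
    print(f"{name:<13}{value:.4f}")
```

Because the weighting is by sample count, the large live_multiple and live_irrelevance subsets dominate the overall score, while the small live_parallel subsets (16 and 24 samples) barely affect it.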
Evaluating Qwen3-Instruct Model’s Knowledge and Reasoning Abilities
Below, we use the simple_qa and chinese_simpleqa benchmarks to evaluate the model’s knowledge base, with Qwen2.5-72B used to judge answer correctness. We also use AIME25 to evaluate complex reasoning abilities. Configuration details:
import os
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen3-235b-a22b-instruct-2507',
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type='service',  # Use API model service
    datasets=['simple_qa', 'chinese_simpleqa', 'aime25'],
    eval_batch_size=10,
    generation_config={
        'temperature': 0.7,
        'top_p': 0.8,
        'top_k': 20,
        'max_tokens': 16384,  # Set max generation length
    },
    # limit=20,  # Limit number of evaluations for quick tests; remove for full evaluation
    ignore_errors=True,  # Ignore errors; some test cases may be refused by the model
    stream=True,  # Enable streaming output
    judge_model_args={  # Judge model configuration
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 4096
        }
    },
)
run_task(task_cfg=task_cfg)
Output results:
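The model scores 0.6667 on AIME25, 0.825 on chinese_simpleqa, and 0.6 on simple_qa, indicating good reasoning abilities and a high level of knowledge. Note that the knowledge benchmarks here were run on only 20 questions per subset, so those scores are indicative rather than precise.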
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| Model                         | Dataset          | Metric           | Subset               |   Num |   Score | Cat.0   |
+===============================+==================+==================+======================+=======+=========+=========+
| qwen3-235b-a22b-instruct-2507 | aime25           | AveragePass@1    | AIME2025-I           |    15 |  0.6667 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | aime25           | AveragePass@1    | AIME2025-II          |    15 |  0.6667 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | aime25           | AveragePass@1    | OVERALL              |    30 |  0.6667 | -       |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Chinese Culture      |    20 |    0.65 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Humanities & Social  |    20 |       1 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Engineering & Tech   |    20 |     0.8 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Life, Arts & Culture |    20 |     0.8 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Society              |    20 |     0.9 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | Nature & Science     |    20 |     0.8 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | chinese_simpleqa | is_correct       | OVERALL              |   120 |   0.825 | -       |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| ... (similar for is_incorrect and is_not_attempted) ...
| qwen3-235b-a22b-instruct-2507 | simple_qa        | is_correct       | default              |    20 |     0.6 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | simple_qa        | is_incorrect     | default              |    20 |    0.35 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
| qwen3-235b-a22b-instruct-2507 | simple_qa        | is_not_attempted | default              |    20 |    0.05 | default |
+-------------------------------+------------------+------------------+----------------------+-------+---------+---------+
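The three simple_qa rows are fractions of the same 20 questions and sum to 1. Two derived figures are often reported for SimpleQA-style benchmarks: accuracy on attempted questions, and an F-score defined as the harmonic mean of overall accuracy and accuracy on attempted questions. The sketch below computes both from the table values; these derived metrics are an assumption about how one might summarize the results and are not part of the EvalScope output shown. `simpleqa_summary` is a hypothetical helper.

```python
def simpleqa_summary(is_correct, is_incorrect, is_not_attempted):
    """Derive attempted-only accuracy and an F-score from the three fractions."""
    total = is_correct + is_incorrect + is_not_attempted
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"fractions should sum to 1, got {total}")
    attempted = is_correct + is_incorrect
    correct_given_attempted = is_correct / attempted if attempted else 0.0
    denominator = is_correct + correct_given_attempted
    f_score = 2 * is_correct * correct_given_attempted / denominator if denominator else 0.0
    return {
        "attempted": attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


# Values taken from the simple_qa rows of the results table.
summary = simpleqa_summary(is_correct=0.6, is_incorrect=0.35, is_not_attempted=0.05)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Separating attempted-only accuracy from overall accuracy distinguishes a model that answers wrongly from one that declines to answer.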
For more supported benchmarks, please see the documentation.
Result Visualization
EvalScope supports visualizing results so you can see the model’s specific outputs.
Run the following command to launch the Gradio-based visualization interface:
evalscope app
Select the evaluation report and click "Load" to view the model’s output for each question, as well as the overall accuracy.
Summary
This guide showed how to use the EvalScope framework to evaluate the two new models, Qwen3-Coder and Qwen3-Instruct:
- Qwen3-Coder model: Tool calling abilities were evaluated with the BFCL-v3 benchmark. The model performed well overall (0.7199), but there is room for improvement in multi-turn and parallel calls.
- Qwen3-Instruct model: Knowledge and reasoning abilities were assessed with the simple_qa, chinese_simpleqa, and AIME25 benchmarks, with scores of 0.6, 0.825, and 0.6667 respectively.
For the complete evaluation process and documentation, please refer to the official EvalScope documentation.

