Instructions to use inclusionAI/Qwen3-32B-AWorld with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use inclusionAI/Qwen3-32B-AWorld with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="inclusionAI/Qwen3-32B-AWorld")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Qwen3-32B-AWorld")
model = AutoModelForCausalLM.from_pretrained("inclusionAI/Qwen3-32B-AWorld")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use inclusionAI/Qwen3-32B-AWorld with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "inclusionAI/Qwen3-32B-AWorld"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Qwen3-32B-AWorld",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/inclusionAI/Qwen3-32B-AWorld

SGLang

How to use inclusionAI/Qwen3-32B-AWorld with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "inclusionAI/Qwen3-32B-AWorld" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Qwen3-32B-AWorld",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "inclusionAI/Qwen3-32B-AWorld" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "inclusionAI/Qwen3-32B-AWorld",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use inclusionAI/Qwen3-32B-AWorld with Docker Model Runner:
```
docker model run hf.co/inclusionAI/Qwen3-32B-AWorld
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Qwen3-32B-AWorld

Model Description

Qwen3-32B-AWorld is a large language model fine-tuned from Qwen3-32B, specializing in agent capabilities and proficient tool usage. The model excels at complex agent-based tasks through precise integration with external tools, achieving a pass@1 score on the GAIA benchmark that surpasses GPT-4o and is comparable to DeepSeek-V3.

Quick Start

This guide provides instructions for quickly deploying and running inference with Qwen3-32B-AWorld using vLLM.

Deployment with vLLM

To deploy the model, use the following vllm serve command:

vllm serve inclusionAI/Qwen3-32B-AWorld \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--max-model-len 131072 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 \
--tensor-parallel-size 8 \
--enable-auto-tool-choice \
--tool-call-parser hermes

Key Configuration:

Deployment Recommendation: We recommend deploying the model on 8 GPUs to enhance concurrency. The tensor-parallel-size argument should be set to the number of GPUs you are using (e.g., 8 in the command above).
Tool Usage Flags: To enable the model's tool-calling capabilities, it is crucial to include the --enable-auto-tool-choice and --tool-call-parser hermes flags. These ensure that the model can correctly process tool calls and parse the results.

Making Inference Calls

When making an inference request, you must include the tools you want the model to use. The format should follow the official OpenAI API specification.

Here is a complete Python example for making an API call to the deployed model using the requests library. This example demonstrates how to query the model with a specific tool.

import requests
import json

# Define the tools available for the model to use
tools = [
    {
      "type": "function",
      "function": {
        "name": "mcp__google-search__search",
        "description": "Perform a web search query",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {
              "description": "Search query",
              "type": "string"
            },
            "num": {
              "description": "Number of results (1-10)",
              "type": "number"
            }
          },
          "required": [
            "query"
          ]
        }
      }
    }
]

# Define the user's prompt
messages = [
    {
        "role": "user", 
        "content": "Search for hangzhou's weather today."
    }
]

# Set generation parameters
temperature = 0.6
top_p = 0.95
top_k = 20
min_p = 0

# Prepare the request payload
data = {
    "messages": messages,
    "tools": tools,
    "temperature": temperature,
    "top_p": top_p,
    "top_k": top_k,
    "min_p": min_p,
}

# The endpoint for the vLLM OpenAI-compatible server
# Replace {your_ip} and {your_port} with the actual IP address and port of your server.
url = "http://{your_ip}:{your_port}/v1/chat/completions"

# Send the POST request
response = requests.post(
    url,
    headers={"Content-Type": "application/json"},
    data=json.dumps(data)
)

# Print the response from the server
print("Status Code:", response.status_code)
print("Response Body:", response.text)

Note:

Remember to replace {your_ip} and {your_port} in the url variable with the actual IP address and port where your vLLM server is running. The default port is typically 8000.