Instructions for using Qwen/QwQ-32B with libraries, inference servers, and local apps.

Libraries

How to use Qwen/QwQ-32B with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/QwQ-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

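```
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# QwQ writes out its reasoning before the answer, so a budget of 40 new tokens
# would stop mid-thought; 512 is an arbitrary larger value, raise it as needed.
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```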


vLLM

How to use Qwen/QwQ-32B with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/QwQ-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/QwQ-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```
docker model run hf.co/Qwen/QwQ-32B
```

SGLang

How to use Qwen/QwQ-32B with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/QwQ-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/QwQ-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
# Start the SGLang server from the Docker image:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/QwQ-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Qwen/QwQ-32B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use Qwen/QwQ-32B with Docker Model Runner:
```
docker model run hf.co/Qwen/QwQ-32B
```

Refining QWQ Model Output: Direct Responses Without Step-by-Step Reasoning

#39

by gslinx - opened Mar 7, 2025

Discussion

gslinx

Mar 7, 2025

The QwQ model demonstrates impressive capabilities, producing highly accurate and relevant results. However, I would like to discuss whether it is possible for the model to generate outputs without displaying its thought process. While transparency in reasoning is valuable in some contexts, there are cases where a direct response without the step-by-step reasoning would be preferable.

Would it be feasible to implement an option that allows users to toggle the visibility of the thought process, depending on their needs?

MrDevolver

Mar 11, 2025

> The QwQ model demonstrates impressive capabilities, producing highly accurate and relevant results. However, I would like to discuss whether it is possible for the model to generate outputs without displaying its thought process. While transparency in reasoning is valuable in some contexts, there are cases where a direct response without the step-by-step reasoning would be preferable.
>
> Would it be feasible to implement an option that allows users to toggle the visibility of the thought process, depending on their needs?

So you're asking for two different things: hiding the thinking process and removing it entirely.

If you're okay with the thinking process being generated, a UI such as LM Studio can hide it for you. Alternatively, if you're calling inference from code, you can cut off the thinking part after inference, before showing the output to the user (or doing anything else with that output).
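
The post-processing route can be sketched roughly as follows. This is a minimal sketch, not an official API: it assumes the reasoning in the decoded output is delimited by `<think>` and `</think>` tags, and that the opening tag may be missing from the generated text because the chat template can already place it in the prompt. `strip_reasoning` is a hypothetical helper name, not part of any library.

```python
THINK_OPEN = "<think>"    # assumed opening delimiter of the reasoning block
THINK_CLOSE = "</think>"  # assumed closing delimiter of the reasoning block


def strip_reasoning(text: str) -> str:
    """Return the text after the reasoning block, or the text itself if there is none."""
    # Split on the closing tag only, since the opening tag may not be in the output.
    _, closing_tag, answer = text.partition(THINK_CLOSE)
    if closing_tag:
        return answer.strip()
    # No closing tag: either there was no reasoning, or generation stopped mid-thought.
    if text.lstrip().startswith(THINK_OPEN):
        return ""  # unfinished reasoning, so there is no answer to show
    return text.strip()


raw = "<think>\nThe user asks for the capital of France.\n</think>\n\nParis."
print(strip_reasoning(raw))  # prints "Paris."
```

Note that the model still generates (and you still pay for) the reasoning tokens; this only keeps them from reaching the user. If the output was truncated and carries neither tag, the helper cannot tell reasoning from an answer and returns the text as-is.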

If you don't want the thinking process to be there in the first place, then this model is simply not for you, and you may want to use Qwen 2.5 32B instead (the base model that QwQ-32B was built upon). The main practical difference between them is that QwQ-32B was built to be a thinking model, whereas Qwen 2.5 32B gives straight answers.
