Instructions to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

SGLang

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
```

Tokenizer config is wrong

#10

by stoshniwal - opened Jan 21, 2025

Discussion

stoshniwal

Jan 21, 2025

edited Jan 22, 2025

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/blob/2d78713b01ecefe27a89fafec248a5dfd731396f/tokenizer_config.json#L33

The tokenizer class on this line should be changed: LlamaTokenizerFast -> Qwen2Tokenizer

JaheimLee

Jan 22, 2025

Qwen always uses Qwen2Tokenizer.

stoshniwal

Jan 22, 2025

edited Jan 22, 2025

Sorry, I updated the tokenizer class in the first comment. The current tokenizer config states the tokenizer class as LlamaTokenizerFast.
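
Editor's note: the check described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the class is stored under the `tokenizer_class` key of tokenizer_config.json. The helper names and the trimmed-down sample string are hypothetical stand-ins, not the real file.

```python
import json
from typing import Optional


def declared_tokenizer_class(tokenizer_config_text: str) -> Optional[str]:
    """Return the `tokenizer_class` value declared in tokenizer_config.json text."""
    return json.loads(tokenizer_config_text).get("tokenizer_class")


def matches_expected(tokenizer_config_text: str, expected: str = "Qwen2Tokenizer") -> bool:
    """Check whether the declared class equals the one the thread expects."""
    return declared_tokenizer_class(tokenizer_config_text) == expected


# Hypothetical one-key stand-in for the real tokenizer_config.json.
sample = '{"tokenizer_class": "LlamaTokenizerFast"}'

print(declared_tokenizer_class(sample))
print(matches_expected(sample))
```

To run it against a real checkout, read the file's text from disk and pass it in place of `sample`.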

jsalix

Jan 22, 2025

@bartowski sorry if this is something you were already aware of, but could this be causing some of the issues with local usage? I checked, and it seems all the Qwen-based distills have the same Llama tokenizer class instead of the Qwen one used on the respective base models.

bartowski

Jan 23, 2025

It seems unlikely, since llama.cpp uses its own tokenizer. However, it's possible that the existing conversion code was based on an incorrect tokenizer.

But that should still not be a problem with the final result, I think.

I've seen people get better results with a lower temperature and proper prompting.

@ngxson any thoughts?
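
Editor's note: for anyone wanting to try the lower-temperature suggestion against a local OpenAI-compatible server, the request might look like the sketch below. The value 0.6 is an illustrative assumption and does not come from this thread. The helper names are hypothetical, and the default URL assumes a server listening on port 8000.

```python
import json
import urllib.request

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


def build_chat_request(prompt: str, model: str = MODEL_ID, temperature: float = 0.6) -> dict:
    """Build a chat-completions payload with an explicit sampling temperature.

    The default of 0.6 is an assumed example value, not one taken from the thread.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to an OpenAI-compatible server (requires one to be running)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the payload needs no server; sending it does.
payload = build_chat_request("What is the capital of France?")
print(json.dumps(payload, indent=2))
```

Only `send_chat_request` touches the network, so the payload can be inspected or logged before anything is sent.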

ngxson

Jan 23, 2025

For GGUF, the tokenizer is determined by the model class, not the tokenizer class, so the value in tokenizer_config.json doesn't matter.
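
Editor's note: the idea can be illustrated with a toy dispatch. This is not llama.cpp's conversion code. It is a hypothetical sketch in which the vocabulary handling is chosen from the model architecture alone, so the `tokenizer_class` field never influences the result. The registry, handler labels, and function name are invented for illustration, and the architecture name used for the Qwen-based distill is an assumption.

```python
# Hypothetical registry: model architecture name -> label of the vocab handler.
# Neither the keys' handlers nor the labels are taken from llama.cpp.
VOCAB_HANDLERS = {
    "Qwen2ForCausalLM": "vocab_handler_for_qwen2",
    "LlamaForCausalLM": "vocab_handler_for_llama",
}


def pick_vocab_handler(model_config: dict, tokenizer_config: dict) -> str:
    """Choose the vocab handler from the model architecture only.

    `tokenizer_config` is accepted to make the point explicit: its
    `tokenizer_class` value is deliberately never consulted.
    """
    architecture = model_config["architectures"][0]
    return VOCAB_HANDLERS[architecture]


# Assumed architecture for the Qwen-based distill.
model_config = {"architectures": ["Qwen2ForCausalLM"]}

with_llama_class = pick_vocab_handler(model_config, {"tokenizer_class": "LlamaTokenizerFast"})
with_qwen_class = pick_vocab_handler(model_config, {"tokenizer_class": "Qwen2Tokenizer"})

# Both calls select the same handler, whatever tokenizer_class says.
print(with_llama_class == with_qwen_class)
```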

bartowski

Jan 23, 2025

That's what I thought, thanks for confirming!

Fizzarolli

Jan 24, 2025

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/discussions/4
I think all the Qwen ones may or may not be completely busted and have the wrong tokenizer config and special tokens (both in llama.cpp and transformers) :/

jamesbraza

Jan 30, 2025

To share, here's a separate reason the tokenizer config is dangerous: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B/discussions/21
