Instructions to use QuixiAI/DeepSeek-R1-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use QuixiAI/DeepSeek-R1-AWQ with Transformers:
Use a pipeline as a high-level helper:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

Or load the model directly:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("QuixiAI/DeepSeek-R1-AWQ", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use QuixiAI/DeepSeek-R1-AWQ with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "QuixiAI/DeepSeek-R1-AWQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker:
```shell
docker model run hf.co/QuixiAI/DeepSeek-R1-AWQ
```
- SGLang
How to use QuixiAI/DeepSeek-R1-AWQ with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "QuixiAI/DeepSeek-R1-AWQ" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images:
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "QuixiAI/DeepSeek-R1-AWQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuixiAI/DeepSeek-R1-AWQ",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use QuixiAI/DeepSeek-R1-AWQ with Docker Model Runner:
```shell
docker model run hf.co/QuixiAI/DeepSeek-R1-AWQ
```
When I use vLLM v0.7.2 to deploy R1 AWQ, I get empty content:

```shell
curl http://localhost:23336/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-reasoner",
    "messages": [
      {"role": "user", "content": "你是谁"}
    ],
    "stream": true,
    "temperature": 1.2
  }'
```
```
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
...
```

(the remaining chunks are identical, all with empty "content" deltas)
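As a quick sanity check that the stream really carries no text, a small diagnostic sketch (standalone; the sample lines are abbreviated copies of the chunks above, and the helper name is my own) can accumulate the delta contents:

```python
import json

def collect_stream_content(sse_lines):
    """Concatenate choices[0].delta.content across OpenAI-style SSE chunks."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content") or "")
    return "".join(parts)

sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":null}]}',
    'data: [DONE]',
]
print(collect_stream_content(sample) == "")  # True: every delta is empty
```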
```shell
python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization moe_wna16 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 \
  --calculate-kv-scales \
  --served-model-name deepseek-reasoner \
  --model ${LLM_MODEL_DIR}
```
Same errors here. If you set "skip_special_tokens" to false when sampling, you'll find it's not actually empty content but repeated <|begin_of_sentence|> tokens. And if you request logprobs, the server yields an error because of NaN values.
Looking for someone's help...
Please disable KV cache quantization.
Tried that, but it's still the same bug.
Try building from source.
I used SGLang to deploy R1 AWQ on one node with 8× A800 and also get the same empty content for some questions.
My command is below:
```shell
python3 -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model-path models/DeepSeek-R1-AWQ \
  --tp 8 \
  --enable-p2p-check \
  --trust-remote-code \
  --dtype float16 \
  --mem-fraction-static 0.9 \
  --served-model-name deepseek-r1-awq \
  --disable-cuda-graph
```
So, did anyone deploy it successfully?
This might be related to the float16 overflow issue; please try the moe_wna16 kernel with bfloat16.
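A minimal sketch of what that suggestion could look like with the vLLM command posted earlier (only --quantization moe_wna16 and --dtype bfloat16 are the point here; the KV-cache quantization flags are dropped per the earlier advice, and the port/paths are illustrative, not prescriptive):

```shell
python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --quantization moe_wna16 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.92 \
  --served-model-name deepseek-reasoner \
  --model ${LLM_MODEL_DIR}
```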
I deployed it successfully on vLLM 0.7.2 with 2 × 8 A100 (40G). But for any Chinese query, the model skips thinking and the reply is very simple.
Try downloading https://huggingface.co/deepseek-ai/DeepSeek-R1/tokenizer_config.json and replacing your DeepSeek-R1-AWQ/tokenizer_config.json with it.
If that works, you may still face the problem that the model outputs without the '<think>' label.
In DeepSeek-R1's documentation, their advice is: "To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with '<think>\n' at the beginning of every output."
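One way to apply that advice is to render the chat template to a string and append the tag before generation. The helper below is only a sketch of the string handling (the function name and the sample prompt are my own; the actual rendered prompt comes from tokenizer.apply_chat_template with tokenize=False and add_generation_prompt=True):

```python
def force_think_prefix(rendered_prompt: str) -> str:
    """Append "<think>\n" to a rendered chat prompt so the model is forced
    to begin its reply inside a reasoning block. Idempotent: does nothing
    if the prefix is already present."""
    prefix = "<think>\n"
    if rendered_prompt.endswith(prefix):
        return rendered_prompt
    return rendered_prompt + prefix

# Illustrative rendered prompt (not the real DeepSeek template output):
prompt = "<|User|>你是谁<|Assistant|>"
prompt = force_think_prefix(prompt)
print(prompt.endswith("<think>\n"))  # True
```

The resulting string would then be sent to a raw /v1/completions endpoint (not /v1/chat/completions, which re-applies the template).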
@v2ray Hi, why is there such a big difference between DeepSeek-R1/tokenizer_config.json and DeepSeek-R1-AWQ/tokenizer_config.json? Thanks.
I'm currently hitting a similar problem, but only intermittently; it's triggered when the context is long. Is this caused by an issue with the weights, and how should it be fixed? Has anyone solved it?
@Eric108 There's not much difference; I just modified the chat template to add prefill ability. You can ignore the rest of the differences, as they don't actually matter; they're just formatting.
@xiaolizztg You can force it to reason by prefilling <think>\n.