running in vllm gives error

by GrigoriiA - opened Jun 1, 2025

Discussion

GrigoriiA

Jun 1, 2025

•

edited Jun 1, 2025

Did you actually run it in vLLM? It requires dtype=float16 and still does not run: it fails with an assertion error about the quantization method, which I think means this quantization is not supported for this model in vLLM yet. My vLLM version is 0.8.5.
If you did run it, which parameters did you use?
Thanks.
This is the end of the error:

```
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 123, in __init__
[rank0]:     self.experts = FusedMoE(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 499, in __init__
[rank0]:     assert self.quant_method is not None
[rank0]: AssertionError
```

GrigoriiA

Jun 1, 2025

Got it working. If anyone else hits this problem: the "quantization" parameter should be "awq_marlin", not "awq".
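For anyone launching the server from a script, the fix might look roughly like the sketch below. It only builds the vllm serve argument list with --quantization awq_marlin set explicitly. `build_serve_command` is a hypothetical helper, not part of vLLM, and the dtype and tensor-parallel defaults are simply the values reported elsewhere in this thread, so adjust them to your hardware.

```python
import shlex

MODEL_ID = "adamo1139/DeepSeek-R1-0528-AWQ"


def build_serve_command(quantization="awq_marlin", dtype="float16",
                        tensor_parallel_size=4):
    """Return a `vllm serve` argument list (hypothetical helper, not a vLLM API)."""
    return [
        "vllm", "serve", MODEL_ID,
        # "awq" is the value that led to the AssertionError; "awq_marlin" worked.
        "--quantization", quantization,
        "--dtype", dtype,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


command = build_serve_command()
# Print the command so it can be copied into a shell.
print(shlex.join(command))
```

To start the server from Python instead of copying the printed command, the list can be passed to `subprocess.run(command, check=True)` on a machine with vLLM installed and enough GPU memory.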

adamo1139

Owner Jun 2, 2025

Hi.

Yes, I did run it in vLLM 0.9.0.1 as well as 0.8.5 on 8x H100, with a fresh vLLM install on a fresh Ubuntu 22.04. The simple command vllm serve adamo1139/DeepSeek-R1-0528-AWQ --tensor-parallel 8 was enough to make it work, as vLLM figures out on its own to use the awq_marlin kernel and presumably also the right dtype. For what it's worth, it loads fine for me with both --dtype float16 and --dtype bfloat16. What GPUs were you using?

GrigoriiA

Jun 2, 2025

•

edited Jun 4, 2025

I used 4x H200. That's enough memory-wise.
vLLM v0.8.5, tensor_parallel=4, dtype=float16, quantization=awq_marlin. With these parameters it works.
I tried it on runpod.io's serverless. It makes no sense to use it there, at least not with network volumes, because the load time is more than 1 minute.

adamo1139

Owner Jun 4, 2025

I'm not able to replicate that - when running vLLM 0.8.5 (vllm serve) on 4x H200 (vast.ai) with tensor parallel 2 and awq_marlin quantization, I get OOM. With --tensor-parallel 4 it works. Are you using it with offline inference or vllm serve? If it's offline inference, can you share the relevant code snippet?
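A rough back-of-the-envelope check is consistent with the OOM at tensor parallel 2. The sketch below assumes 4-bit weights for all 671B parameters and 141 GB per H200, and ignores quantization scales, unquantized layers, activations, and KV cache, so real usage is higher than this estimate. The function names are illustrative only.

```python
PARAMS = 671e9         # DeepSeek-R1 total parameter count
BITS_PER_WEIGHT = 4    # assumption: every weight is stored in 4 bits
H200_MEMORY_GB = 141   # memory per H200 GPU


def weights_gb():
    """Approximate size of the quantized weights in GB (lower bound)."""
    return PARAMS * BITS_PER_WEIGHT / 8 / 1e9


def weights_fit(num_gpus, gpu_memory_gb=H200_MEMORY_GB):
    """True if the weights alone fit in the combined memory of num_gpus GPUs."""
    return weights_gb() < num_gpus * gpu_memory_gb


for n in (2, 4):
    verdict = "weights fit" if weights_fit(n) else "weights do not fit"
    print(f"{n} GPUs, {n * H200_MEMORY_GB} GB total: {verdict}")
```

With tensor parallel 2 only two of the four GPUs hold the model, so the relevant budget is the combined memory of two cards, not four.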

GrigoriiA

Jun 4, 2025

I'm sorry, I noticed and corrected my typo. Tensor parallel was 4 of course.
As I stated in my 2nd message, I got it working. The setup was 4x H200, runpod.io with runpod's vllm docker container of vllm 0.8.5, with --tensor-parallel 4 and awq_marlin.
That setup didn't work with quantization set to awq, and that was my problem. I changed it to awq_marlin, and it worked.
Sorry for any confusion.
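For offline inference, the same settings could be passed to vLLM's Python API roughly as sketched below. No offline code was shared in this thread, so this is an assumption about how the working configuration maps onto `vllm.LLM`; `engine_kwargs` and `run_offline` are hypothetical helper names, and `max_tokens=256` is an arbitrary choice.

```python
MODEL_ID = "adamo1139/DeepSeek-R1-0528-AWQ"


def engine_kwargs():
    """Engine settings reported to work in this thread (4x H200, vLLM 0.8.5)."""
    return {
        "model": MODEL_ID,
        "quantization": "awq_marlin",  # "awq" led to the AssertionError
        "dtype": "float16",
        "tensor_parallel_size": 4,
    }


def run_offline(prompt):
    """Load the model and answer one chat prompt (needs vLLM and the GPUs)."""
    # Imported lazily so the settings can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    params = SamplingParams(max_tokens=256)  # arbitrary length limit
    outputs = llm.chat([{"role": "user", "content": prompt}], params)
    return outputs[0].outputs[0].text
```

Calling `run_offline("Who are you?")` loads the full model, so it is only practical on a machine comparable to the ones discussed here.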

adamo1139

Owner Jun 4, 2025

I got confused a bit too and forgot about awq_marlin being the focus of the issue. I updated the readme.

adamo1139 changed discussion status to closed Jun 4, 2025
