captainspock committed on May 7, 2025

Commit eea99f9 (verified) · 1 Parent(s): 0214bbc

Update README.md

Files changed (1)

README.md +109 -3

@@ -1,3 +1,109 @@
----
-license: mit
----

+---
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- ONNX
+- ONNXRuntime
+license: mit
+---
+## DeepSeek-R1-Distill-Qwen ONNX models
+https://huggingface.co/onnxruntime/DeepSeek-R1-Distill-ONNX/resolve/main/deepseek-r1-distill-qwen-1.5B/gpu/gpu-int4-rtn-block-32/
+This repository hosts the optimized versions of [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/) and [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/) to accelerate inference with ONNX Runtime.
+Optimized models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
+To get started quickly, use the ONNX Runtime Generate() API. See the [DeepSeek Python tutorial](https://github.com/microsoft/onnxruntime/blob/gh-pages/docs/genai/tutorials/deepseek-python.md) for instructions.
+For CPU:
+```bash
+# Download the model directly using the Hugging Face CLI
+huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/cpu_and_mobile/*" --local-dir .
+# Install the CPU package of ONNX Runtime GenAI
+pip install onnxruntime-genai
+# Download the example chat script
+curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
+# Adjust the model directory (-m) to match your download location
+python model-chat.py -m /path/to/cpu-int4-rtn-block-32-acc-level-4/ -e cpu --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
+```
+For CUDA:
+```bash
+# Download the model directly using the Hugging Face CLI
+huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/gpu/*" --local-dir .
+# Install the CUDA package of ONNX Runtime GenAI
+pip install onnxruntime-genai-cuda
+# Download the example chat script
+curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
+# Adjust the model directory (-m) to match your download location
+python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e cuda --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
+```
+For DirectML:
+```bash
+# Download the model directly using the Hugging Face CLI
+huggingface-cli download onnxruntime/DeepSeek-R1-Distill-ONNX --include "deepseek-r1-distill-qwen-1.5B/gpu/*" --local-dir .
+# Install the DirectML package of ONNX Runtime GenAI
+pip install onnxruntime-genai-directml
+# Download the example chat script
+curl -o model-chat.py https://raw.githubusercontent.com/microsoft/onnxruntime-genai/refs/heads/main/examples/python/model-chat.py
+# Adjust the model directory (-m) to match your download location
+python model-chat.py -m /path/to/gpu-int4-rtn-block-32/ -e dml --chat_template "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"
+```
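The `--chat_template` argument in the commands above wraps each user message in the model's special tokens, with `{input}` as the placeholder. The following is a minimal sketch of that substitution in plain Python; the function name `build_prompt` is hypothetical, and it assumes the script simply replaces `{input}` with the user's text.

```python
# Template string copied from the commands in this README.
CHAT_TEMPLATE = "<|begin▁of▁sentence|><|User|>{input}<|Assistant|>"


def build_prompt(user_input: str, template: str = CHAT_TEMPLATE) -> str:
    """Return the prompt with the user's message in place of {input}.

    Hypothetical helper: str.replace is used instead of str.format so that
    braces inside the user's message are passed through unchanged.
    """
    return template.replace("{input}", user_input)


if __name__ == "__main__":
    print(build_prompt("Who are you?"))
```

Building the prompt this way makes it easy to check the exact string sent to the tokenizer before running a full generation.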
+## ONNX Models
+Here are some of the optimized configurations we have added:
+1. ONNX model for CPU and mobile using int4 quantization via RTN.
+2. ONNX model for GPU using int4 quantization via RTN.
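To illustrate what "int4 quantization via RTN" (round-to-nearest) with a block size of 32 could look like, here is a minimal sketch in plain Python. It is a generic illustration and not ONNX Runtime's implementation: the function names, the symmetric signed range [-8, 7], and the per-block scale of `max(|w|) / 7` are all assumptions made for this example.

```python
def quantize_rtn_int4(weights, block_size=32):
    """Quantize a flat list of floats block by block with round-to-nearest.

    Each block gets its own scale so that its largest magnitude maps to 7.
    Returns a list of (scale, quantized_values) pairs, one per block.
    """
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Avoid dividing by zero for an all-zero block.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the signed 4-bit range.
        quantized = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quantized))
    return blocks


def dequantize_rtn_int4(blocks):
    """Reconstruct approximate float weights from (scale, values) pairs."""
    return [scale * q for scale, values in blocks for q in values]


if __name__ == "__main__":
    weights = [i / 10 for i in range(-32, 32)]  # 64 values, two blocks of 32
    blocks = quantize_rtn_int4(weights)
    restored = dequantize_rtn_int4(blocks)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"blocks: {len(blocks)}, worst absolute error: {worst:.4f}")
```

With this scheme the reconstruction error of each weight is at most half of its block's scale, which is why smaller blocks generally track the original weights more closely at the cost of storing more scales.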
+## Performance
+ONNX enables you to run your models on-device across CPU, GPU, and NPU. With ONNX, you can run your models on any machine across all silicon (Qualcomm, AMD, Intel, NVIDIA, etc.).
+See the table below for some key benchmarks for Windows GPU and CPU devices that the ONNX models were tested on.
+| **Model** | **Format** | **Precision** | **Execution Provider** | **Device** | **Token Generation Throughput** | **Speed-up vs. base model** |
+| :------------: | :------------: | :------------: | :------------: | :------------: | :------------: | :------------: |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX | fp16 | CUDA | RTX 4090 | 197.195 | 4X |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX | int4 | CUDA | RTX 4090 | 313.32 | 6.3X |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B | ONNX | int4 | CPU | Intel i9 | 11.749 | 1.4X |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX | fp16 | CUDA | RTX 4090 | 57.316 | 1.3X |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX | int4 | CUDA | RTX 4090 | 161.00 | 3.7X |
+| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B | ONNX | int4 | CPU | Intel i9 | 3.184 | 20X |
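The throughput and speed-up columns can be related with simple arithmetic. The sketch below assumes that the speed-up is the ratio of the ONNX throughput to the base model's throughput on the same device; the table does not state this, so treat the derived baseline figures as rough estimates. The helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(throughput: float, speedup: float) -> float:
    """Estimate base-model throughput, assuming speedup = onnx / baseline."""
    return throughput / speedup


# Values copied from the CUDA rows for the 1.5B model in the table.
FP16_1_5B = 197.195
INT4_1_5B = 313.32

if __name__ == "__main__":
    print(f"implied baseline from fp16 row: {implied_baseline(FP16_1_5B, 4.0):.2f}")
    print(f"implied baseline from int4 row: {implied_baseline(INT4_1_5B, 6.3):.2f}")
    # Relative gain of int4 over fp16 on the same GPU.
    print(f"int4 vs fp16 ratio: {INT4_1_5B / FP16_1_5B:.2f}")
```

If the assumption holds, rows for the same model and device should imply roughly the same baseline, which gives a quick consistency check on the reported figures.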
+CPU build specs:
+- onnxruntime-genai==0.6.0-dev
+- transformers==4.46.2
+- onnxruntime==1.20.1
+CUDA build specs:
+- onnxruntime-genai-cuda==0.6.0-dev
+- transformers==4.46.2
+- onnxruntime-gpu==1.20.1
+## Model Description
+- **Developed by:**  ONNX Runtime
+- **Model type:** ONNX
+- **Language(s) (NLP):** Python, C, C++
+- **License:** MIT
+- **Model Description:** This is a conversion of the DeepSeek-R1-Distill-Qwen models for ONNX Runtime inference.
+- **Disclaimer:** This model is only an optimization of the base model. Any risk associated with the model is the responsibility of its user. Please verify and test for your scenarios. With the optimizations applied, output may differ slightly from that of the base model.
+## Base Model Information
+See HF links [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/) and [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/) for details.