Instructions for using unsloth/DeepSeek-R1-GGUF with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use unsloth/DeepSeek-R1-GGUF with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
```
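Note that this repo ships GGUF files rather than safetensors, so the generic snippet above may not resolve weights directly. Transformers can also load a GGUF checkpoint by dequantizing it via the `gguf_file` argument; a minimal sketch with a placeholder quant path (pick a real file from the repo, and note that dequantizing a model this large needs enormous RAM):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "unsloth/DeepSeek-R1-GGUF"
# Placeholder path: substitute an actual .gguf file from the repository.
gguf_file = "DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf"

# Transformers dequantizes the GGUF weights on load.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```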
- llama-cpp-python
How to use unsloth/DeepSeek-R1-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    filename="DeepSeek-R1-BF16/DeepSeek-R1.BF16-00001-of-00030.gguf",
)
```
```python
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
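`create_chat_completion` can also stream tokens as they are generated; a minimal sketch reusing the `llm` object loaded above (`stream=True` yields OpenAI-style delta chunks):

```python
# Stream the reply token by token instead of waiting for the full completion.
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```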
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/DeepSeek-R1-GGUF with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
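However you installed it, once llama-server is running, any OpenAI-compatible client can talk to it; a minimal sketch using the `requests` library, assuming the server's default port 8080:

```python
import requests

# llama-server listens on port 8080 by default (override with --port).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```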
- LM Studio
- Jan
- vLLM
How to use unsloth/DeepSeek-R1-GGUF with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/DeepSeek-R1-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
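With the server from the pip step running (vLLM's default port is 8000), the official `openai` Python client works as well, since the API is OpenAI-compatible; a minimal sketch:

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API; the api_key value is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="unsloth/DeepSeek-R1-GGUF",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```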
- SGLang
How to use unsloth/DeepSeek-R1-GGUF with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "unsloth/DeepSeek-R1-GGUF" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "unsloth/DeepSeek-R1-GGUF" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Ollama
How to use unsloth/DeepSeek-R1-GGUF with Ollama:
```sh
ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
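The same model can also be driven from Python via the `ollama` package; a minimal sketch, assuming the Ollama server is running locally and the model has been pulled with the command above:

```python
import ollama  # pip install ollama

# Uses the same model reference as the `ollama run` command above.
response = ollama.chat(
    model="hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response["message"]["content"])
```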
- Unsloth Studio
How to use unsloth/DeepSeek-R1-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```sh
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
Use Hugging Face Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
- Docker Model Runner
How to use unsloth/DeepSeek-R1-GGUF with Docker Model Runner:
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
- Lemonade
How to use unsloth/DeepSeek-R1-GGUF with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.DeepSeek-R1-GGUF-Q4_K_M
```
List all available models
```sh
lemonade list
```
Perplexity comparison results (Updated)
I asked myself how the dynamic quants compare in accuracy to the usual quants.
The question of benchmarks has also come up here repeatedly.
The only metric that was feasible on my limited system was perplexity (which requires only one run per quant).
Settings: -c 1024 -b 1024 (and, for the four dynamic quants, KV cache type q4_0).
The tests are based on a custom text file to limit the number of chunks (besides, wiki.test produced nan errors in very early chunks).
In some tests llama-perplexity kept hitting nan errors, so it could not produce a final PPL (llama-perplexity uses its own calculation, not the simple average of all chunk values). Nevertheless, at least 16 of the 40 chunks always completed. The first chunks are volatile, but the same holds for wiki.test; that's why I think it's good to require a certain minimum number of chunks.
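For concreteness, each run amounts to roughly the following llama-perplexity invocation, driven here from Python (a sketch: the model and text-file names are placeholders, and -ctk q4_0 was used only for the dynamic quants):

```python
import subprocess

subprocess.run([
    "./llama-perplexity",
    "-m", "DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder: the quant under test
    "-f", "custom.txt",                 # placeholder: the custom text file
    "-c", "1024",
    "-b", "1024",
    "-ctk", "q4_0",                     # KV cache type, for the dynamic quants
], check=True)
```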
A mixture of different GGUFs was tested, including all four dynamic quants from unsloth and some others. The reference point is Q5_K_M (higher was not possible on my system). Bartowski mentioned that they come from the same source model, so they can probably be compared on the same basis, and I threw the unsloth and bartowski quants together.
The graph is based on all chunks that worked (at least 16 of 40).
The delta% is based on the average value of the first 16 chunks (which all quants completed).
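In code, the delta% works out to roughly the following (a sketch: the per-chunk PPL lists are assumed to have been collected from the llama-perplexity output, with the Q5_K_M chunks as the reference):

```python
import math

def delta_pct(ref_chunks, test_chunks, n=16):
    """Percent PPL difference vs. the reference, averaged over the first n
    chunks that both runs completed (nan chunks are skipped pairwise)."""
    pairs = [(r, t) for r, t in zip(ref_chunks, test_chunks)
             if not (math.isnan(r) or math.isnan(t))][:n]
    ref_mean = sum(r for r, _ in pairs) / len(pairs)
    test_mean = sum(t for _, t in pairs) / len(pairs)
    return 100.0 * (test_mean - ref_mean) / ref_mean

# Example (hypothetical chunk lists):
# delta_pct(q5_k_m_chunks, ud_q2_k_xl_chunks)
```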
Graphically, the quant results broadly cluster into four areas, and within each area the quants are close to each other:
- UD_IQ1_S(unsloth)
- UD_IQ1_M(unsloth)
- UD_IQ2_XXS(unsloth), Q2_K(bartowski), UD_Q2_K_XL(unsloth)
- IQ3_M(bartowski), IQ4_XS(bartowski), Q4_K_S(bartowski), Q5_K_M(unsloth)
Conclusions for me with regard to the dynamic quants:
- UD_IQ2_XXS and UD_Q2_K_XL are very similar; the larger gaps are to UD_IQ1_M and, again, from there to UD_IQ1_S.
- The two best dynamic quants are in the range of the usual Q2_K quant.
- IQ3_M is still a clear step up in quality from UD_Q2_K_XL.
There are also other short tests:
- https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/21#67af6a33a44a3738ba47e476
Thanks @TobDeBer. The distances between his quants look smaller than mine. He also used his own text file (tests with 3 chunks).
- reddit: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/?rdt=62843
There, too, it is mentioned that nan errors can happen; this seems to be a general "problem" with DeepSeek R1 and llama-perplexity.
Of course, all the results should be taken with a grain of salt; perplexity is only perplexity :)
But for me it was exciting as a first point of reference for accuracy compared to the usual quants.
Finally, thanks for cooking all the great ggufs @shimmyshimmer @danielhanchen @bartowski and all other chefs on HF!
Sorry to necro, but do you have the dataset file you used for testing? I have tried with raw wikitext on DeepSeek-V3-0324 and I just get nans.