Instructions for using unsloth/DeepSeek-R1-GGUF with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use unsloth/DeepSeek-R1-GGUF with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
```
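Note that this repo ships GGUF files rather than safetensors, so the generic snippet above may not resolve weights directly. Transformers can also load a GGUF checkpoint by dequantizing it via the `gguf_file` argument; a minimal sketch with a placeholder quant path (pick a real file from the repo, and note that dequantizing a model this large needs enormous RAM):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "unsloth/DeepSeek-R1-GGUF"
# Placeholder path: substitute an actual .gguf file from the repository.
gguf_file = "DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf"

# Transformers dequantizes the GGUF weights on load.
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
```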
- llama-cpp-python
How to use unsloth/DeepSeek-R1-GGUF with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    filename="DeepSeek-R1-BF16/DeepSeek-R1.BF16-00001-of-00030.gguf",
)
```
```python
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```
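`create_chat_completion` can also stream tokens as they are generated; a minimal sketch reusing the `llm` object loaded above (`stream=True` yields OpenAI-style delta chunks):

```python
# Stream the reply token by token instead of waiting for the full completion.
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```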
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/DeepSeek-R1-GGUF with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Use Docker
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
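However you installed it, once llama-server is running, any OpenAI-compatible client can talk to it; a minimal sketch using the `requests` library, assuming the server's default port 8080:

```python
import requests

# llama-server listens on port 8080 by default (override with --port).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```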
- LM Studio
- Jan
- vLLM
How to use unsloth/DeepSeek-R1-GGUF with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/DeepSeek-R1-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
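With the server from the pip step running (vLLM's default port is 8000), the official `openai` Python client works as well, since the API is OpenAI-compatible; a minimal sketch:

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API; the api_key value is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="unsloth/DeepSeek-R1-GGUF",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)
```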
- SGLang
How to use unsloth/DeepSeek-R1-GGUF with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "unsloth/DeepSeek-R1-GGUF" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "unsloth/DeepSeek-R1-GGUF" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Ollama
How to use unsloth/DeepSeek-R1-GGUF with Ollama:
```sh
ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
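The same model can also be driven from Python via the `ollama` package; a minimal sketch, assuming the Ollama server is running locally and the model has been pulled with the command above:

```python
import ollama  # pip install ollama

# Uses the same model reference as the `ollama run` command above.
response = ollama.chat(
    model="hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response["message"]["content"])
```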
- Unsloth Studio
How to use unsloth/DeepSeek-R1-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
Install Unsloth Studio (Windows)
```sh
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
Use Hugging Face Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
```
- Docker Model Runner
How to use unsloth/DeepSeek-R1-GGUF with Docker Model Runner:
```sh
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
- Lemonade
How to use unsloth/DeepSeek-R1-GGUF with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/DeepSeek-R1-GGUF:Q4_K_M
```
Run and chat with the model
```sh
lemonade run user.DeepSeek-R1-GGUF-Q4_K_M
```
List all available models
```sh
lemonade list
```
Perplexity comparison results (Updated)
I asked myself how the dynamic quants compare in accuracy to the usual quants.
The question of benchmarks has also come up here repeatedly.
The only metric that was feasible on my limited system was perplexity (which requires only one run per quant).
Settings: -c 1024 -b 1024 (and, for the four dynamic quants, KV cache type q4_0).
The tests are based on a custom text file to limit the number of chunks (besides, wiki.test produced nan errors in very early chunks).
In some tests llama-perplexity kept hitting nan errors, so it could not produce a final PPL (llama-perplexity uses its own calculation, not the simple average of all chunk values). Nevertheless, at least 16 of the 40 chunks always completed. The first chunks are volatile, but the same holds for wiki.test; that's why I think it's good to require a certain minimum number of chunks.
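For concreteness, each run amounts to roughly the following llama-perplexity invocation, driven here from Python (a sketch: the model and text-file names are placeholders, and -ctk q4_0 was used only for the dynamic quants):

```python
import subprocess

subprocess.run([
    "./llama-perplexity",
    "-m", "DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder: the quant under test
    "-f", "custom.txt",                 # placeholder: the custom text file
    "-c", "1024",
    "-b", "1024",
    "-ctk", "q4_0",                     # KV cache type, for the dynamic quants
], check=True)
```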
A mixture of different GGUFs was tested, including all four dynamic quants from unsloth and some others. The reference point is Q5_K_M (higher was not possible on my system). Bartowski mentioned that they come from the same source model, so they can probably be compared on the same basis, and I threw the unsloth and bartowski quants together.
The graph is based on all chunks that worked (at least 16 of 40).
The delta% is based on the average value of the first 16 chunks (which all quants completed).
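In code, the delta% works out to roughly the following (a sketch: the per-chunk PPL lists are assumed to have been collected from the llama-perplexity output, with the Q5_K_M chunks as the reference):

```python
import math

def delta_pct(ref_chunks, test_chunks, n=16):
    """Percent PPL difference vs. the reference, averaged over the first n
    chunks that both runs completed (nan chunks are skipped pairwise)."""
    pairs = [(r, t) for r, t in zip(ref_chunks, test_chunks)
             if not (math.isnan(r) or math.isnan(t))][:n]
    ref_mean = sum(r for r, _ in pairs) / len(pairs)
    test_mean = sum(t for _, t in pairs) / len(pairs)
    return 100.0 * (test_mean - ref_mean) / ref_mean

# Example (hypothetical chunk lists):
# delta_pct(q5_k_m_chunks, ud_q2_k_xl_chunks)
```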
Graphically, the quant results broadly cluster into four areas, and within each area the quants are close to each other:
- UD_IQ1_S(unsloth)
- UD_IQ1_M(unsloth)
- UD_IQ2_XXS(unsloth), Q2_K(bartowski), UD_Q2_K_XL(unsloth)
- IQ3_M(bartowski), IQ4_XS(bartowski), Q4_K_S(bartowski), Q5_K_M(unsloth)
Conclusions for me with regard to the dynamic quants:
- UD_IQ2_XXS and UD_Q2_K_XL are very similar; the larger gaps are to UD_IQ1_M and, again, from there to UD_IQ1_S.
- The two best dynamic quants are in the range of the usual Q2_K quant.
- IQ3_M is still a clear step up in quality from UD_Q2_K_XL.
There are also other short tests:
- https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/21#67af6a33a44a3738ba47e476
Thanks @TobDeBer. The distances between his quants look smaller than mine. He also used his own text file (tests with 3 chunks).
- reddit: https://www.reddit.com/r/LocalLLaMA/comments/1idi5cr/i_did_a_very_short_perplexity_test_with_deepseek/?rdt=62843
There, too, it is mentioned that nan errors can happen; this seems to be a general "problem" with DeepSeek R1 and llama-perplexity.
Of course, all the results should be taken with a grain of salt; perplexity is only perplexity :)
But for me it was exciting as a first point of reference for accuracy compared to the usual quants.
Finally, thanks for cooking all the great ggufs @shimmyshimmer @danielhanchen @bartowski and all other chefs on HF!
Sorry to necro, but do you have the dataset file you used for testing? I have tried with raw wikitext on DeepSeek-V3-0324 and I just get nans.