skips the thinking process
I am facing an issue with the DeepSeek-R1-AWQ model deployed using vLLM. In streaming mode, the model consistently skips the thinking process and outputs only "<think>\n\n</think>" instead of generating meaningful responses.
Has anyone else encountered this behavior? Any suggestions on how to resolve this?
Which vLLM version are you using, what's your startup command, and what are the GPUs that you're using?
Thanks for your help! 😊
vLLM Version: 0.7.2
Startup Command: python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 32768 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-reasoner --model cognitivecomputations/DeepSeek-R1-AWQ --enable-reasoning --reasoning-parser deepseek_r1
GPU Configuration: 8 * A800
The --enable-reasoning --reasoning-parser deepseek_r1 flags make the streaming output format slightly different. If you don't want to add special support for this on the client side, simply remove these 2 flags and it will work.
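For anyone who prefers to keep the two flags, the "special support" can be quite small. The sketch below assumes that, with the reasoning parser enabled, each streamed delta carries the thinking tokens in a reasoning_content field and the final answer in content; that shape is my reading of vLLM's reasoning-output behavior, so check it against your version. The split_stream helper and the example deltas are hypothetical, not taken from this thread.

```python
from typing import Iterable, Tuple


def split_stream(deltas: Iterable[dict]) -> Tuple[str, str]:
    """Accumulate reasoning text and answer text from streamed delta dicts."""
    reasoning_parts = []
    answer_parts = []
    for delta in deltas:
        # Assumption: thinking tokens arrive under "reasoning_content"
        # while "content" is None, and the reverse for the final answer.
        reasoning = delta.get("reasoning_content")
        content = delta.get("content")
        if reasoning:
            reasoning_parts.append(reasoning)
        if content:
            answer_parts.append(content)
    return "".join(reasoning_parts), "".join(answer_parts)


# Hypothetical deltas, shaped like choices[0].delta of each streamed chunk.
example = [
    {"role": "assistant", "content": None, "reasoning_content": "The user asks "},
    {"content": None, "reasoning_content": "for the capital."},
    {"content": "Paris", "reasoning_content": None},
    {"content": ".", "reasoning_content": None},
]

reasoning, answer = split_stream(example)
print("reasoning:", reasoning)
print("answer:", answer)
```

A client that only reads content will see nothing while the model is thinking, which can look like an empty or skipped reasoning phase.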
Thanks, I'll try it out.
Does A100 support "--kv-cache-dtype fp8_e5m2"?
Has this been solved?
Yeah.
According to the official DeepSeek documentation:
Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "<think>\n" at the beginning of every output.
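One way to apply that recommendation is to append the opening think tag and a newline to the already rendered prompt before sending it to a raw completions endpoint. This is only a minimal sketch: the force_thinking helper is hypothetical, and the example prompt string is a placeholder, since the real text comes from the tokenizer's chat template.

```python
THINK_PREFIX = "<think>\n"


def force_thinking(rendered_prompt: str) -> str:
    """Return the prompt with the response forced to begin with '<think>' and a newline."""
    if rendered_prompt.endswith(THINK_PREFIX):
        # Some chat templates already add the prefix; do not add it twice.
        return rendered_prompt
    if rendered_prompt.endswith("<think>"):
        # Tag is present but the newline is missing.
        return rendered_prompt + "\n"
    return rendered_prompt + THINK_PREFIX


# Placeholder prompt; in practice this is the output of
# tokenizer.apply_chat_template(..., tokenize=False, add_generation_prompt=True).
prompt = "<|User|>Who are you?<|Assistant|>"
forced = force_thinking(prompt)
print(repr(forced))
```

Since the tag is then part of the prompt, the generated text starts inside the thinking block and will contain only the closing tag.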
The frequency of triggering thinking is now normal.
However, there are still some issues, as the model's outputs often turn into gibberish.
Reduce temperature and top p.
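As a rough illustration, a request body with lower sampling values could be built like this. The reply above does not give numbers; temperature 0.6 and top_p 0.95 are assumptions taken from DeepSeek's published usage recommendations, and build_request is a hypothetical helper. The model name matches the --served-model-name from the startup command earlier in the thread.

```python
import json


def build_request(prompt: str, temperature: float = 0.6, top_p: float = 0.95) -> dict:
    """Build an OpenAI-compatible chat completion body with conservative sampling."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    return {
        "model": "deepseek-reasoner",  # from --served-model-name in this thread
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # assumed value, not from the reply above
        "top_p": top_p,  # assumed value, not from the reply above
        "stream": True,
    }


body = build_request("What is the capital of France?")
print(json.dumps(body, indent=2))
```

The resulting JSON can be posted to the server's /v1/chat/completions endpoint.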
Closed as the main issue is solved.
@traphix Yes, but it would be slower than H100.
Hello, I encountered this issue: if I don't add --kv-cache-dtype fp8_e5m2, I need to reduce max-model-len to 8192 to avoid OOM (out of memory) errors when deploying on 8x H20 GPUs. Theoretically, it shouldn't be like this, right?
@ShiningMaker Try the latest dev version by building from source. It contains MLA for AWQ, which massively reduces VRAM usage.
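A back-of-the-envelope estimate shows why MLA matters so much for the KV cache. The sketch below compares caching full per-head keys and values against caching only the compressed MLA latent plus the RoPE part of the key. The dimensions are assumptions taken from the published DeepSeek-V3/R1 config, and the real vLLM memory layout may differ (padding, block allocation, tensor-parallel sharding), so treat the numbers as an order-of-magnitude guide only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    # Assumed values from the published DeepSeek-V3/R1 config.
    num_layers: int = 61
    num_heads: int = 128
    qk_nope_head_dim: int = 128
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128
    kv_lora_rank: int = 512


def naive_elems_per_token(cfg: AttentionConfig) -> int:
    """Cache elements per token when full keys and values are stored per head."""
    k = cfg.num_heads * (cfg.qk_nope_head_dim + cfg.qk_rope_head_dim)
    v = cfg.num_heads * cfg.v_head_dim
    return cfg.num_layers * (k + v)


def mla_elems_per_token(cfg: AttentionConfig) -> int:
    """Cache elements per token when only the MLA latent and RoPE key are stored."""
    return cfg.num_layers * (cfg.kv_lora_rank + cfg.qk_rope_head_dim)


def cache_gib(elems_per_token: int, tokens: int, bytes_per_elem: int) -> float:
    """Total cache size in GiB for one sequence of the given length."""
    return elems_per_token * tokens * bytes_per_elem / 2**30


cfg = AttentionConfig()
naive = naive_elems_per_token(cfg)
mla = mla_elems_per_token(cfg)
for tokens in (8192, 32768):
    print(
        f"{tokens} tokens, 16-bit cache: "
        f"naive {cache_gib(naive, tokens, 2):.2f} GiB, "
        f"MLA {cache_gib(mla, tokens, 2):.2f} GiB"
    )
```

Under these assumptions the non-MLA cache for a single long sequence is dozens of times larger than the MLA one, which fits with needing either an fp8 KV cache or a shorter max-model-len on builds without MLA.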