Text Generation
Transformers
Safetensors
deepseek_v3
conversational
custom_code
text-generation-inference
compressed-tensors
Instructions to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True) messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True) messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g
- SGLang
How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with Docker Model Runner:
docker model run hf.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g
File size: 6,901 Bytes
ac66d87 76f1e4e ac66d87 1257709 aa91cc7 f16877c 72c64b3 f16877c 1257709 ac66d87 3c4dc4f 1257709 da7c4a6 dedec7d da7c4a6 ac66d87 dcd41a6 84870dc ac66d87 dcd41a6 84870dc ac66d87 84870dc ac66d87 728319b a5d9330 c707060 f043c47 c707060 a5d9330 7a94020 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 | ---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
## Model Overview
This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, reducing the disk size and GPU memory requirements by approximately 50%.
All layers within transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme, with group size 128. The GPTQ algorithm is applied for quantization.
Model checkpoint is saved in [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
| Models | Experts Quantized | Attention blocks quantized | Size (GB) |
| ------ | --------- | --------- | --------- |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 GB |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g) | ✅ | ✅ | 325 GB |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 GB |
### Evaluation
This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).
Model outputs were generated with the vLLM engine.
For reasoning tasks we estimate pass@1 based on 10 runs with different seeds and `temperature=0.6`, `top_p=0.95` and `max_new_tokens=32768`.
#### OpenLLM Leaderboard V1 tasks
| | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)** | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |
#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)
| | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)** | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |
## Reproduction
The results were obtained using the following commands:
`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"
lm_eval \
--model vllm \
--model_args $MODEL_ARGS \
--tasks openllm \
--batch_size auto
```
For reasoning evals we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).
`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```
Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038
## Performance benchmarking
We follow the standard vLLM performance benchmarking with ShareGPT dataset and observe the following metrics (lower is better):
| | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)** | 815.19 | 44.65 | 37.88 |
GPTQ models are faster across all metrics than AWQ models because GPTQ uses less bits-per-parameter than AWQ. More specifically, AWQ has to use smaller group-size of 64 (vs 128 in GPTQ) to preserve accuracy, and zero-points due to asymmetric quantization.
## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA). |