---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g-experts
## Model Overview
This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, cutting disk size and GPU memory requirements by approximately 50%.
Only the non-shared experts within the transformer blocks are compressed. Weights are quantized with the GPTQ algorithm, using a symmetric per-group scheme with group size 128.
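For illustration, here is a minimal NumPy sketch of symmetric per-group quantization. It uses simple round-to-nearest with absmax scales for clarity; GPTQ itself additionally compensates rounding error column-by-column using second-order statistics of the layer inputs, and MSE-tuned scales may be used in practice.

```python
import numpy as np

def quantize_symmetric_per_group(w, group_size=128, bits=4):
    # One floating-point scale per group of `group_size` weights;
    # absmax scales shown here for simplicity (not the GPTQ algorithm).
    qmax = 2 ** (bits - 1) - 1                        # 7 for INT4
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape)               # dequantized approximation

w = np.random.randn(1024, 1024).astype(np.float32)
err = ((w - quantize_symmetric_per_group(w)) ** 2).mean()
print(f"quantization MSE: {err:.6f}")
```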
The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
| Model | Experts quantized | Attention blocks quantized | Size (GB) |
| ------ | :---------: | :---------: | :---------: |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts) | ✅ | ❌ | 346 |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 |
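vLLM can load checkpoints in the compressed_tensors format directly; a minimal sketch, assuming a node with 8 sufficiently large GPUs (see the vLLM note in the Reproduction section below):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts",
    tensor_parallel_size=8,     # ~346 GB of weights, so shard across GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
out = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(out[0].outputs[0].text)
```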
### Evaluation
This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).
Model outputs were generated with the vLLM engine.
For reasoning tasks, we estimate pass@1 from 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`.
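Concretely, with one sampled answer per problem per seed, we take each run's accuracy and average it over the 10 runs; a minimal sketch of this aggregation:

```python
def pass_at_1(correct_per_run):
    # correct_per_run[s][i] = 1 if the seed-s sample solved problem i, else 0
    run_acc = [sum(run) / len(run) for run in correct_per_run]
    return 100.0 * sum(run_acc) / len(run_acc)    # mean accuracy over seeds

print(pass_at_1([[1, 0, 1, 1], [1, 1, 0, 1]]))    # toy example -> 75.0
```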
#### OpenLLM Leaderboard V1 tasks
| | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek-ai/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)**| 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |
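Recovery is simply the quantized model's average score relative to the BF16 baseline, e.g. for this model:

```python
baseline, quantized = 81.04, 81.28   # average OpenLLM v1 scores from the table
print(f"recovery = {100 * quantized / baseline:.2f}%")   # -> recovery = 100.30%
```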
#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)
| | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek-ai/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)** | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |
## Reproduction
The results were obtained using the following commands:
`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale-experts
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"
lm_eval \
--model vllm \
--model_args $MODEL_ARGS \
--tasks openllm \
--batch_size auto
```
For the reasoning evals, we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).
`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale-experts
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
OUTPUT_DIR=evals/$MODEL   # choose any output directory
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```
Please use the vLLM build from this pull request: https://github.com/vllm-project/vllm/pull/16038.
## Performance benchmarking
We follow the standard vLLM performance benchmarking protocol with the ShareGPT dataset and observe the following metrics (lower is better):
| | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts<br> **(this model)** | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |
The GPTQ models are faster than the AWQ model across all metrics because they use fewer effective bits per parameter. Specifically, to preserve accuracy the AWQ checkpoint uses a smaller group size of 64 (vs. 128 for GPTQ), and it must store per-group zero-points because its quantization scheme is asymmetric; both increase the memory traffic per weight.
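A back-of-the-envelope estimate of effective bits per parameter, assuming FP16 scales and (for the asymmetric scheme) INT4 zero-points; the actual packed layouts may differ slightly:

```python
def effective_bits(weight_bits=4, group_size=128, scale_bits=16, zp_bits=0):
    # Per-group metadata (scale, plus zero-point if asymmetric) is
    # amortized over all weights in the group.
    return weight_bits + (scale_bits + zp_bits) / group_size

print(effective_bits(group_size=128))              # GPTQ, symmetric : 4.125
print(effective_bits(group_size=64, zp_bits=4))    # AWQ, asymmetric : 4.3125
```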
## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA). |