---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g-experts

## Model Overview

This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, cutting disk size and GPU memory requirements by approximately 50%.

Only the non-shared experts within the transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme with group size 128, applying the GPTQ algorithm.
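
As an illustration, here is a minimal sketch of the symmetric per-group rounding step (group size 128). It covers only the rounding-to-grid part; GPTQ additionally propagates each column's rounding error into the not-yet-quantized columns, which is omitted here. All names are illustrative.

```python
import torch

def fake_quantize_symmetric(w: torch.Tensor, group_size: int = 128, bits: int = 4) -> torch.Tensor:
    """Symmetric per-group fake-quantization of a 2D weight matrix.

    Each contiguous group of `group_size` weights shares a single scale,
    and there are no zero-points (the grid is symmetric around zero).
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = groups.abs().amax(dim=-1, keepdim=True) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax)
    return (q * scales).reshape(out_features, in_features)

w = torch.randn(256, 512)
print((w - fake_quantize_symmetric(w)).abs().mean())  # mean quantization error
```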

The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
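
As a usage sketch (assuming a vLLM build with compressed-tensors support, e.g. the pull request linked in the Reproduction section below, and 8 GPUs as in our evaluations), the checkpoint loads like any other Hugging Face model:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts",
    tensor_parallel_size=8,
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)
```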

| Model | Experts quantized | Attention blocks quantized | Size (GB) |
| ------ | :-------: | :-------: | :-------: |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts) | ✅ | ❌ | 346 |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 |

### Evaluation

This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500). 

Model outputs were generated with the vLLM engine.

For reasoning tasks, we estimate pass@1 based on 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`.
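
With `k=1` the standard unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, reduces to the fraction of correct completions, so pass@1 here is simply the mean accuracy over the 10 seeded runs. A hypothetical sketch:

```python
from statistics import mean

def pass_at_1(correct_flags):
    # pass@1 over n samples: the unbiased estimator 1 - C(n-c, 1)/C(n, 1)
    # simplifies to c/n, i.e., the mean of the per-run correctness flags.
    return mean(correct_flags)

print(pass_at_1([1, 1, 1, 0, 1, 1, 1, 0, 1, 1]))  # 0.8
```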

#### OpenLLM Leaderboard V1 tasks 

|                                              | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek-ai/DeepSeek-R1                      | 100.00       | 81.04         | 72.53                              | 95.91                        | 89.30                          | 87.22               | 59.28                     | 82.00                     |
| cognitivecomputations/DeepSeek-R1-AWQ        | 100.07       | 81.10         | 73.12                              | 95.15                        | 89.07                          | 86.86               | 60.09                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g          | 99.86        | 80.93         | 72.70                              | 95.68                        | 89.25                          | 86.83               | 58.77                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)**| 100.30       | 81.28         | 72.53                              | 95.68                        | 89.36                          | 86.99               | 59.77                     | 83.35                     |

#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

|                                              | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek-ai/DeepSeek-R1                      | 100.00       | 82.99         | 78.33               | 97.24              | 73.38                  |
| cognitivecomputations/DeepSeek-R1-AWQ        | 94.29        | 78.25         | 70.67               | 93.64              | 70.46                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g         | 96.52        | 80.10         | 72.96               | 97.09              | 70.26                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)** | 98.81        | 82.00         | 77.00               | 97.08              | 71.92                  |

## Reproduction

The results were obtained using the following commands:

`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

For reasoning evals we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).

`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```
Please use the vLLM build from this pull request: https://github.com/vllm-project/vllm/pull/16038

## Performance benchmarking

We follow the standard vLLM performance benchmarking protocol with the ShareGPT dataset and observe the following metrics (lower is better):

|                                              | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ        | 1585.45                                 | 55.41                                     | 43.06                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts<br> **(this model)** | 1344.68                                 | 41.49                                     | 36.33                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g         | 815.19                                  | 44.65                                     | 37.88                                  |

The GPTQ models are faster than the AWQ model across all metrics because GPTQ uses fewer bits per parameter. More specifically, AWQ has to use a smaller group size of 64 (vs. 128 in GPTQ) to preserve accuracy, and it stores zero-points due to asymmetric quantization.
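
As a rough back-of-the-envelope comparison (assuming FP16 scales and INT4 zero-points; the exact packing in each kernel may differ):

```python
def effective_bits(bits=4, group_size=128, scale_bits=16, zero_bits=0):
    # each group of `group_size` weights shares one scale
    # (plus one zero-point for asymmetric schemes)
    return bits + (scale_bits + zero_bits) / group_size

print(effective_bits(group_size=128))              # GPTQ: 4.125 bits/param
print(effective_bits(group_size=64, zero_bits=4))  # AWQ:  ~4.31 bits/param
```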


## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).