File size: 6,901 Bytes
ac66d87
 
 
 
 
 
 
 
 
 
 
76f1e4e
ac66d87
 
 
 
1257709
aa91cc7
f16877c
 
 
72c64b3
f16877c
1257709
 
ac66d87
 
 
 
3c4dc4f
1257709
da7c4a6
 
 
dedec7d
da7c4a6
 
 
 
 
 
 
 
 
 
 
 
 
ac66d87
 
 
 
 
 
 
dcd41a6
84870dc
ac66d87
 
 
 
 
 
 
 
 
 
 
 
dcd41a6
84870dc
ac66d87
84870dc
 
ac66d87
 
 
 
728319b
a5d9330
 
c707060
 
 
 
 
 
 
f043c47
c707060
 
 
a5d9330
7a94020
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

## Model Overview

This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, reducing the disk size and GPU memory requirements by approximately 50%.

All layers within transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme, with group size 128. The GPTQ algorithm is applied for quantization.

Model checkpoint is saved in [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.

| Models | Experts Quantized | Attention blocks quantized | Size (GB) |
| ------ |  --------- | --------- | --------- |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌  | 671 GB |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g) | ✅  | ✅  | 325 GB |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅  | ✅  | 340 GB |

### Evaluation

This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500). 

Model outputs were generated with the vLLM engine.

For reasoning tasks we estimate pass@1 based on 10 runs with different seeds and `temperature=0.6`, `top_p=0.95` and `max_new_tokens=32768`.

#### OpenLLM Leaderboard V1 tasks 

|                                              | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek/DeepSeek-R1                         | 100.00       | 81.04         | 72.53                              | 95.91                        | 89.30                          | 87.22               | 59.28                     | 82.00                     |
| cognitivecomputations/DeepSeek-R1-AWQ        | 100.07       | 81.10         | 73.12                              | 95.15                        | 89.07                          | 86.86               | 60.09                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**         | 99.86        | 80.93         | 72.70                              | 95.68                        | 89.25                          | 86.83               | 58.77                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30       | 81.28         | 72.53                              | 95.68                        | 89.36                          | 86.99               | 59.77                     | 83.35                     |

#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

|                                              | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek/DeepSeek-R1                         | 100.00       | 82.99         | 78.33               | 97.24              | 73.38                  |
| cognitivecomputations/DeepSeek-R1-AWQ        | 94.29        | 78.25         | 70.67               | 93.64              | 70.46                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**        | 96.52        | 80.10         | 72.96               | 97.09              | 70.26                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81        | 82.00         | 77.00               | 97.08              | 71.92                  |

## Reproduction

The results were obtained using the following commands:

`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

For reasoning evals we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).

`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038

## Performance benchmarking
We follow the standard vLLM performance benchmarking with ShareGPT dataset and observe the following metrics (lower is better):

|                                              | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ        | 1585.45                                 | 55.41                                     | 43.06                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68                                 | 41.49                                     | 36.33                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**        | 815.19                                  | 44.65                                     | 37.88                                  |

GPTQ models are faster across all metrics than AWQ models because GPTQ uses less bits-per-parameter than AWQ. More specifically, AWQ has to use smaller group-size of 64 (vs 128 in GPTQ) to preserve accuracy, and zero-points due to asymmetric quantization. 

## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).