SpiridonSunRotator committed on
Commit ac66d87 · verified · 1 Parent(s): 56364b2

Added eval reproduction and description

Files changed (1)
  1. README.md +51 -1
README.md CHANGED
@@ -1,7 +1,27 @@

  ### Evaluation

- This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME24, GPQA-Diamond, MATH500) . Model outputs were generated with the vLLM engine.

  `OpenLLM v1`
  | Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery |
@@ -16,3 +36,33 @@ This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME2
  | deepseek-ai/DeepSeek-R1 | 78.34 | 97.24 | 73.383 | 82.99 | 100.00 |
  | cognitivecomputations/DeepSeek-R1-AWQ | 70.67 | 93.64 | 70.456 | 78.25 | 94.29 |
  | daslab-testing/DeepSeek-R1-GPTQ-4b-128g (this) | 72.96 | 97.08 | 70.26 | 80.10 | 96.52 |
 
+ ---
+ license: mit
+ library_name: transformers
+ ---
+ # DeepSeek-R1-GPTQ-4b-128g
+ <!-- markdownlint-disable first-line-h1 -->
+ <!-- markdownlint-disable html -->
+ <!-- markdownlint-disable no-duplicate-header -->
+
+ ## Model Overview
+
+ This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
+
+ All layers within transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme with a group size of 128. The GPTQ algorithm is applied for quantization.
+
+ The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
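As a rough illustration of the symmetric per-group scheme described in the overview, the Python sketch below quantizes a flat list of weights in groups of 128 with plain round-to-nearest. This is not GPTQ, which additionally compensates for rounding error while quantizing. The clipping range `[-8, 7]` with `scale = max|w| / 7` is an assumed convention that may differ from the one used for this checkpoint. All function names and the 16-bit scale size are hypothetical, introduced only to estimate the storage overhead of the per-group scales.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Assumed convention: integer range [-2^(bits-1), 2^(bits-1) - 1] and
    scale = max|w| / (2^(bits-1) - 1); no zero point, since the scheme is symmetric.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [q * scale for q in quantized]


def quantize_per_group(weights, group_size=128, bits=4):
    """Split a flat weight list into consecutive groups and quantize each one."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def effective_bits(bits=4, group_size=128, scale_bits=16):
    """Bits per weight including one scale per group (scale_bits is an assumption)."""
    return bits + scale_bits / group_size


if __name__ == "__main__":
    codes, scale = quantize_group([1.75, -0.5, 0.0, 0.25])
    print(codes, scale)
    print(effective_bits())  # 4.125
```

Under these assumptions each quantized weight costs 4.125 bits once the per-group scale is counted, so the saving relative to 16 bits is a little under 75%, which is consistent with the "approximately 75%" figure above.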

  ### Evaluation

+ This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).
+
+ Model outputs were generated with the vLLM engine.
+
+ For reasoning tasks we sample 10 solutions for each seed with `temperature=0.6`, `top_p=0.95` and `max_new_tokens=32768`.

  `OpenLLM v1`
  | Model | ArcC | GSM8k | Hellaswag | MMLU | TruthfulQA-mc2 | Winogrande | Average | Recovery |
 
  | deepseek-ai/DeepSeek-R1 | 78.34 | 97.24 | 73.383 | 82.99 | 100.00 |
  | cognitivecomputations/DeepSeek-R1-AWQ | 70.67 | 93.64 | 70.456 | 78.25 | 94.29 |
  | daslab-testing/DeepSeek-R1-GPTQ-4b-128g (this) | 72.96 | 97.08 | 70.26 | 80.10 | 96.52 |
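The last two columns of the rows above appear to be the mean of the three task scores and that mean expressed as a percentage of the unquantized model's mean. The sketch below recomputes both for the baseline and GPTQ rows under that assumption. The recovery formula is inferred from the reported numbers rather than stated in the README, and the function names are hypothetical.

```python
def average(scores):
    """Arithmetic mean of the per-task scores in one table row."""
    return sum(scores) / len(scores)


def recovery(quantized_avg, baseline_avg):
    """Quantized average as a percentage of the baseline average (assumed definition)."""
    return 100.0 * quantized_avg / baseline_avg


# Per-task scores copied from the table rows above.
BASELINE = [78.34, 97.24, 73.383]   # deepseek-ai/DeepSeek-R1
GPTQ = [72.96, 97.08, 70.26]        # DeepSeek-R1-GPTQ-4b-128g (this model)

if __name__ == "__main__":
    baseline_avg = average(BASELINE)
    gptq_avg = average(GPTQ)
    print(f"baseline average: {baseline_avg:.2f}")
    print(f"GPTQ average:     {gptq_avg:.2f}")
    print(f"GPTQ recovery:    {recovery(gptq_avg, baseline_avg):.2f}")
```

Recomputing from the rounded table entries may differ from the reported values in the last digit, since the reported averages were presumably computed from unrounded scores.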
+
+ ## Reproduction
+
+ The results were obtained using the following commands:
+
+ `OpenLLM v1`
+ ```bash
+ MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g
+ MODEL_ARGS="pretrained=$MODEL,max_model_len=4096,tensor_parallel_size=8,dtype=auto,gpu_memory_utilization=0.80"
+
+ lm_eval \
+ --model vllm \
+ --model_args $MODEL_ARGS \
+ --tasks openllm \
+ --batch_size auto
+ ```
+
+ For the reasoning evaluations, we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).
+
+ `Reasoning tasks`
+ ```bash
+ MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g
+ MODEL_ARGS="pretrained=$MODEL,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=1,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"
+
+ TASK=aime24  # one of: aime24, math_500, gpqa:diamond
+ lighteval vllm "$MODEL_ARGS" "custom|$TASK|0|0" \
+ --custom-tasks src/open_r1/evaluate.py \
+ --use-chat-template \
+ --output-dir "$OUTPUT_DIR"
+ ```
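Since the reasoning command is run once per task, a small helper can assemble the three invocations. The Python sketch below only builds and prints the command lines from the values given in the README and does not launch anything. The helper names and the `results` output directory are hypothetical.

```python
import json
import shlex

MODEL = "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g"
TASKS = ["aime24", "math_500", "gpqa:diamond"]

# Sampling settings copied from the README command.
GENERATION_PARAMETERS = {
    "max_new_tokens": 32768,
    "temperature": 0.6,
    "top_p": 0.95,
    "seed": 7686,
}


def build_model_args(model):
    """Assemble the comma-separated model argument string used in the README."""
    generation = json.dumps(GENERATION_PARAMETERS, separators=(",", ":"))
    return ",".join([
        f"pretrained={model}",
        "max_model_length=38768",
        "gpu_memory_utilization=0.8",
        "tensor_parallel_size=1",
        "add_special_tokens=false",
        f"generation_parameters={generation}",
    ])


def build_command(task, output_dir):
    """Return the lighteval invocation for one task as an argument list."""
    return [
        "lighteval", "vllm", build_model_args(MODEL), f"custom|{task}|0|0",
        "--custom-tasks", "src/open_r1/evaluate.py",
        "--use-chat-template",
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    for task in TASKS:
        # "results" is a placeholder output directory, not taken from the README.
        print(shlex.join(build_command(task, "results")))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without GPUs; the printed lines can then be pasted into a shell inside an open-r1 checkout.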