---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g-experts
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
## Model Overview
This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, cutting disk size and GPU memory requirements by approximately 50%.
Only the non-shared experts within the transformer blocks are compressed. Weights are quantized with the GPTQ algorithm, using a symmetric per-group scheme with group size 128.
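For intuition, a symmetric per-group scheme stores one floating-point scale per group of 128 consecutive weights and rounds each weight to a signed 4-bit integer. The NumPy sketch below illustrates round-to-nearest quantization under this scheme only; the GPTQ algorithm itself additionally compensates rounding errors using second-order (Hessian) information.

```python
import numpy as np

def quantize_symmetric_groupwise(w, group_size=128, bits=4):
    """Illustrative symmetric per-group quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1                         # 7 for INT4
    groups = w.reshape(-1, group_size)                 # one scale per group
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(4096).astype(np.float32)
q, scales = quantize_symmetric_groupwise(w)
w_hat = (q * scales).reshape(-1)                       # dequantized weights
print(f"max abs error: {np.abs(w - w_hat).max():.4f}")
```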
The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.
| Model | Experts Quantized | Attention Blocks Quantized | Size (GB) |
| ----- | :---------------: | :------------------------: | :-------: |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts) | ✅ | ❌ | 346 |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 |
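The checkpoint can be loaded and served like any other vLLM model; a minimal sketch (assuming an 8-GPU node, matching the evaluation setup below, and the vLLM build referenced in the Reproduction section):

```python
from vllm import LLM, SamplingParams

# Load the compressed checkpoint; vLLM dequantizes the INT4 expert weights on the fly.
llm = LLM(
    model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts",
    tensor_parallel_size=8,
    trust_remote_code=True,
)

# Sampling settings recommended for DeepSeek-R1-style reasoning models.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)
outputs = llm.chat(
    [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    params,
)
print(outputs[0].outputs[0].text)
```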
### Evaluation
This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).
Model outputs were generated with the vLLM engine.
For reasoning tasks, we estimate pass@1 from 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`.
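With one sample per problem per run, the pass@1 estimate reduces to the fraction of correct runs, averaged over problems; a minimal sketch with toy data (the real pipeline scores answers with the open-r1 harness referenced below):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 30 problems x 10 seeded runs; True if a run's final answer was correct.
correct = rng.random((30, 10)) < 0.78

# pass@1: per-problem fraction of correct runs, averaged over problems.
pass_at_1 = correct.mean(axis=1).mean()
print(f"pass@1 ~= {pass_at_1:.4f}")
```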
#### OpenLLM Leaderboard V1 tasks
| | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek-ai/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)**| 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |
#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)
| | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek-ai/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts <br> **(this model)** | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |
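In both tables, Recovery is the quantized model's average score relative to the unquantized baseline. A quick check against the reasoning-task numbers (assumed definition, but it reproduces the table):

```python
# Assumed: Recovery (%) = 100 * average score / baseline average score.
baseline = 82.99  # deepseek-ai/DeepSeek-R1, reasoning tasks
for name, avg in [
    ("DeepSeek-R1-AWQ", 78.25),
    ("DeepSeek-R1-GPTQ-4b-128g", 80.10),
    ("DeepSeek-R1-GPTQ-4b-128g-experts", 82.00),
]:
    print(f"{name}: {100 * avg / baseline:.2f}%")  # 94.29, 96.52, 98.81 -- matches the table
```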
## Reproduction
The results were obtained using the following commands:
`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale-experts
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"
lm_eval \
--model vllm \
--model_args $MODEL_ARGS \
--tasks openllm \
--batch_size auto
```
For reasoning evals, we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1). The command below is assumed to run from the root of that repository, since `--custom-tasks` points to a path relative to it.
`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale-experts
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"
export VLLM_WORKER_MULTIPROC_METHOD=spawn
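# OUTPUT_DIR: directory where lighteval should write results (set before running)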
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
--custom-tasks src/open_r1/evaluate.py \
--use-chat-template \
--output-dir $OUTPUT_DIR
```
Please use the vLLM build from this pull request: https://github.com/vllm-project/vllm/pull/16038
## Performance benchmarking
We follow the standard vLLM performance benchmarking protocol with the ShareGPT dataset and observe the following metrics (lower is better):
| | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts<br> **(this model)** | 1344.68 | 41.49 | 36.33 |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g | 815.19 | 44.65 | 37.88 |
The GPTQ models are faster across all metrics than the AWQ model because GPTQ stores fewer bits per parameter. Specifically, AWQ has to use a smaller group size of 64 (vs. 128 for GPTQ) to preserve accuracy, and it additionally stores zero-points due to its asymmetric quantization; both increase the effective bits per parameter.
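A back-of-the-envelope estimate of the effective bits per parameter (assuming fp16 scales and int4 zero-points stored per group; the exact on-disk layout may differ):

```python
# Effective bits per parameter: weight bits plus per-group metadata
# amortized over the group (assumed: fp16 scales, int4 zero-points).
def bits_per_param(weight_bits, group_size, scale_bits=16, zero_bits=0):
    return weight_bits + (scale_bits + zero_bits) / group_size

print(f"GPTQ, symmetric, group 128: {bits_per_param(4, 128):.4f}")               # 4.1250
print(f"AWQ, asymmetric, group 64:  {bits_per_param(4, 64, zero_bits=4):.4f}")   # 4.3125
```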
## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).