---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---
# DeepSeek-R1-Distill-Qwen-32B-NVFP4
## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
- **Output:** Text
- **Model Optimizations:**
- **Weight quantization:** FP4
- **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI
This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
### Model Optimizations
This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
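As a rough illustration of the footprint savings (a back-of-the-envelope sketch; real checkpoints also store block scale factors and keep some layers, such as embeddings, in higher precision):
```python
# Approximate weight-storage math for a 32B-parameter model.
params = 32e9

bf16_gib = params * 2 / 1024**3    # 16-bit weights: 2 bytes per parameter
fp4_gib = params * 0.5 / 1024**3   # 4-bit weights: 0.5 bytes per parameter

print(f"BF16 weights: ~{bf16_gib:.0f} GiB")   # ~60 GiB
print(f"FP4 weights:  ~{fp4_gib:.0f} GiB")    # ~15 GiB, i.e. ~75% smaller
```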
## Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
<details>
<summary>Model Usage Code</summary>
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string with the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs via tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
</details>
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
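For example, the model can be served with `vllm serve` and queried with the official `openai` Python client. The sketch below assumes a local server on the default port; adjust the launch flags for your setup.
```python
# Launch the server first (shell):
#   vllm serve RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 2
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default;
# the API key is unused locally but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```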
## Creation
This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below. The snippet is a representative sketch of the standard NVFP4 oneshot recipe; the exact calibration hyperparameters are assumptions.
<details>
<summary>Model Creation Code</summary>
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data (sample count and sequence length are assumptions).
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192
ds = load_dataset("neuralmagic/calibration", "LLM", split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess(example):
    # Assumes the dataset stores conversations in a "messages" column.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

def tokenize(sample):
    return tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)

ds = ds.map(preprocess)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# Quantize the weights and activations of all Linear layers to NVFP4,
# keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed checkpoint.
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-32B-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
</details>
## Evaluation
This model was evaluated on the well-known OpenLLM v1 and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness); the reasoning benchmarks were evaluated using [lighteval](https://github.com/neuralmagic/lighteval). Recovery is reported as the quantized model's score as a percentage of the unquantized baseline.
### Accuracy
<table>
<thead>
<tr>
<th>Category</th>
<th>Metric</th>
<th>DeepSeek-R1-Distill-Qwen-32B</th>
<th>DeepSeek-R1-Distill-Qwen-32B NVFP4</th>
<th>Recovery</th>
</tr>
</thead>
<tbody>
<!-- OpenLLM V1 -->
<tr>
<td rowspan="7"><b>OpenLLM V1</b></td>
<td>arc_challenge</td>
<td>63.48</td>
<td>62.12</td>
<td>97.86</td>
</tr>
<tr>
<td>gsm8k</td>
<td>86.88</td>
<td>88.32</td>
<td>101.66</td>
</tr>
<tr>
<td>hellaswag</td>
<td>83.51</td>
<td>82.38</td>
<td>98.65</td>
</tr>
<tr>
<td>mmlu</td>
<td>80.97</td>
<td>80.42</td>
<td>99.32</td>
</tr>
<tr>
<td>truthfulqa_mc2</td>
<td>56.82</td>
<td>55.75</td>
<td>98.12</td>
</tr>
<tr>
<td>winogrande</td>
<td>75.93</td>
<td>75.14</td>
<td>98.96</td>
</tr>
<tr>
<td><b>Average</b></td>
<td><b>74.60</b></td>
<td><b>74.02</b></td>
<td><b>99.23</b></td>
</tr>
<!-- Reasoning -->
<tr>
<td rowspan="4"><b>Reasoning</b></td>
<td>AIME24 (0-shot)</td>
<td>72.41</td>
<td>62.07</td>
<td>85.69</td>
</tr>
<tr>
<td>AIME25 (0-shot)</td>
<td>58.62</td>
<td>62.07</td>
<td>105.89</td>
</tr>
<tr>
<td>GPQA (Diamond, 0-shot)</td>
<td>68.02</td>
<td>65.48</td>
<td>96.27</td>
</tr>
<tr>
<td><b>Average</b></td>
<td><b>66.35</b></td>
<td><b>63.21</b></td>
<td><b>95.95</b></td>
</tr>
<!-- Coding -->
<tr>
<td rowspan="2"><b>Coding</b></td>
<td>HumanEval_64 pass@2</td>
<td>90.00</td>
<td>89.32</td>
<td>99.24</td>
</tr>
</tbody>
</table>
### Reproduction
The results were obtained using the following commands:
<details>
<summary>Model Evaluation Commands</summary>
#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```
#### HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```
#### LightEval
```
# --- model_args.yaml ---
cat > model_args.yaml <<'YAML'
model_parameters:
  model_name: "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
  dtype: auto
  gpu_memory_utilization: 0.9
  tensor_parallel_size: 2
  max_model_length: 40960
  generation_parameters:
    seed: 42
    temperature: 0.6
    top_k: 50
    top_p: 0.95
    min_p: 0.0
    max_new_tokens: 32768
YAML
lighteval vllm model_args.yaml \
"lighteval|aime24|0,lighteval|aime25|0,lighteval|gpqa:diamond|0" \
--max-samples -1 \
--output-dir out_dir
```
</details>