---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---
# DeepSeek-R1-Distill-Qwen-32B-NVFP4
## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B).
It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
### Model Optimizations
This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1.
This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
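As a quick back-of-the-envelope check on that figure, the sketch below compares the packed weight footprint at 16 vs. 4 bits per parameter (illustrative only; real checkpoints also carry embeddings, norms, and quantization scales that are not packed at 4 bits):
```python
# Approximate weight storage for a 32B-parameter model.
params = 32e9
bf16_gb = params * 16 / 8 / 1e9  # 16 bits/parameter -> ~64 GB
fp4_gb = params * 4 / 8 / 1e9    # 4 bits/parameter  -> ~16 GB
print(f"BF16: ~{bf16_gb:.0f} GB, NVFP4: ~{fp4_gb:.0f} GB "
      f"({1 - fp4_gb / bf16_gb:.0%} smaller)")
```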
## Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
Model Usage Code
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string with the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across the available GPUs via tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
```
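vLLM also supports OpenAI-compatible serving; see the [vLLM documentation](https://docs.vllm.ai/en/latest/) for details.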
Model Creation Code
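The recipe below is a minimal sketch of how such a checkpoint can be produced with [LLM Compressor](https://github.com/vllm-project/llm-compressor). It assumes a recent `llmcompressor` release that ships the `NVFP4` scheme; the calibration dataset, sample count, and ignore list are illustrative choices, not the exact settings used for this model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
save_dir = "DeepSeek-R1-Distill-Qwen-32B-NVFP4"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize the linear layers inside the transformer blocks to NVFP4,
# keeping the output head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

# One-shot calibration pass to fit the FP4 weight and activation scales.
oneshot(
    model=model,
    recipe=recipe,
    dataset="open_platypus",   # illustrative calibration set; swap in your own
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the checkpoint in compressed (packed FP4) format.
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```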
## Evaluation

| Category | Metric | DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | arc_challenge | 63.48 | 62.12 | 97.86 |
| OpenLLM V1 | gsm8k | 86.88 | 88.32 | 101.66 |
| OpenLLM V1 | hellaswag | 83.51 | 82.38 | 98.65 |
| OpenLLM V1 | mmlu | 80.97 | 80.42 | 99.32 |
| OpenLLM V1 | truthfulqa_mc2 | 56.82 | 55.75 | 98.12 |
| OpenLLM V1 | winogrande | 75.93 | 75.14 | 98.96 |
| OpenLLM V1 | Average | 74.60 | 74.02 | 99.23 |
| Reasoning | AIME24 (0-shot) | 72.41 | 62.07 | 85.69 |
| Reasoning | AIME25 (0-shot) | 58.62 | 62.07 | 105.89 |
| Reasoning | GPQA (Diamond, 0-shot) | 68.02 | 65.48 | 96.27 |
| Reasoning | Average | 66.35 | 63.21 | 95.95 |
| Coding | HumanEval_64 pass@2 | 90.00 | 89.32 | 99.24 |
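Recovery is computed as 100 × (NVFP4 score / baseline score).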