---
tags:
- fp4
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
---

# DeepSeek-R1-Distill-Qwen-32B-NVFP4

## Model Overview
- **Model Architecture:** DeepSeek-R1-Distill-Qwen-32B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP4
  - **Activation quantization:** FP4
- **Release Date:** 7/30/25
- **Version:** 1.0
- **Model Developers:** RedHatAI

This model is a quantized version of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B). It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

### Model Optimizations

This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) to the FP4 data type, ready for inference with vLLM >= 0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%. Only the weights and activations of the linear operators within transformer blocks are quantized, using [LLM Compressor](https://github.com/vllm-project/llm-compressor).
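As a rough, back-of-the-envelope illustration of that reduction (a sketch only: the ~32.8B parameter count is approximate, and real checkpoints carry extra overhead from quantization scales and unquantized layers):

```python
# Approximate weight-memory footprint before and after FP4 quantization.
# 32.8e9 approximates the parameter count of DeepSeek-R1-Distill-Qwen-32B.
params = 32.8e9

bf16_gb = params * 16 / 8 / 1e9   # 16 bits per parameter -> ~65.6 GB
nvfp4_gb = params * 4 / 8 / 1e9   # 4 bits per parameter  -> ~16.4 GB

print(f"BF16: ~{bf16_gb:.1f} GB, NVFP4: ~{nvfp4_gb:.1f} GB, "
      f"reduction: {1 - nvfp4_gb / bf16_gb:.0%}")
```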
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

**Model Usage Code**

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
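For example, an OpenAI-compatible server can be started with the `vllm serve` entrypoint, with tensor parallelism matching the Python example above:

```
vllm serve RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4 --tensor-parallel-size 2
```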
## Creation

This model was created by applying [LLM Compressor with calibration samples from the neuralmagic/calibration dataset](https://github.com/vllm-project/llm-compressor/blob/main/examples/multimodal_vision/llama4_example.py), as presented in the code snippet below.

**Model Creation Code**
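The exact creation script was not preserved in this card. The snippet below is a minimal sketch of the standard LLM Compressor oneshot flow for NVFP4; the calibration-set size, sequence length, and dataset preprocessing are assumptions modeled on the linked example, not the exact values used to produce this checkpoint.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
NUM_CALIBRATION_SAMPLES = 512  # assumed; typical for NVFP4 calibration
MAX_SEQUENCE_LENGTH = 2048     # assumed

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data (column names follow the neuralmagic/calibration dataset)
ds = load_dataset("neuralmagic/calibration", name="LLM",
                  split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def preprocess(example):
    # Render each conversation with the model's chat template
    return {"text": tokenizer.apply_chat_template(example["messages"],
                                                  tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    # Special tokens were already added by the chat template
    return tokenizer(sample["text"], max_length=MAX_SEQUENCE_LENGTH,
                     truncation=True, add_special_tokens=False)

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Quantize weights and activations of all Linear layers to NVFP4,
# leaving lm_head in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed checkpoint
SAVE_DIR = "DeepSeek-R1-Distill-Qwen-32B-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```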
## Evaluation

This model was evaluated on the well-known OpenLLM v1 and HumanEval_64 benchmarks using [lm-evaluation-harness](https://github.com/neuralmagic/lm-evaluation-harness). The reasoning evaluations were run with [lighteval](https://github.com/neuralmagic/lighteval).

### Accuracy
| Category | Metric | DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-32B-NVFP4 | Recovery (%) |
|----------|--------|------------------------------|------------------------------------|--------------|
| **OpenLLM V1** | arc_challenge | 63.48 | 62.12 | 97.86 |
| | gsm8k | 86.88 | 88.32 | 101.66 |
| | hellaswag | 83.51 | 82.38 | 98.65 |
| | mmlu | 80.97 | 80.42 | 99.32 |
| | truthfulqa_mc2 | 56.82 | 55.75 | 98.12 |
| | winogrande | 75.93 | 75.14 | 98.96 |
| | **Average** | **74.60** | **74.02** | **99.23** |
| **Reasoning** | AIME24 (0-shot) | 72.41 | 62.07 | 85.69 |
| | AIME25 (0-shot) | 58.62 | 62.07 | 105.89 |
| | GPQA (Diamond, 0-shot) | 68.02 | 65.48 | 96.27 |
| | **Average** | **66.35** | **63.21** | **95.95** |
| **Coding** | HumanEval_64 pass@2 | 90.00 | 89.32 | 99.24 |
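Recovery is the quantized model's score expressed as a percentage of the baseline score, with category averages computed from the per-task scores first. A quick check against the OpenLLM V1 rows above:

```python
# Recovery = quantized score / baseline score * 100, using the table values
baseline = {"arc_challenge": 63.48, "gsm8k": 86.88, "hellaswag": 83.51,
            "mmlu": 80.97, "truthfulqa_mc2": 56.82, "winogrande": 75.93}
quantized = {"arc_challenge": 62.12, "gsm8k": 88.32, "hellaswag": 82.38,
             "mmlu": 80.42, "truthfulqa_mc2": 55.75, "winogrande": 75.14}

for task in baseline:
    recovery = quantized[task] / baseline[task] * 100
    print(f"{task}: {recovery:.2f}")  # e.g. gsm8k -> 101.66

# Averages match the table: 74.60, 74.02, and 99.23 respectively
base_avg = sum(baseline.values()) / len(baseline)
quant_avg = sum(quantized.values()) / len(quantized)
print(f"{base_avg:.2f} {quant_avg:.2f} {quant_avg / base_avg * 100:.2f}")
```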
### Reproduction

The results were obtained using the following commands:
**Model Evaluation Commands**

#### OpenLLM v1
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

#### HumanEval_64
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4",dtype=auto,max_model_len=4096,tensor_parallel_size=2,enable_chunked_prefill=True,enforce_eager=True \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks humaneval_64_instruct \
  --batch_size auto
```

#### LightEval
```
# --- model_args.yaml ---
cat > model_args.yaml <<'YAML'
model_parameters:
  model_name: "RedHatAI/DeepSeek-R1-Distill-Qwen-32B-NVFP4"
  dtype: auto
  gpu_memory_utilization: 0.9
  tensor_parallel_size: 2
  max_model_length: 40960
generation_parameters:
  seed: 42
  temperature: 0.6
  top_k: 50
  top_p: 0.95
  min_p: 0.0
  max_new_tokens: 32768
YAML

lighteval vllm model_args.yaml \
  "lighteval|aime24|0,lighteval|aime25|0,lighteval|gpqa:diamond|0" \
  --max-samples -1 \
  --output-dir out_dir
```