RedHatAI/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic

@@ -8,29 +8,33 @@ base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
 library_name: transformers
 ---
-# DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic
 ## Model Overview
-- **Model Architecture:** DeepSeek-R1-Distill-Qwen-14B
   - **Input:** Text
   - **Output:** Text
 - **Model Optimizations:**
   - **Weight quantization:** FP8
   - **Activation quantization:** FP8
-- **Release Date:** 2/6/2025
 - **Version:** 1.0
 - **Model Developers:** Neural Magic
 Quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
 ### Model Optimizations
-This model was obtained by quantizing the weights and activations to FP8 data type, ready for inference with vLLM.
-This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within transformers blocks are quantized.
-## Deployment
-### Use with vLLM
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
@@ -38,11 +42,12 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
-max_model_len, tp_size = 4096, 1
-model_name = "neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
-sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
 messages_list = [
     [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
@@ -64,44 +69,40 @@ This model was created with [llm-compressor](https://github.com/vllm-project/llm
 ```python
-import argparse
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.transformers import oneshot
 import os
-def main():
-    parser = argparse.ArgumentParser(description='Quantize a transformer model to FP8')
-    parser.add_argument('--model_id', type=str, required=True,
-                        help='The model ID from HuggingFace (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")')
-    parser.add_argument('--save_path', type=str, default='.',
-                        help='Custom path to save the quantized model. If not provided, will use model_name-FP8-dynamic')
-    args = parser.parse_args()
-    # Load model
-    model = AutoModelForCausalLM.from_pretrained(
-        args.model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True,
-    )
-    tokenizer = AutoTokenizer.from_pretrained(args.model_id)
-    # Configure the quantization algorithm and scheme
-    recipe = QuantizationModifier(
-        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
-    )
-    # Apply quantization
-    oneshot(model=model, recipe=recipe)
-    save_path = os.path.join(args.save_path, args.model_id.split("/")[1] + "-FP8-dynamic")
-    os.makedirs(save_path, exist_ok=True)
-    # Save to disk in compressed-tensors format
-    model.save_pretrained(save_path)
-    tokenizer.save_pretrained(save_path)
-    print(f"Model and tokenizer saved to: {save_path}")
-if __name__ == "__main__":
-    main()
 ```
 ## Evaluation
@@ -112,7 +113,7 @@ OpenLLM Leaderboard V1:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
@@ -124,7 +125,7 @@ OpenLLM Leaderboard V2:
 ```
 lm_eval \
   --model vllm \
-  --model_args pretrained="neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",dtype=auto,add_bos_token=False,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
   --apply_chat_template \
   --fewshot_as_multiturn \
   --tasks leaderboard \
@@ -132,43 +133,131 @@ lm_eval \
   --batch_size auto \
   --output_path output_dir \
   --show_config
 ```
 ### Accuracy
-#### OpenLLM Leaderboard V1 evaluation scores
-| Metric                                   | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B             | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
-|-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| ARC-Challenge (Acc-Norm, 25-shot)       |            58.79                 |           58.02                             |
-| GSM8K (Strict-Match, 5-shot)            |            87.04                 |           87.41                             |
-| HellaSwag (Acc-Norm, 10-shot)           |            81.51                 |           81.46                        |
-| MMLU (Acc, 5-shot)                      |            74.46                |            74.63                         |
-| TruthfulQA (MC2, 0-shot)                |            54.77                 |           54.36                        |
-| Winogrande (Acc, 5-shot)                |            69.38                 |           68.98                             |
-| **Average Score**                       | **70.99**                        | **70.81**                                   |
-| **Recovery (%)**                            | **100.00**                       | **99.75**                                   |
-#### OpenLLM Leaderboard V2 evaluation scores
-| Metric                                                   | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B             | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
-|---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| IFEval (Inst-and-Prompt Level Strict Acc, 0-shot)       |         43.05                    |               43.69                         |
-| BBH (Acc-Norm, 3-shot)                                  |         47.16                    |               47.92                         |
-| GPQA (Acc-Norm, 0-shot)                                 |         35.07                     |              35.05                           |
-| MUSR (Acc-Norm, 0-shot)                                 |         45.14                     |              44.62                         |
-| MMLU-Pro (Acc, 5-shot)                                  |         34.86                    |               35.04                        |
-| **Average Score**                                       | **41.05**                        | **41.26**                                   |
-| **Recovery (%)**                                            | **100.00**                       | **100.51**                                   |
-#### Coding evaluation scores
-| Metric                                                   | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B             | neuralmagic-ent/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic |
-|---------------------------------------------------------|:---------------------------------:|:-------------------------------------------:|
-| HumanEval pass@1                                         |         78.90                    |             77.20                           |
-| HumanEval pass@10                                        |         89.80                    |             90.40                           |
-| HumanEval+ pass@1                                        |         72.60                    |             72.40                           |
-| HumanEval+ pass@10                                       |         84.90                    |              85.90                          |
-| **Average Score**                                       | **81.55**                        | **81.47**                                   |
-| **Recovery (%)**                                            | **100.00**                       | **99.90**                                   |

 library_name: transformers
 ---
+# DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic
 ## Model Overview
+- **Model Architecture:** Qwen2ForCausalLM
   - **Input:** Text
   - **Output:** Text
 - **Model Optimizations:**
   - **Weight quantization:** FP8
   - **Activation quantization:** FP8
+- **Release Date:** 2/5/2025
 - **Version:** 1.0
 - **Model Developers:** Neural Magic
 Quantized version of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B).
 ### Model Optimizations
+This model was obtained by quantizing the weights and activations of [DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) to the FP8 data type.
+This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
+Only the weights and activations of the linear operators within transformer blocks are quantized.
+Weights are quantized using a symmetric per-channel scheme, whereas activations are quantized using a symmetric per-token scheme.
+[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
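
The per-channel weight scaling and per-token activation scaling described above can be sketched in plain Python. This is only an illustration of how the scales are chosen and undone, not LLM Compressor's implementation: it assumes the FP8 E4M3 variant (maximum magnitude 448), skips the rounding to the FP8 grid, and the helper names `row_scales` and `scale_rows` are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # assumed format: largest finite magnitude in FP8 E4M3


def row_scales(matrix):
    """One symmetric scale per row: max |value| maps to the FP8 maximum."""
    return [max(max(abs(v) for v in row), 1e-12) / FP8_E4M3_MAX for row in matrix]


def scale_rows(matrix, scales):
    """Divide each row by its scale (rounding to the FP8 grid is omitted)."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


# Toy tensors: 2 output channels x 3 inputs, and 2 tokens x 3 hidden values.
weight = [[0.5, -2.0, 1.0], [4.0, 0.25, -1.0]]
activations = [[1.0, -3.0, 0.5], [0.1, 0.2, -0.4]]

# Per-channel: one scale per output channel, computed once from the weights.
w_scales = row_scales(weight)
# Per-token: one scale per token, recomputed for every input ("dynamic").
a_scales = row_scales(activations)

w_q = scale_rows(weight, w_scales)
a_q = scale_rows(activations, a_scales)

# Linear layer in the scaled domain, then undo both scales on each output.
y = [
    [sum(a * w for a, w in zip(a_row, w_row)) * a_s * w_s
     for w_row, w_s in zip(w_q, w_scales)]
    for a_row, a_s in zip(a_q, a_scales)
]
```

Because the scheme is symmetric, no zero point is stored. Each weight row and each token carries a single scale, and in this rounding-free sketch `y` matches the unquantized product of `activations` and the transposed `weight`.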
+## Use with vLLM
 This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
+number_gpus = 1
+model_name = "neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
+sampling_params = SamplingParams(temperature=0.6, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
+llm = LLM(model=model_name, tensor_parallel_size=number_gpus, trust_remote_code=True)
 messages_list = [
     [{"role": "user", "content": "Who are you? Please respond in pirate speak!"}],
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from llmcompressor.modifiers.quantization import QuantizationModifier
 from llmcompressor.transformers import oneshot
 import os
+# Load model
+model_stub = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
+model_name = model_stub.split("/")[-1]
+model = AutoModelForCausalLM.from_pretrained(
+    model_stub,
+    torch_dtype="auto",
+)
+tokenizer = AutoTokenizer.from_pretrained(model_stub)
+# Configure the quantization algorithm and scheme
+recipe = QuantizationModifier(
+    targets="Linear",
+    scheme="FP8_DYNAMIC",
+    ignore=["lm_head"],
+)
+# Apply quantization
+oneshot(
+    model=model,
+    recipe=recipe,
+)
+# Save to disk in compressed-tensors format
+save_path = model_name + "-FP8-dynamic"
+model.save_pretrained(save_path)
+tokenizer.save_pretrained(save_path)
+print(f"Model and tokenizer saved to: {save_path}")
 ```
 ## Evaluation
 ```
 lm_eval \
   --model vllm \
+  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
   --tasks openllm \
   --write_out \
   --batch_size auto \
 ```
 lm_eval \
   --model vllm \
+  --model_args pretrained="neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic",dtype=auto,max_model_len=4096,tensor_parallel_size=1,enable_chunked_prefill=True \
   --apply_chat_template \
   --fewshot_as_multiturn \
   --tasks leaderboard \
   --batch_size auto \
   --output_path output_dir \
   --show_config
 ```
 ### Accuracy
+<table>
+  <thead>
+    <tr>
+      <th>Category</th>
+      <th>Metric</th>
+      <th>deepseek-ai/DeepSeek-R1-Distill-Qwen-14B</th>
+      <th>neuralmagic/DeepSeek-R1-Distill-Qwen-14B-FP8-dynamic</th>
+      <th>Recovery</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td rowspan="7"><b>OpenLLM V1</b></td>
+      <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+      <td>58.79</td>
+      <td>58.02</td>
+      <td>98.7%</td>
+    </tr>
+    <tr>
+      <td>GSM8K (Strict-Match, 5-shot)</td>
+      <td>87.04</td>
+      <td>87.41</td>
+      <td>100.4%</td>
+    </tr>
+    <tr>
+      <td>HellaSwag (Acc-Norm, 10-shot)</td>
+      <td>81.51</td>
+      <td>81.46</td>
+      <td>100.0%</td>
+    </tr>
+    <tr>
+      <td>MMLU (Acc, 5-shot)</td>
+      <td>74.46</td>
+      <td>74.63</td>
+      <td>100.2%</td>
+    </tr>
+    <tr>
+      <td>TruthfulQA (MC2, 0-shot)</td>
+      <td>54.77</td>
+      <td>54.36</td>
+      <td>99.3%</td>
+    </tr>
+    <tr>
+      <td>Winogrande (Acc, 5-shot)</td>
+      <td>69.38</td>
+      <td>68.98</td>
+      <td>99.4%</td>
+    </tr>
+    <tr>
+      <td><b>Average Score</b></td>
+      <td><b>70.99</b></td>
+      <td><b>70.81</b></td>
+      <td><b>99.8%</b></td>
+    </tr>
+    <tr>
+      <td rowspan="7"><b>OpenLLM V2</b></td>
+      <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+      <td>43.05</td>
+      <td>43.69</td>
+      <td>101.5%</td>
+    </tr>
+    <tr>
+      <td>BBH (Acc-Norm, 3-shot)</td>
+      <td>47.16</td>
+      <td>47.92</td>
+      <td>101.6%</td>
+    </tr>
+    <tr>
+      <td>Math-Hard (Exact-Match, 4-shot)</td>
+      <td>0.00</td>
+      <td>0.00</td>
+      <td>---</td>
+    </tr>
+    <tr>
+      <td>GPQA (Acc-Norm, 0-shot)</td>
+      <td>35.07</td>
+      <td>35.05</td>
+      <td>100.0%</td>
+    </tr>
+    <tr>
+      <td>MUSR (Acc-Norm, 0-shot)</td>
+      <td>45.14</td>
+      <td>44.62</td>
+      <td>98.8%</td>
+    </tr>
+    <tr>
+      <td>MMLU-Pro (Acc, 5-shot)</td>
+      <td>34.86</td>
+      <td>35.04</td>
+      <td>100.5%</td>
+    </tr>
+    <tr>
+      <td><b>Average Score</b></td>
+      <td><b>34.21</b></td>
+      <td><b>34.39</b></td>
+      <td><b>100.5%</b></td>
+    </tr>
+    <tr>
+      <td rowspan="4"><b>Coding</b></td>
+      <td>HumanEval (pass@1)</td>
+      <td>78.90</td>
+      <td>77.20</td>
+      <td>97.9%</td>
+    </tr>
+    <tr>
+      <td>HumanEval (pass@10)</td>
+      <td>89.80</td>
+      <td>90.40</td>
+      <td>100.7%</td>
+    </tr>
+    <tr>
+      <td>HumanEval+ (pass@1)</td>
+      <td>72.60</td>
+      <td>72.40</td>
+      <td>99.7%</td>
+    </tr>
+    <tr>
+      <td>HumanEval+ (pass@10)</td>
+      <td>84.90</td>
+      <td>85.90</td>
+      <td>101.2%</td>
+    </tr>
+  </tbody>
+</table>
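
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. A minimal sketch under that assumption follows; the `recovery` helper is hypothetical and the score pairs are copied from the table.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline; None if baseline is 0."""
    if baseline == 0:
        return None  # undefined, as with Math-Hard (0.00 for both models)
    return 100.0 * quantized / baseline


# (metric, baseline score, quantized score) taken from the table.
rows = [
    ("ARC-Challenge", 58.79, 58.02),
    ("GSM8K", 87.04, 87.41),
    ("Math-Hard", 0.00, 0.00),
    ("MMLU-Pro", 34.86, 35.04),
]

for name, base, quant in rows:
    value = recovery(base, quant)
    print(f"{name}: {'---' if value is None else f'{value:.1f}%'}")
# ARC-Challenge: 98.7%
# GSM8K: 100.4%
# Math-Hard: ---
# MMLU-Pro: 100.5%
```

Returning `None` for a zero baseline mirrors the `---` entry in the Math-Hard row, where the ratio is undefined.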