nanzhang/QuantLRM-R1-Qwen-32B-3-bit

@@ -1,10 +1,16 @@
 ---
 license: apache-2.0
 tags:
 - 3-bit
 - Quantization
 - Pseudo-Quantization
 ---
 # QuantLRM-R1-Qwen-32B-3-bit
 3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models via fine-tuning signals.
@@ -13,30 +19,17 @@ tags:
 This is the pseudo-quantized model (weights are dequantized back to full precision), which makes it usable with `vLLM`, the recommended way to run inference. To obtain the real quantized version, please refer to our [GitHub repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support inference of 4-bit real quantized models.
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
 - **Developed by:** Nan Zhang (njz5124@psu.edu)
 - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B`
-### Model Sources
-<!-- Provide the basic links for the model. -->
 - **Repository:** https://github.com/psunlpgroup/QuantLRM
 - **Paper:** https://www.arxiv.org/abs/2602.02581
 ## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 This model is designed to be used with `vLLM`, which provides optimized inference. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.
 ## Calibration Data
 We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.
@@ -49,8 +42,6 @@ This model achieves more than 3% improvement (based on average scores of various
 ## Citation
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
 ```bibtex
@@ -71,10 +62,6 @@ This model achieves more than 3% improvement (based on average scores of various
 Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
 ```
-## Model Card Author
-Nan Zhang
-## Model Card Contact
-njz5124@psu.edu

 ---
 license: apache-2.0
+library_name: transformers
+base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
+pipeline_tag: text-generation
 tags:
 - 3-bit
 - Quantization
 - Pseudo-Quantization
+- reasoning
+- arxiv:2602.02581
 ---
 # QuantLRM-R1-Qwen-32B-3-bit
 3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models via fine-tuning signals.
 This is the pseudo-quantized model (weights are dequantized back to full precision), which makes it usable with `vLLM`, the recommended way to run inference. To obtain the real quantized version, please refer to our [GitHub repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support inference of 4-bit real quantized models.
 - **Developed by:** Nan Zhang (njz5124@psu.edu)
 - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B`
 - **Repository:** https://github.com/psunlpgroup/QuantLRM
 - **Paper:** https://www.arxiv.org/abs/2602.02581
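
To make the term "pseudo-quantized" concrete, here is a minimal sketch of a generic quantize-then-dequantize round trip. It assumes plain round-to-nearest, group-wise min-max quantization; it is not the QuantLRM algorithm, and the function name `pseudo_quantize` and the group size are hypothetical choices made for illustration only.

```python
import random


def pseudo_quantize(weights, n_bit=3, group_size=16):
    """Quantize each group to n_bit integer levels, then map back to floats.

    Illustrative round-to-nearest min-max scheme (an assumption, not QuantLRM).
    """
    max_int = 2 ** n_bit - 1  # 3 bits -> integer levels 0..7
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # One scale per group; the floor avoids division by zero for flat groups.
        scale = max(hi - lo, 1e-5) / max_int
        for w in group:
            q = min(max(round((w - lo) / scale), 0), max_int)  # integer code
            restored.append(q * scale + lo)  # dequantized float value
    return restored


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(64)]
restored = pseudo_quantize(weights, n_bit=3, group_size=16)
```

The output has the same length and floating-point type as the input, but each group holds at most `2 ** n_bit` distinct values. This is why a pseudo-quantized checkpoint loads like an ordinary full-precision model while reflecting the accuracy of the low-bit weights.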
 ## Uses
 This model is designed to be used with `vLLM`, which provides optimized inference. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.
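
The usage described above could look roughly like the following sketch. The helper names (`build_engine_kwargs`, `run_inference`), the `tensor_parallel_size` default, and the sampling values are assumptions made for illustration, not settings reported by the authors. The model ID is inferred from this repository's name.

```python
# Model ID assumed from this repository's name; tokenizer ID as stated in this card.
MODEL_ID = "nanzhang/QuantLRM-R1-Qwen-32B-3-bit"
TOKENIZER_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


def build_engine_kwargs(tensor_parallel_size=2):
    """Keyword arguments for vllm.LLM; the GPU count is an assumption."""
    return {
        "model": MODEL_ID,
        "tokenizer": TOKENIZER_ID,
        "tensor_parallel_size": tensor_parallel_size,
    }


def run_inference(prompts, max_tokens=4096):
    """Generate one completion per prompt (needs vLLM and suitable GPUs)."""
    # Imported lazily so the helper above can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    # Placeholder sampling values; tune them for your own evaluation setup.
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Calling `run_inference(["What is 17 * 24?"])` would return a list with one generated string. Because the pseudo-quantized weights are stored in full precision, plan GPU memory as for the unquantized 32B model.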
 ## Calibration Data
 We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.
 ## Citation
 **BibTeX:**
 ```bibtex
 Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
 ```
+## Acknowledgement
+* Our quantization pipeline is built on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.
+* The idea of searching only for the scales of `o_proj` and `down_proj` on Olmo3 is based on LLM Compressor: https://github.com/vllm-project/llm-compressor.