---
license: apache-2.0
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
- reasoning
- arxiv:2602.02581
---

# QuantLRM-R1-Qwen-32B-3-bit

3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models via fine-tuning signals.

## Model Details

This is the pseudo-quantized model: the quantized weights are dequantized back to full precision to facilitate inference with `vLLM`, which is the recommended way to run this model. To obtain the real quantized version, please refer to our [GitHub repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support inference of 4-bit real-quantized models.

- **Developed by:** Nan Zhang (njz5124@psu.edu)
- **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B`
- **Repository:** https://github.com/psunlpgroup/QuantLRM
- **Paper:** https://www.arxiv.org/abs/2602.02581

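As an illustrative sketch of what pseudo-quantization means here, weights can be mapped to low-bit integer codes and immediately dequantized back to floats. This toy example uses plain uniform round-to-nearest quantization with per-group scales; it is not QuantLRM's actual AWQ-based scheme, and the group size is an arbitrary assumption:

```python
def pseudo_quantize(weights, bits=3, group_size=4):
    """Quantize weights to `bits`-bit integer codes per group, then
    dequantize back to floats (pseudo-quantization). Illustrative only;
    QuantLRM's actual scheme differs."""
    qmax = (1 << bits) - 1  # 7 distinct steps above zero for 3-bit
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        for w in group:
            q = round((w - lo) / scale)  # integer code in [0, qmax]
            out.append(q * scale + lo)   # dequantize back to float
    return out
```

The dequantized weights live on a coarse per-group grid but are stored in full precision, which is what lets standard full-precision runtimes load the checkpoint unchanged.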
## Uses

This model is designed to be used with `vLLM` due to its inference optimizations. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.

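A minimal offline-inference sketch with `vLLM`, assuming the checkpoint has been downloaded locally; the model path, prompt, and sampling parameters below are illustrative placeholders, not values from the paper:

```python
# Sketch of offline inference with vLLM; requires a GPU and the checkpoint on disk.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/QuantLRM-R1-Qwen-32B-3-bit",            # this pseudo-quantized checkpoint
    tokenizer="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # tokenizer of the base model
)
params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["Solve step by step: what is 12 * 34?"], params)
print(outputs[0].outputs[0].text)
```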
## Calibration Data

We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.

## Results

This model achieves an improvement of more than 3% in average score across various reasoning benchmarks over the best 3-bit quantization baseline on R1-Qwen-32B (Table 2 of the QuantLRM paper).

## Citation

**BibTeX:**

```bibtex
@misc{zhang2026quantlrmquantizationlargereasoning,
  title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals},
  author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
  year={2026},
  eprint={2602.02581},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02581},
}
```

**APA:**

```
Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
```

## Acknowledgement

* Our quantization pipeline builds on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.