--- license: apache-2.0 library_name: transformers base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B pipeline_tag: text-generation tags: - 3-bit - Quantization - Pseudo-Quantization - reasoning - arxiv:2602.02581 --- # QuantLRM-R1-Qwen-32B-3-bit 3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals. ## Model Details This is the pseudo-quantized model (weights are dequantized back to full-precision) to facilitate the use of `vLLM`, which is the recommended way of inference. To obtain the real quantized version, please refer to our [Github repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support the inference of 4-bit real quantized models. - **Developed by:** Nan Zhang (njz5124@psu.edu) - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B` - **Repository:** https://github.com/psunlpgroup/QuantLRM - **Paper:** https://www.arxiv.org/abs/2602.02581 ## Uses This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`. ## Calibration Data We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model. ## Results This model achieves more than 3% improvement (based on average scores of various reasoning benchmarks) than the best 3-bit quantization baseline on R1-Qwen-32B (Table 2 of QuantLRM). ## Citation **BibTeX:** ```bibtex @misc{zhang2026quantlrmquantizationlargereasoning, title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals}, author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang}, year={2026}, eprint={2602.02581}, archivePrefix={arXiv}, primaryClass={cs.LG}, url={https://arxiv.org/abs/2602.02581}, } ``` **APA:** ``` Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581. ``` ## Acknowledgement * Our quantization pipeline is developed based on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.