---
license: apache-2.0
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
- reasoning
- arxiv:2602.02581
---
# QuantLRM-R1-Qwen-32B-3-bit
3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models using fine-tuning signals.
## Model Details
This is the pseudo-quantized model: the weights have been quantized to 3 bits and then dequantized back to full precision, so the checkpoint can be served with `vLLM`, which is the recommended way to run inference. To obtain the real (packed) quantized version, please refer to our [GitHub repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support inference of 4-bit real quantized models.
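To illustrate what "pseudo-quantized" means here, the sketch below round-trips weights through a low-bit grid group-wise, so the stored tensors stay full-precision floats that ordinary kernels can run. The group size and asymmetric min-max scaling follow common AWQ-style practice and are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

def pseudo_quantize(w: np.ndarray, n_bits: int = 3, group_size: int = 128) -> np.ndarray:
    """Quantize then immediately dequantize weights, group by group.

    Illustrative sketch only: group-wise asymmetric min-max quantization
    (AWQ-style defaults assumed), not QuantLRM's exact procedure.
    """
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                    # one scale/zero per group
    w_min = g.min(axis=1, keepdims=True)
    w_max = g.max(axis=1, keepdims=True)
    max_code = 2 ** n_bits - 1                       # 3-bit -> codes 0..7
    scale = np.maximum(w_max - w_min, 1e-8) / max_code
    q = np.clip(np.round((g - w_min) / scale), 0, max_code)   # integer codes
    return (q * scale + w_min).reshape(orig_shape)            # back to floats

# After the round trip, each group takes at most 2**n_bits distinct values,
# but the array dtype is still full-precision float.
w = np.random.randn(4, 128).astype(np.float32)
w_dq = pseudo_quantize(w)
```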
- **Developed by:** Nan Zhang (njz5124@psu.edu)
- **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B`
- **Repository:** https://github.com/psunlpgroup/QuantLRM
- **Paper:** https://www.arxiv.org/abs/2602.02581
## Uses
This model is designed to be used with `vLLM` for its inference optimizations. Please use the tokenizer from `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.
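A minimal serving sketch along those lines is shown below. The local checkpoint path and sampling values are illustrative assumptions, and the imports are deferred because `vLLM` requires a CUDA-capable machine; the prompt is built with the base model's own chat template rather than hand-written special tokens.

```python
def run_inference(model_path: str, question: str) -> str:
    """Generate with the pseudo-quantized checkpoint via vLLM (sketch).

    `model_path` is a local path to this checkpoint (placeholder); sampling
    values below are illustrative, not prescribed by the model card.
    """
    # Deferred imports: vLLM needs a GPU; the tokenizer download needs network.
    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    # Use the base model's tokenizer and chat template, as recommended above.
    tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )

    llm = LLM(model=model_path, tokenizer="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
    sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)
    return llm.generate([prompt], sampling)[0].outputs[0].text
```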
## Calibration Data
We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.
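For reference, a hedged sketch of loading that calibration set follows; the split name and sample count are assumptions in line with common AWQ-style calibration, not values stated in this card.

```python
def load_calibration_texts(n_samples: int = 128, seed: int = 42) -> list[str]:
    """Load text samples from the default QuantLRM calibration set (sketch).

    `n_samples` and `seed` are illustrative defaults, not the paper's values.
    """
    # Deferred import: `datasets` fetches the data over the network.
    from datasets import load_dataset

    ds = load_dataset("mit-han-lab/pile-val-backup", split="validation")
    ds = ds.shuffle(seed=seed)
    return [ds[i]["text"] for i in range(n_samples)]
```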
## Results
This model achieves an improvement of more than 3% (based on average scores across various reasoning benchmarks) over the best 3-bit quantization baseline on R1-Qwen-32B (Table 2 of QuantLRM).
## Citation
**BibTeX:**
```bibtex
@misc{zhang2026quantlrmquantizationlargereasoning,
      title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals},
      author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
      year={2026},
      eprint={2602.02581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02581},
}
```
**APA:**
```
Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
```
## Acknowledgement
* Our quantization pipeline builds on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.