---
license: apache-2.0
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
- reasoning
- arxiv:2602.02581
---

# QuantLRM-R1-Qwen-32B-3-bit 

3-bit quantized `DeepSeek-R1-Distill-Qwen-32B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals.

## Model Details

This is the pseudo-quantized model (weights are dequantized back to full-precision) to facilitate the use of `vLLM`, which is the recommended way of inference. To obtain the real quantized version, please refer to our [Github repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support the inference of 4-bit real quantized models. 

- **Developed by:** Nan Zhang (njz5124@psu.edu)
- **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Qwen-32B`
- **Repository:** https://github.com/psunlpgroup/QuantLRM
- **Paper:** https://www.arxiv.org/abs/2602.02581


## Uses

This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`.


## Calibration Data

We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.


## Results

This model achieves more than 3% improvement (based on average scores of various reasoning benchmarks) than the best 3-bit quantization baseline on R1-Qwen-32B (Table 2 of QuantLRM).


## Citation

**BibTeX:**

```bibtex
@misc{zhang2026quantlrmquantizationlargereasoning,
      title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals}, 
      author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
      year={2026},
      eprint={2602.02581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02581}, 
}
```

**APA:**

```
Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
```

## Acknowledgement
* Our quantization pipeline is developed based on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.