| --- |
| license: apache-2.0 |
| tags: |
| - 3-bit |
| - Quantization |
| - Pseudo-Quantization |
| pipeline_tag: text-generation |
| library_name: transformers |
| base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| --- |
| |
| # QuantLRM-R1-Llama-70B-3-bit |
|
|
| 3-bit quantized `DeepSeek-R1-Distill-Llama-70B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals. |
|
|
| ## Model Details |
|
|
| This is the pseudo-quantized model (weights are dequantized back to full-precision) to facilitate the use of `vLLM`, which is the recommended way of inference. To obtain the real quantized version, please refer to our [Github repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support the inference of 4-bit real quantized models. |
|
|
|
|
| - **Developed by:** Nan Zhang (njz5124@psu.edu) |
| - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Llama-70B` |
| - **Repository:** https://github.com/psunlpgroup/QuantLRM |
| - **Paper:** https://www.arxiv.org/abs/2602.02581 |
|
|
|
|
| ## Uses |
|
|
|
|
| This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`. |
|
|
|
|
|
|
| ## Calibration Data |
|
|
| We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model. |
|
|
|
|
| ## Results |
|
|
| This model achieves 2.12% improvement (based on average scores of various reasoning benchmarks) than the best 3-bit quantization baseline on R1-Llama-70B (Table 2 of QuantLRM). |
|
|
|
|
| ## Citation |
|
|
|
|
| **BibTeX:** |
|
|
| ```bibtex |
| @misc{zhang2026quantlrmquantizationlargereasoning, |
| title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals}, |
| author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang}, |
| year={2026}, |
| eprint={2602.02581}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.LG}, |
| url={https://arxiv.org/abs/2602.02581}, |
| } |
| ``` |
|
|
| **APA:** |
|
|
| ``` |
| Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581. |
| ``` |
|
|
|
|
| ## Acknowledgement |
| * Our quantization pipeline is developed based on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main. |