license: apache-2.0
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
pipeline_tag: text-generation
tags:
  - 3-bit
  - Quantization
  - Pseudo-Quantization
  - reasoning
  - arxiv:2602.02581

QuantLRM-R1-Qwen-32B-3-bit

A 3-bit quantized version of DeepSeek-R1-Distill-Qwen-32B produced with QuantLRM, a state-of-the-art method for quantizing large reasoning models via fine-tuning signals.

Model Details

This is the pseudo-quantized model: the weights are quantized to 3 bits and then dequantized back to full precision. This makes the model easy to run with vLLM, which is the recommended way to run inference. To obtain the real quantized version, please refer to our GitHub repo. We use an existing CUDA kernel to support inference of 4-bit real quantized models.

Developed by: Nan Zhang (njz5124@psu.edu)
Model type: 3-bit pseudo-quantized version of DeepSeek-R1-Distill-Qwen-32B
Repository: https://github.com/psunlpgroup/QuantLRM
Paper: https://www.arxiv.org/abs/2602.02581
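
To make the term "pseudo-quantized" concrete, here is a minimal sketch of a generic round-to-nearest quantize-then-dequantize step on a single group of weights. It is an illustration only and not the QuantLRM algorithm: the function name `pseudo_quantize`, the min/max asymmetric scheme, and the plain Python lists are all assumptions made for this example.

```python
def pseudo_quantize(weights, n_bits=3):
    """Quantize one group of weights to n_bits, then dequantize to floats.

    Illustrative round-to-nearest scheme (an assumption for this sketch),
    not the QuantLRM method itself.
    """
    levels = 2 ** n_bits - 1              # 7 steps, i.e. 8 values for 3-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0:                        # constant group: nothing to quantize
        return list(weights)

    result = []
    for w in weights:
        # Map to an integer code in [0, levels] ...
        q = round((w - w_min) / scale)
        q = min(max(q, 0), levels)
        # ... and map the code straight back to a float.
        result.append(q * scale + w_min)
    return result


group = [i / 10 - 1 for i in range(21)]   # 21 values from -1.0 to 1.0
restored = pseudo_quantize(group, n_bits=3)
print(len(set(restored)))                 # at most 8 distinct values remain
```

The returned values are ordinary floats, so a checkpoint built this way loads like a full-precision model while carrying the rounding error of the low-bit grid. It also means the stored weights are not smaller than the originals; the size reduction comes only with the real quantized version.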

Uses

This model is designed to be used with vLLM, which provides optimized inference. Please use the tokenizer of deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.
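
The setup described above could look roughly like the following sketch. Several details are assumptions rather than settings taken from this card: `MODEL_PATH` is a placeholder for wherever this checkpoint lives, `build_llm_kwargs` and `generate` are hypothetical helper names, and the `dtype`, `tensor_parallel_size`, and sampling values are illustrative defaults that you should adapt to your hardware and task.

```python
# Placeholder: replace with the Hub ID or local path of this checkpoint.
MODEL_PATH = "path/to/QuantLRM-R1-Qwen-32B-3-bit"

# The card asks for the tokenizer of the base model.
BASE_TOKENIZER = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


def build_llm_kwargs(model_path, tensor_parallel_size=1):
    """Collect the arguments passed to vllm.LLM (values are assumptions)."""
    return {
        "model": model_path,
        "tokenizer": BASE_TOKENIZER,
        "dtype": "bfloat16",                    # assumed precision
        "tensor_parallel_size": tensor_parallel_size,
    }


def generate(model_path, prompts, tensor_parallel_size=1):
    """Run the prompts through vLLM and return the generated strings."""
    # Imported lazily so the helper above works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_llm_kwargs(model_path, tensor_parallel_size))
    # Sampling values are illustrative, not settings published in this card.
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]


# Example call (needs GPUs with enough memory for a 32B model):
# print(generate(MODEL_PATH, ["What is 17 * 24? Think step by step."])[0])
```

Because the weights are stored dequantized, the memory footprint is about that of the base 32B model, so raising `tensor_parallel_size` may be necessary on smaller GPUs.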

Calibration Data

We use the default calibration set of QuantLRM (mit-han-lab/pile-val-backup) to obtain this model.

Results

This model improves by more than 3% over the best 3-bit quantization baseline on R1-Qwen-32B, measured by the average score across several reasoning benchmarks (Table 2 of the QuantLRM paper).

Citation

BibTeX:

@misc{zhang2026quantlrmquantizationlargereasoning,
      title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals}, 
      author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
      year={2026},
      eprint={2602.02581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02581}, 
}

APA:

Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.

Acknowledgement

Our quantization pipeline builds on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.