nanzhang
/

QuantLRM-R1-Llama-70B-3-bit

Text Generation

Pseudo-Quantization

text-generation-inference

Model card Files Files and versions

QuantLRM-R1-Llama-70B-3-bit / README.md

nanzhang's picture

Update README.md

a6b230a verified 2 months ago

|

history blame contribute delete

2.31 kB

	---
	license: apache-2.0
	tags:
	- 3-bit
	- Quantization
	- Pseudo-Quantization
	pipeline_tag: text-generation
	library_name: transformers
	base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
	---

	# QuantLRM-R1-Llama-70B-3-bit

	3-bit quantized `DeepSeek-R1-Distill-Llama-70B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals.

	## Model Details

	This is the pseudo-quantized model (weights are dequantized back to full-precision) to facilitate the use of `vLLM`, which is the recommended way of inference. To obtain the real quantized version, please refer to our [Github repo](https://github.com/psunlpgroup/QuantLRM). We use an existing CUDA kernel to support the inference of 4-bit real quantized models.


	- Developed by: Nan Zhang (njz5124@psu.edu)
	- Model type: 3-bit pseudo-quantized version of `DeepSeek-R1-Distill-Llama-70B`
	- Repository: https://github.com/psunlpgroup/QuantLRM
	- Paper: https://www.arxiv.org/abs/2602.02581


	## Uses


	This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-Distill-Llama-70B`.



	## Calibration Data

	We use the default calibration set of QuantLRM (`mit-han-lab/pile-val-backup`) to obtain this model.


	## Results

	This model achieves 2.12% improvement (based on average scores of various reasoning benchmarks) than the best 3-bit quantization baseline on R1-Llama-70B (Table 2 of QuantLRM).


	## Citation


	BibTeX:

	```bibtex
	@misc{zhang2026quantlrmquantizationlargereasoning,
	title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals},
	author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
	year={2026},
	eprint={2602.02581},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2602.02581},
	}
	```

	APA:

	```
	Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
	```


	## Acknowledgement
	* Our quantization pipeline is developed based on AWQ: https://github.com/mit-han-lab/llm-awq/tree/main.