Add metadata, sample usage, and improve model details
#1 by nielsr (HF Staff) - opened

README.md CHANGED
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
pipeline_tag: text-generation
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
---

# QuantLRM-R1-Qwen3-8B-3-bit

3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models using fine-tuning signals.

## Model Details

This is the pseudo-quantized model (weights are dequantized back to full-precision).

- **Developed by:** Nan Zhang (njz5124@psu.edu)
- **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-0528-Qwen3-8B`
- **Base Model:** `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`

### Model Sources

This model is designed to be used with `vLLM` due to its inference optimizations. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`.

## Sample Usage

To use this model, you can follow the steps below from the [QuantLRM GitHub repository](https://github.com/psunlpgroup/QuantLRM).

First, compute input channel importance scores:

```bash
python compare_weight_matrix.py
python quadratic_mapping.py  # supports processing weight updates on GPU
```
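As a rough illustration of what an input-channel importance score looks like, here is a toy magnitude-based proxy. Note this is an assumption for illustration only: QuantLRM derives importance from fine-tuning signals via the scripts above, not from raw weight magnitudes.

```python
import numpy as np

# Toy proxy (NOT QuantLRM's method): score each input channel of a
# weight matrix by its mean absolute magnitude. This only illustrates
# the shape of an importance-score vector: one score per input channel.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 4096))      # [out_features, in_features]
importance = np.abs(W).mean(axis=0)    # one score per input channel
top_channels = np.argsort(importance)[::-1][:8]  # most important first
print(importance.shape, top_channels.shape)
```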

Then, run the quantization pipeline to search for the optimal scales:

```bash
python -m awq.entry --model_path /PATH/TO/LRM \
    --w_bit 3 --q_group_size 128 --run_awq --dump_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt
```

For inference with the pseudo-quantized model using `vLLM`:

```bash
python -m awq.entry --model_path /PATH/TO/LRM \
    --w_bit 3 --q_group_size 128 \
    --load_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt \
    --q_backend fake --dump_fake models/R1-Qwen3-8B-w3-g128

CUDA_VISIBLE_DEVICES=0 python inference_vllm.py
```
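The `--q_backend fake` step above produces pseudo-quantized weights: values are rounded to 3-bit codes per group of 128 channels, then immediately dequantized back to floating point. A minimal sketch of this round trip, assuming standard group-wise asymmetric uniform quantization (the scale search itself is what `awq.entry` performs and is not shown here):

```python
import numpy as np

def fake_quantize(w, n_bit=3, group_size=128):
    """Group-wise asymmetric uniform quantize, then dequantize back to
    float. Mirrors what a 'fake'/pseudo-quantization backend does, so the
    model can still run with ordinary full-precision kernels."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)            # one row per quantization group
    qmax = 2 ** n_bit - 1                    # 8 levels -> codes 0..7 for 3-bit
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min).clip(min=1e-8) / qmax   # avoid /0 in flat groups
    zero = np.round(-w_min / scale)                 # integer zero-point
    q = np.clip(np.round(w / scale) + zero, 0, qmax)  # 3-bit codes
    w_dq = (q - zero) * scale                       # dequantize
    return w_dq.reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
w_dq = fake_quantize(w)
print(np.abs(w - w_dq).max())  # error bounded by ~one quantization step
```

Each group of 128 values collapses onto at most 8 distinct levels, which is where the 3-bit memory savings come from once the codes are stored packed instead of dequantized.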

## Calibration Data