Add metadata, sample usage, and improve model details
#1 by nielsr (HF Staff) - opened

README.md CHANGED
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
pipeline_tag: text-generation
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
---

# QuantLRM-R1-Qwen3-8B-3-bit

3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models using fine-tuning signals.

## Model Details

This is the pseudo-quantized model (weights are dequantized back to full-precision).

- **Developed by:** Nan Zhang (njz5124@psu.edu)
- **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-0528-Qwen3-8B`
- **Base Model:** `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`

### Model Sources

This model is designed to be used with `vLLM` due to its inference optimizations. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`.

## Sample Usage

To use this model, you can follow the steps below from the [QuantLRM GitHub repository](https://github.com/psunlpgroup/QuantLRM).

First, compute input channel importance scores:

```bash
python compare_weight_matrix.py
python quadratic_mapping.py  # supports processing weight updates on GPU
```
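As a rough illustration of what an input-channel importance score looks like, here is a toy magnitude-based proxy. Note this is an assumption for illustration only: QuantLRM derives importance from fine-tuning signals via the scripts above, not from raw weight magnitudes.

```python
import numpy as np

# Toy proxy (NOT QuantLRM's method): score each input channel of a
# weight matrix by its mean absolute magnitude. This only illustrates
# the shape of an importance-score vector: one score per input channel.
rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 4096))      # [out_features, in_features]
importance = np.abs(W).mean(axis=0)    # one score per input channel
top_channels = np.argsort(importance)[::-1][:8]  # most important first
print(importance.shape, top_channels.shape)
```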

Then, run the quantization pipeline to search for the optimal scales:

```bash
python -m awq.entry --model_path /PATH/TO/LRM \
    --w_bit 3 --q_group_size 128 --run_awq --dump_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt
```

For inference with the pseudo-quantized model using `vLLM`:

```bash
python -m awq.entry --model_path /PATH/TO/LRM \
    --w_bit 3 --q_group_size 128 \
    --load_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt \
    --q_backend fake --dump_fake models/R1-Qwen3-8B-w3-g128

CUDA_VISIBLE_DEVICES=0 python inference_vllm.py
```
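The `--q_backend fake` step above produces pseudo-quantized weights: values are rounded to 3-bit codes per group of 128 channels, then immediately dequantized back to floating point. A minimal sketch of this round trip, assuming standard group-wise asymmetric uniform quantization (the scale search itself is what `awq.entry` performs and is not shown here):

```python
import numpy as np

def fake_quantize(w, n_bit=3, group_size=128):
    """Group-wise asymmetric uniform quantize, then dequantize back to
    float. Mirrors what a 'fake'/pseudo-quantization backend does, so the
    model can still run with ordinary full-precision kernels."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)            # one row per quantization group
    qmax = 2 ** n_bit - 1                    # 8 levels -> codes 0..7 for 3-bit
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min).clip(min=1e-8) / qmax   # avoid /0 in flat groups
    zero = np.round(-w_min / scale)                 # integer zero-point
    q = np.clip(np.round(w / scale) + zero, 0, qmax)  # 3-bit codes
    w_dq = (q - zero) * scale                       # dequantize
    return w_dq.reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
w_dq = fake_quantize(w)
print(np.abs(w - w_dq).max())  # error bounded by ~one quantization step
```

Each group of 128 values collapses onto at most 8 distinct levels, which is where the 3-bit memory savings come from once the codes are stored packed instead of dequantized.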

## Calibration Data