Add metadata, sample usage, and improve model details
Hi, I'm Niels from the community science team at Hugging Face.
This PR improves the model card by adding key metadata to enhance discoverability and user experience on the Hub:
- `pipeline_tag: text-generation`: Ensures the model appears in relevant searches.
- `library_name: transformers`: Enables the automated "Use in Transformers" code snippet button.
- `base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`: Provides clarity on the original model this quantization is based on.
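Taken together, the new fields extend the card's existing YAML frontmatter roughly as follows (the `tags` block is already in the card; only the last three lines are added by this PR):

```yaml
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
pipeline_tag: text-generation
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
```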
Additionally, I've added a "Sample Usage" section, directly pulling code snippets from the official GitHub repository to help users easily get started with inference. I've also clarified the paper link in the introduction with its full title.
Please review and merge if this looks good!
````diff
@@ -4,10 +4,14 @@ tags:
 - 3-bit
 - Quantization
 - Pseudo-Quantization
+pipeline_tag: text-generation
+library_name: transformers
+base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
 ---
+
 # QuantLRM-R1-Qwen3-8B-3-bit
 
-3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals
+3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals](https://www.arxiv.org/abs/2602.02581), a state-of-the-art method for quantizing large reasoning models via fine-tuning signals.
 
 ## Model Details
 
@@ -20,6 +24,7 @@ This is the pseudo-quantized model (weights are dequantized back to full-precisi
 
 - **Developed by:** Nan Zhang (njz5124@psu.edu)
 - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-0528-Qwen3-8B`
+- **Base Model:** `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
 
 ### Model Sources
 
@@ -35,7 +40,34 @@ This is the pseudo-quantized model (weights are dequantized back to full-precisi
 
 This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`.
 
+## Sample Usage
+
+To use this model, you can follow the steps below from the [QuantLRM GitHub repository](https://github.com/psunlpgroup/QuantLRM).
+
+First, compute input channel importance scores:
+
+```bash
+python compare_weight_matrix.py
+python quadratic_mapping.py  # supports processing weight updates on GPU
+```
+
+Then, run the quantization pipeline to search for the optimal scales:
 
+```bash
+python -m awq.entry --model_path /PATH/TO/LRM \
+    --w_bit 3 --q_group_size 128 --run_awq --dump_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt
+```
+
+For inference with the pseudo-quantized model using `vLLM`:
+
+```bash
+python -m awq.entry --model_path /PATH/TO/LRM \
+    --w_bit 3 --q_group_size 128 \
+    --load_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt \
+    --q_backend fake --dump_fake models/R1-Qwen3-8B-w3-g128
+
+CUDA_VISIBLE_DEVICES=0 python inference_vllm.py
+```
 
 ## Calibration Data
 
````
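The `--q_backend fake` step above is what makes this a *pseudo*-quantized checkpoint: weights are rounded to 3-bit integer levels group-wise and then immediately dequantized back to full precision, so the stored tensors are float but carry at most 8 distinct values per group. A minimal NumPy sketch of that idea (asymmetric per-group min-max quantization; an illustration of the general technique, not the repository's actual implementation):

```python
import numpy as np

def pseudo_quantize(w: np.ndarray, n_bit: int = 3, group_size: int = 128) -> np.ndarray:
    """Quantize weights to n_bit integers group-wise, then dequantize back to
    full precision: tensors stay float, but only 2**n_bit levels survive per group."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)       # one scale / zero-point per group
    g_max = groups.max(axis=1, keepdims=True)
    g_min = groups.min(axis=1, keepdims=True)
    q_max = 2 ** n_bit - 1                   # 3-bit -> integer levels 0..7
    scale = np.clip(g_max - g_min, 1e-5, None) / q_max
    zero = np.round(-g_min / scale)          # asymmetric zero-point
    q = np.clip(np.round(groups / scale) + zero, 0, q_max)
    return ((q - zero) * scale).reshape(orig_shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 128)).astype(np.float32)
w_fake = pseudo_quantize(w)

assert w_fake.shape == w.shape
assert len(np.unique(w_fake[0])) <= 8        # at most 2**3 levels per 128-weight group
```

The dequantized `w_fake` can be saved in the original checkpoint format (as `--dump_fake` does), which is why the model card notes the weights are full precision yet reflect 3-bit behavior.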