nielsr HF Staff committed on
Commit 2a48d01 · verified · 1 Parent(s): c1d1f5b

Add metadata, sample usage, and improve model details


Hi, I'm Niels from the community science team at Hugging Face.

This PR improves the model card by adding key metadata to enhance discoverability and user experience on the Hub:
- `pipeline_tag: text-generation`: Ensures the model appears in relevant searches.
- `library_name: transformers`: Enables the automated "Use in Transformers" code snippet button.
- `base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`: Provides clarity on the original model this quantization is based on.
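These three keys live at the top level of the README's YAML front matter (between the `---` markers), alongside the existing `tags` list. As a minimal sketch of what the Hub expects structurally, the snippet below checks a flat front-matter block for the required keys — `parse_front_matter` is a hypothetical stdlib-only helper written for illustration, not part of any Hugging Face library:

```python
# Sketch: verify required Hub metadata keys in a flat "key: value"
# front-matter block. `parse_front_matter` is a hypothetical helper.
FRONT_MATTER = """\
tags:
- 3-bit
- Quantization
- Pseudo-Quantization
pipeline_tag: text-generation
library_name: transformers
base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
"""

REQUIRED = ("pipeline_tag", "library_name", "base_model")

def parse_front_matter(text: str) -> dict:
    """Parse top-level 'key: value' pairs, skipping YAML list items."""
    meta = {}
    for line in text.splitlines():
        if line.startswith("- ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

meta = parse_front_matter(FRONT_MATTER)
missing = [k for k in REQUIRED if not meta.get(k)]
print(missing)  # → []
```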

Additionally, I've added a "Sample Usage" section, directly pulling code snippets from the official GitHub repository to help users easily get started with inference. I've also clarified the paper link in the introduction with its full title.

Please review and merge if this looks good!

Files changed (1)
  1. README.md +33 -1
README.md CHANGED

````diff
@@ -4,10 +4,14 @@ tags:
 - 3-bit
 - Quantization
 - Pseudo-Quantization
+pipeline_tag: text-generation
+library_name: transformers
+base_model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
 ---
+
 # QuantLRM-R1-Qwen3-8B-3-bit
 
-3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals
+3-bit quantized `DeepSeek-R1-0528-Qwen3-8B` based on [QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals](https://www.arxiv.org/abs/2602.02581), a state-of-the-art quantization method of large reasoning models via fine-tuning signals.
 
 ## Model Details
 
@@ -20,6 +24,7 @@ This is the pseudo-quantized model (weights are dequantized back to full-precisi
 
 - **Developed by:** Nan Zhang (njz5124@psu.edu)
 - **Model type:** 3-bit pseudo-quantized version of `DeepSeek-R1-0528-Qwen3-8B`
+- **Base Model:** `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
 
 ### Model Sources
 
@@ -35,7 +40,34 @@ This is the pseudo-quantized model (weights are dequantized back to full-precisi
 
 This model is designed to be used with `vLLM` due to its inference optimization. Please use the tokenizer of `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`.
 
+## Sample Usage
+
+To use this model, you can follow the steps below from the [QuantLRM GitHub repository](https://github.com/psunlpgroup/QuantLRM).
+
+First, compute input channel importance scores:
+
+```bash
+python compare_weight_matrix.py
+python quadratic_mapping.py # supports processing weight updates on GPU
+```
+
+Then, run the quantization pipeline to search for the optimal scales:
 
+```bash
+python -m awq.entry --model_path /PATH/TO/LRM \
+    --w_bit 3 --q_group_size 128 --run_awq --dump_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt
+```
+
+For inference with the pseudo-quantized model using `vLLM`:
+
+```bash
+python -m awq.entry --model_path /PATH/TO/LRM \
+    --w_bit 3 --q_group_size 128 \
+    --load_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt \
+    --q_backend fake --dump_fake models/R1-Qwen3-8B-w3-g128
+
+CUDA_VISIBLE_DEVICES=0 python inference_vllm.py
+```
 
 ## Calibration Data
````
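Since the card added above describes a pseudo-quantized (fake-quantized) model — weights are mapped to 3-bit integer codes and immediately dequantized back to full precision — here is a minimal, dependency-free sketch of per-group asymmetric fake quantization in general. This illustrates the round-trip idea only; it is not the QuantLRM/AWQ implementation, and the function name and parameters are chosen for this example:

```python
def fake_quantize(weights, n_bits=3, group_size=128):
    """Per-group asymmetric fake quantization: map each weight to an
    n-bit integer code, then immediately dequantize back to a float."""
    qmax = (1 << n_bits) - 1  # 7 for 3-bit: codes 0..7
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / qmax or 1.0  # guard against constant groups
        for w in group:
            q = round((w - lo) / scale)  # integer code in [0, qmax]
            out.append(q * scale + lo)   # dequantize back to float
    return out

ws = [0.5, -1.25, 3.0, 0.0]
deq = fake_quantize(ws, n_bits=3, group_size=4)
```

The per-group min and max are reproduced exactly, and every other weight lands within half a quantization step (`scale / 2`) of its original value — the precision that a real 3-bit, group-size-128 deployment would also be limited to.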