lunovian committed · commit 4db9fa1 · verified · parent: 1534e64

Update README.md

Files changed (1):
  1. README.md +224 -6
README.md CHANGED
@@ -1,10 +1,228 @@
  ---
  license: mit
  language:
- - en
  tags:
- - math
- - llm
- - 4bit
- - quantize
- ---
  ---
  license: mit
  language:
+ - en
  tags:
+ - math
+ - llm
+ - 4bit
+ - quantize
+ - gptq
+ - qwen
+ - instruction
+ ---
+
+ # Qwen2.5-Math-7B-Instruct-4bit
+
+ ## Model Description
+
+ **Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct), produced with GPTQ quantization (W4A16: 4-bit weights, 16-bit activations).
+
+ This model is optimized to:
+
+ - Reduce model size by ~75% compared to the original model
+ - Reduce GPU memory requirements during inference
+ - Increase inference speed
+ - Maintain high accuracy on mathematical tasks
+
+ ### Model Details
+
+ - **Developed by:** Community
+ - **Model type:** Causal language model (quantized)
+ - **Language(s):** English, mathematics
+ - **License:** MIT
+ - **Finetuned from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
+ - **Quantization method:** GPTQ (W4A16) via LLM Compressor
+ - **Calibration dataset:** GSM8K (256 samples)
+
+ ### Model Sources
+
+ - **Base Model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
+ - **Quantization Tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)
+
+ ## Uses
+
+ ### Direct Use
+
+ This model is designed for direct use on mathematical and reasoning tasks, including:
+
+ - Solving arithmetic, algebra, and geometry problems
+ - Mathematical reasoning and proofs
+ - Analyzing and explaining mathematical concepts
+ - Educational mathematics support
+
+ ### Example Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     device_map="auto",
+     torch_dtype="float16",
+     trust_remote_code=True,
+     low_cpu_mem_usage=False,  # important for compressed models
+ )
+
+ # Build a ChatML-formatted prompt
+ prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
+
+ # Generate deterministically
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
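The hard-coded prompt above follows Qwen's ChatML-style chat template; with a Transformers tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the same string for you. As a minimal illustration of the format itself, the sketch below assembles the prompt with plain Python (`build_chatml_prompt` is a hypothetical helper written for this card, not part of any library):

```python
# Illustrative helper, not part of Transformers or the model's API.
def build_chatml_prompt(messages):
    """Render a list of {role, content} messages into ChatML text,
    leaving the assistant turn open for generation."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # open the assistant turn
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Solve for x: 3x + 5 = 14"}]
)
print(prompt)
```

Using the tokenizer's built-in chat template is preferred in practice, since it stays in sync with the model's special tokens.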
+ ### Downstream Use
+
+ This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.
+
+ ### Out-of-Scope Use
+
+ This model is NOT designed for:
+
+ - Generating harmful or inappropriate content
+ - Applications requiring guaranteed accuracy (such as critical financial calculations)
+ - Tasks unrelated to mathematics or reasoning
+
+ ## Bias, Risks, and Limitations
+
+ ### Limitations
+
+ - Quantization may slightly reduce accuracy compared to the original model
+ - The model may make errors on some complex problems or edge cases
+ - The model was primarily trained on English data
+
+ ### Recommendations
+
+ Users should:
+
+ - Verify results for important mathematical problems
+ - Use the original full-precision model if maximum accuracy is required
+ - Understand that quantization may affect some tasks
+
+ ## How to Get Started with the Model
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch accelerate
+ ```
+
+ ### Quick Start
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="auto",
+     torch_dtype="float16",
+     trust_remote_code=True,
+     low_cpu_mem_usage=False,  # important for compressed models
+ )
+
+ # Use the model
+ prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=200)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Training Details
+
+ ### Quantization Procedure
+
+ The model was quantized using:
+
+ - **Method:** GPTQ (W4A16)
+ - **Tool:** vLLM LLM Compressor
+ - **Calibration dataset:** GSM8K (256 samples)
+ - **Max sequence length:** 2048 tokens
+ - **Target layers:** all Linear layers except `lm_head`
+
+ ### Quantization Hyperparameters
+
+ - **Scheme:** W4A16 (4-bit weights, 16-bit activations)
+ - **Block size:** 128
+ - **Dampening fraction:** 0.01
+ - **Calibration samples:** 256
+
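To make the W4A16 storage scheme concrete, the sketch below applies symmetric 4-bit group quantization (one scale per block of 128 weights, values clamped to [-8, 7]) to a toy weight list. This illustrates only the numeric format; it is not LLM Compressor's implementation, and GPTQ additionally uses calibration data and Hessian-based error compensation when choosing the rounded values:

```python
# Toy symmetric 4-bit group quantization (W4A16-style storage).
# Illustrative only; GPTQ adds calibration-aware rounding on top of this.
def quantize_group(weights, group_size=128):
    """Quantize a flat list of floats to int4 values in [-8, 7],
    with one scale per group of `group_size` weights."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # map max |w| to 7
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_group(q, scales, group_size=128):
    return [q[i] * scales[i // group_size] for i in range(len(q))]

weights = [0.05 * ((i * 7) % 29 - 14) for i in range(256)]  # toy weights
q, scales = quantize_group(weights)
restored = dequantize_group(q, scales)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(scales)}, max abs error: {max_err:.4f}")
```

The per-group scale bounds the rounding error by half a quantization step within each block, which is why smaller block sizes trade a little extra storage for better accuracy.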
+ ## Evaluation
+
+ ### Testing Data
+
+ The model was evaluated on the GSM8K test set.
+
+ ### Metrics
+
+ - **Accuracy:** measured on the GSM8K test set
+ - **Model size:** ~3.5 GB (vs. ~14 GB for the original FP16 model)
+ - **Compression ratio:** ~75% reduction
+ - **Memory usage:** significantly reduced compared to the original model
+
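The reported sizes are consistent with back-of-the-envelope arithmetic: 7B parameters at 16 bits each is about 14 GB, while 4-bit weights come to about 3.5 GB, ignoring the small overhead of group scales and any unquantized layers:

```python
# Back-of-the-envelope size estimate for a 7B-parameter model.
params = 7e9

fp16_gb = params * 16 / 8 / 1e9   # 16-bit weights -> bytes -> GB
int4_gb = params * 4 / 8 / 1e9    # 4-bit weights -> bytes -> GB

# One 16-bit scale per 128-weight group adds a small overhead.
scale_overhead_gb = (params / 128) * 2 / 1e9

print(f"FP16: {fp16_gb:.1f} GB")                                # 14.0 GB
print(f"W4:   {int4_gb:.1f} GB (+{scale_overhead_gb:.2f} GB scales)")
reduction = 1 - int4_gb / fp16_gb
print(f"reduction: {reduction:.0%}")                            # 75%
```

Actual on-disk size also depends on the serialization format and which layers (e.g. `lm_head`) remain unquantized.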
+ ### Results
+
+ The compressed model maintains high accuracy on mathematical tasks while significantly reducing size and memory requirements.
+
+ ## Technical Specifications
+
+ ### Model Architecture
+
+ - **Base Architecture:** Qwen2.5 (Transformer-based)
+ - **Parameters:** 7B (quantized to 4-bit)
+ - **Context Length:** 8192 tokens (base model); calibration used sequences up to 2048 tokens
+ - **Quantization:** GPTQ W4A16
+
+ ### Compute Infrastructure
+
+ #### Hardware
+
+ - **Training/Quantization:** NVIDIA RTX 3060 12GB (or equivalent)
+ - **Minimum Inference:** GPU with at least 8 GB VRAM
+
+ #### Software
+
+ - **Quantization Tool:** vLLM LLM Compressor
+ - **Framework:** PyTorch, Transformers
+ - **Python:** >=3.12
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ **Base Model:**
+
+
201
+ ```bibtex
202
+ @article{qwen2.5,
203
+ title={Qwen2.5: A Large Language Model for Mathematics},
204
+ author={Qwen Team},
205
+ year={2024}
206
+ }
207
+ ```
208
+
209
+ **Quantization Method:**
210
+
211
+ ```bibtex
212
+ @article{gptq,
213
+ title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
214
+ author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
215
+ journal={arXiv preprint arXiv:2210.17323},
216
+ year={2022}
217
+ }
218
+ ```
219
+
220
+ ## Model Card Contact
221
+
222
+ To report issues or ask questions, please open an issue on the repository.
223
+
224
+ ## Acknowledgments
225
+
226
+ - Qwen Team for the original Qwen2.5-Math-7B-Instruct model
227
+ - vLLM team for the LLM Compressor tool
228
+ - Hugging Face for infrastructure and support