lunovian committed · commit 4db9fa1 · verified · parent: 1534e64

Update README.md

Files changed (1):
  1. README.md +224 -6
README.md CHANGED
@@ -1,10 +1,228 @@
  ---
  license: mit
  language:
- - en
  tags:
- - math
- - llm
- - 4bit
- - quantize
- ---
  ---
  license: mit
  language:
+ - en
  tags:
+ - math
+ - llm
+ - 4bit
+ - quantize
+ - gptq
+ - qwen
+ - instruction
+ ---
+
+ # Qwen2.5-Math-7B-Instruct-4bit
+
+ ## Model Description
+
+ **Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct), produced with GPTQ quantization (W4A16: 4-bit weights, 16-bit activations).
+
+ This model is optimized to:
+
+ - Reduce model size by ~75% compared to the original model
+ - Reduce GPU memory requirements during inference
+ - Increase inference speed
+ - Maintain high accuracy on mathematical tasks
+
+ ### Model Details
+
+ - **Developed by:** Community
+ - **Model type:** Causal language model (quantized)
+ - **Language(s):** English, mathematics
+ - **License:** MIT
+ - **Finetuned from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
+ - **Quantization method:** GPTQ (W4A16) via LLM Compressor
+ - **Calibration dataset:** GSM8K (256 samples)
+
+ ### Model Sources
+
+ - **Base Model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
+ - **Quantization Tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)
+
+ ## Uses
+
+ ### Direct Use
+
+ This model is designed for direct use on mathematical and reasoning tasks, including:
+
+ - Solving arithmetic, algebra, and geometry problems
+ - Mathematical reasoning and proofs
+ - Analyzing and explaining mathematical concepts
+ - Educational mathematics support
+
+ ### Example Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
+ tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path,
+     device_map="auto",
+     torch_dtype="float16",
+     trust_remote_code=True,
+     low_cpu_mem_usage=False,  # important for compressed models
+ )
+
+ # Build a ChatML-formatted prompt
+ prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
+
+ # Generate deterministically
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
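The hard-coded prompt above follows Qwen's ChatML-style chat template; with a Transformers tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` builds the same string for you. As a minimal illustration of the format itself, the sketch below assembles the prompt with plain Python (`build_chatml_prompt` is a hypothetical helper written for this card, not part of any library):

```python
# Illustrative helper, not part of Transformers or the model's API.
def build_chatml_prompt(messages):
    """Render a list of {role, content} messages into ChatML text,
    leaving the assistant turn open for generation."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # open the assistant turn
    return "".join(parts)

prompt = build_chatml_prompt(
    [{"role": "user", "content": "Solve for x: 3x + 5 = 14"}]
)
print(prompt)
```

Using the tokenizer's built-in chat template is preferred in practice, since it stays in sync with the model's special tokens.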
+ ### Downstream Use
+
+ This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.
+
+ ### Out-of-Scope Use
+
+ This model is NOT designed for:
+
+ - Generating harmful or inappropriate content
+ - Applications requiring guaranteed accuracy (such as critical financial calculations)
+ - Tasks unrelated to mathematics or reasoning
+
+ ## Bias, Risks, and Limitations
+
+ ### Limitations
+
+ - Quantization may slightly reduce accuracy compared to the original model
+ - The model may make errors on some complex problems or edge cases
+ - The model was primarily trained on English data
+
+ ### Recommendations
+
+ Users should:
+
+ - Verify results for important mathematical problems
+ - Use the original full-precision model if maximum accuracy is required
+ - Understand that quantization may affect some tasks
+
+ ## How to Get Started with the Model
+
+ ### Installation
+
+ ```bash
+ pip install transformers torch accelerate
+ ```
+
+ ### Quick Start
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     device_map="auto",
+     torch_dtype="float16",
+     trust_remote_code=True,
+     low_cpu_mem_usage=False,  # important for compressed models
+ )
+
+ # Use the model
+ prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=200)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Training Details
+
+ ### Quantization Procedure
+
+ The model was quantized using:
+
+ - **Method:** GPTQ (W4A16)
+ - **Tool:** vLLM LLM Compressor
+ - **Calibration dataset:** GSM8K (256 samples)
+ - **Max sequence length:** 2048 tokens
+ - **Target layers:** all Linear layers except `lm_head`
+
+ ### Quantization Hyperparameters
+
+ - **Scheme:** W4A16 (4-bit weights, 16-bit activations)
+ - **Block size:** 128
+ - **Dampening fraction:** 0.01
+ - **Calibration samples:** 256
+
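To make the W4A16 storage scheme concrete, the sketch below applies symmetric 4-bit group quantization (one scale per block of 128 weights, values clamped to [-8, 7]) to a toy weight list. This illustrates only the numeric format; it is not LLM Compressor's implementation, and GPTQ additionally uses calibration data and Hessian-based error compensation when choosing the rounded values:

```python
# Toy symmetric 4-bit group quantization (W4A16-style storage).
# Illustrative only; GPTQ adds calibration-aware rounding on top of this.
def quantize_group(weights, group_size=128):
    """Quantize a flat list of floats to int4 values in [-8, 7],
    with one scale per group of `group_size` weights."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # map max |w| to 7
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_group(q, scales, group_size=128):
    return [q[i] * scales[i // group_size] for i in range(len(q))]

weights = [0.05 * ((i * 7) % 29 - 14) for i in range(256)]  # toy weights
q, scales = quantize_group(weights)
restored = dequantize_group(q, scales)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups: {len(scales)}, max abs error: {max_err:.4f}")
```

The per-group scale bounds the rounding error by half a quantization step within each block, which is why smaller block sizes trade a little extra storage for better accuracy.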
+ ## Evaluation
+
+ ### Testing Data
+
+ The model was evaluated on the GSM8K test set.
+
+ ### Metrics
+
+ - **Accuracy:** measured on the GSM8K test set
+ - **Model size:** ~3.5 GB (vs. ~14 GB for the original FP16 model)
+ - **Compression ratio:** ~75% reduction
+ - **Memory usage:** significantly reduced compared to the original model
+
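The reported sizes are consistent with back-of-the-envelope arithmetic: 7B parameters at 16 bits each is about 14 GB, while 4-bit weights come to about 3.5 GB, ignoring the small overhead of group scales and any unquantized layers:

```python
# Back-of-the-envelope size estimate for a 7B-parameter model.
params = 7e9

fp16_gb = params * 16 / 8 / 1e9   # 16-bit weights -> bytes -> GB
int4_gb = params * 4 / 8 / 1e9    # 4-bit weights -> bytes -> GB

# One 16-bit scale per 128-weight group adds a small overhead.
scale_overhead_gb = (params / 128) * 2 / 1e9

print(f"FP16: {fp16_gb:.1f} GB")                                # 14.0 GB
print(f"W4:   {int4_gb:.1f} GB (+{scale_overhead_gb:.2f} GB scales)")
reduction = 1 - int4_gb / fp16_gb
print(f"reduction: {reduction:.0%}")                            # 75%
```

Actual on-disk size also depends on the serialization format and which layers (e.g. `lm_head`) remain unquantized.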
+ ### Results
+
+ The compressed model maintains high accuracy on mathematical tasks while significantly reducing size and memory requirements.
+
+ ## Technical Specifications
+
+ ### Model Architecture
+
+ - **Base Architecture:** Qwen2.5 (Transformer-based)
+ - **Parameters:** 7B (quantized to 4-bit)
+ - **Context Length:** 8192 tokens (base model); calibration used sequences up to 2048 tokens
+ - **Quantization:** GPTQ W4A16
+
+ ### Compute Infrastructure
+
+ #### Hardware
+
+ - **Training/Quantization:** NVIDIA RTX 3060 12GB (or equivalent)
+ - **Minimum Inference:** GPU with at least 8 GB VRAM
+
+ #### Software
+
+ - **Quantization Tool:** vLLM LLM Compressor
+ - **Framework:** PyTorch, Transformers
+ - **Python:** >=3.12
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ **Base Model:**
+
+
201
+ ```bibtex
202
+ @article{qwen2.5,
203
+ title={Qwen2.5: A Large Language Model for Mathematics},
204
+ author={Qwen Team},
205
+ year={2024}
206
+ }
207
+ ```
208
+
209
+ **Quantization Method:**
210
+
211
+ ```bibtex
212
+ @article{gptq,
213
+ title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
214
+ author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
215
+ journal={arXiv preprint arXiv:2210.17323},
216
+ year={2022}
217
+ }
218
+ ```
219
+
220
+ ## Model Card Contact
221
+
222
+ To report issues or ask questions, please open an issue on the repository.
223
+
224
+ ## Acknowledgments
225
+
226
+ - Qwen Team for the original Qwen2.5-Math-7B-Instruct model
227
+ - vLLM team for the LLM Compressor tool
228
+ - Hugging Face for infrastructure and support