---
license: mit
language:
- en
tags:
- math
- llm
- 4bit
- quantize
- gptq
- qwen
- instruction
---

# Qwen2.5-Math-7B-Instruct-4bit

## Model Description

**Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct), produced with GPTQ quantization (W4A16: 4-bit weights, 16-bit activations).

This quantization aims to:

- Reduce model size by ~75% relative to the original model
- Lower GPU memory requirements during inference
- Increase inference speed
- Preserve accuracy on mathematical tasks

### Model Details

- **Developed by:** Community
- **Model type:** Causal language model (quantized)
- **Language(s):** English
- **License:** MIT
- **Quantized from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization method:** GPTQ (W4A16) via LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)

### Model Sources

- **Base model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)

## Uses

### Direct Use

This model is intended for mathematical and reasoning tasks, including:

- Solving arithmetic, algebra, and geometry problems
- Mathematical reasoning and proofs
- Analyzing and explaining mathematical concepts
- Educational mathematics support

### Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Build a ChatML prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.

### Out-of-Scope Use

This model is NOT designed for:

- Generating harmful or inappropriate content
- Applications requiring guaranteed numerical accuracy (e.g., critical financial calculations)
- Tasks unrelated to mathematics or reasoning

## Bias, Risks, and Limitations

### Limitations

- Quantization may slightly reduce accuracy compared to the original model
- The model may fail on some complex problems or edge cases
- The model was primarily trained on English data

### Recommendations

Users should:

- Verify results for important mathematical problems
- Use the original full-precision model if maximum accuracy is required
- Be aware that quantization may affect some tasks more than others

## How to Get Started with the Model

### Installation

```bash
pip install transformers torch accelerate
```

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Quantization Procedure

The model was quantized using:

- **Method:** GPTQ (W4A16)
- **Tool:** vLLM LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
- **Max sequence length:** 2048 tokens
- **Target layers:** all `Linear` layers except `lm_head`

### Quantization Hyperparameters

- **Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Block size:** 128
- **Dampening fraction:** 0.01
- **Calibration samples:** 256

## Evaluation

### Testing Data

The model was evaluated on the GSM8K test set.

### Metrics

- **Accuracy:** measured on the GSM8K test set
- **Model size:** ~3.5 GB (vs. ~14 GB for the original model)
- **Compression ratio:** ~75% size reduction
- **Memory usage:** significantly lower than the original model

### Results

The quantized model maintains high accuracy on mathematical tasks while significantly reducing size and memory requirements.

## Technical Specifications

### Model Architecture

- **Base architecture:** Qwen2.5 (Transformer-based)
- **Parameters:** 7B (weights quantized to 4-bit)
- **Context length:** 8192 tokens (calibration used sequences of up to 2048 tokens)
- **Quantization:** GPTQ W4A16

### Compute Infrastructure

#### Hardware

- **Quantization:** NVIDIA RTX 3060 12 GB (or equivalent)
- **Minimum inference:** GPU with at least 8 GB VRAM

#### Software

- **Quantization tool:** vLLM LLM Compressor
- **Framework:** PyTorch, Transformers
- **Python:** >=3.12

## Citation

If you use this model, please cite:

**Base model:**

```bibtex
@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}
```

**Quantization method:**

```bibtex
@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}
```

## Model Card Contact

To report issues or ask questions, please open an issue on the repository.
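## Appendix: Size Estimate

The size figures reported under Evaluation follow directly from the bit widths. The snippet below is back-of-the-envelope arithmetic, not a measurement: it assumes roughly 7e9 weight parameters, one 16-bit scale per 128-weight group (matching the block size above), and ignores the embeddings and `lm_head`, which remain in 16-bit.

```python
# Rough checkpoint size: fp16 vs. W4A16 (GPTQ, group size 128).
PARAMS = 7e9  # approximate weight count for a "7B" model
GB = 1e9      # decimal gigabytes, matching the "~14 GB" figure above

fp16_gb = PARAMS * 2 / GB                      # 2 bytes per fp16 weight
bits_per_weight = 4 + 16 / 128                 # 4-bit weight + one fp16 scale per 128 weights
w4a16_gb = PARAMS * bits_per_weight / 8 / GB

print(f"fp16:   {fp16_gb:.1f} GB")             # ~14.0 GB
print(f"W4A16:  {w4a16_gb:.1f} GB")            # ~3.6 GB
print(f"saving: {1 - w4a16_gb / fp16_gb:.0%}")
```

This lands in the same range as the reported ~3.5 GB vs. ~14 GB; the exact on-disk size also depends on the 16-bit layers and per-group metadata, which this estimate omits.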
## Acknowledgments

- Qwen Team for the original Qwen2.5-Math-7B-Instruct model
- The vLLM team for the LLM Compressor tool
- Hugging Face for infrastructure and support
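## Appendix: Reproducing the Quantization (Sketch)

The settings listed under Training Details map onto LLM Compressor's one-shot API. The script below is an untested configuration sketch assembled from those settings and llm-compressor's published W4A16 examples; the import paths, argument names, and the `"gsm8k"` dataset hook may differ between llm-compressor versions, and running it requires a GPU and substantial download time.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# GPTQ W4A16 recipe: quantize all Linear layers except lm_head,
# dampening fraction 0.01, as listed in the hyperparameters above.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

oneshot(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    dataset="gsm8k",                 # calibration data
    recipe=recipe,
    max_seq_length=2048,             # calibration sequence length
    num_calibration_samples=256,
    output_dir="qwen2.5-math-7b-instruct-4bit",
)
```

Treat this as a recipe/configuration sketch rather than a turnkey script; consult the llm-compressor documentation for the exact API of your installed version.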