---
license: mit
language:
- en
tags:
- math
- llm
- 4bit
- quantize
- gptq
- qwen
- instruction
---

# Qwen2.5-Math-7B-Instruct-4bit
|
## Model Description

**Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of the [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) model, produced with GPTQ quantization (W4A16: 4-bit weights, 16-bit activations).

This model is optimized to:

- Reduce model size by ~75% compared to the original model
- Reduce GPU memory requirements during inference
- Increase inference speed
- Maintain high accuracy on mathematical tasks
|
### Model Details

- **Developed by:** Community
- **Model type:** Causal Language Model (quantized)
- **Language(s):** English (mathematical domain)
- **License:** MIT
- **Quantized from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization method:** GPTQ (W4A16) via LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
|
### Model Sources

- **Base Model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization Tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)
|
## Uses

### Direct Use

This model is designed for direct use in mathematical and reasoning tasks, including:

- Solving arithmetic, algebra, and geometry problems
- Mathematical reasoning and proofs
- Analyzing and explaining mathematical concepts
- Educational mathematics support
|
### Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Build a ChatML-style prompt (the format Qwen instruct models expect)
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate deterministically (greedy decoding)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
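Instead of hand-writing the ChatML markers, you can let the tokenizer build the prompt from the chat template that ships with the base model. A minimal sketch, reusing `tokenizer` from above:

```python
# Build the prompt via the bundled chat template; equivalent to the
# hand-written ChatML string above, and less error-prone
messages = [{"role": "user", "content": "Solve for x: 3x + 5 = 14"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```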
|
### Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.

### Out-of-Scope Use

This model is NOT designed for:

- Generating harmful or inappropriate content
- Use in applications requiring absolute accuracy (such as critical financial calculations)
- Tasks unrelated to mathematics or reasoning
|
## Bias, Risks, and Limitations

### Limitations

- Quantization may cost some accuracy relative to the full-precision original
- The model may fail on complex problems or edge cases
- The base model was trained primarily on English data

### Recommendations

Users should:

- Verify results for important mathematical problems
- Use the original full-precision model when maximum accuracy is required
- Understand that quantization may affect some tasks
|
## How to Get Started with the Model

### Installation

```bash
pip install transformers torch accelerate
```
|
### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
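Because the checkpoint was produced with LLM Compressor, it can also be served with vLLM, which reads compressed-tensors checkpoints directly. A minimal sketch, assuming vLLM is installed (`pip install vllm`):

```python
from vllm import LLM, SamplingParams

# vLLM picks up the quantization config from the checkpoint automatically
llm = LLM(model="your-username/qwen2.5-math-7b-instruct-4bit")
params = SamplingParams(temperature=0.0, max_tokens=200)

prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```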
|
## Training Details

### Quantization Procedure

The model was quantized using:

- **Method:** GPTQ (W4A16)
- **Tool:** vLLM LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
- **Max sequence length:** 2048 tokens
- **Target layers:** all Linear layers except `lm_head`

### Quantization Hyperparameters

- **Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Block size:** 128
- **Dampening fraction:** 0.01
- **Calibration samples:** 256

An illustrative recipe putting these settings together is sketched below.
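The following is a minimal sketch of such a recipe using llm-compressor's `oneshot` API. It is illustrative rather than the exact script used here; the argument names follow llm-compressor's published GPTQ examples, and the `gsm8k` calibration dataset name and output path are assumptions:

```python
# pip install llmcompressor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize every Linear layer...
    ignore=["lm_head"],    # ...except the output head
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    block_size=128,
    dampening_frac=0.01,
)

oneshot(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    dataset="gsm8k",                    # assumed calibration set name
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir="qwen2.5-math-7b-instruct-4bit",  # hypothetical path
)
```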
|
## Evaluation

### Testing Data

The model was evaluated on the GSM8K test set.

### Metrics

- **Accuracy:** answer accuracy on the GSM8K test set
- **Model size:** ~3.5 GB (vs. ~14 GB for the original FP16 model)
- **Compression ratio:** ~75% size reduction
- **Memory usage:** significantly lower than the original model
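One common way to reproduce a GSM8K evaluation is EleutherAI's lm-evaluation-harness; this is an assumed workflow, not necessarily the one used for the results below:

```bash
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=your-username/qwen2.5-math-7b-instruct-4bit,dtype=float16 \
  --tasks gsm8k \
  --batch_size 8
```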
|
### Results

The compressed model maintains high accuracy on mathematical tasks while significantly reducing size and memory requirements.
|
## Technical Specifications

### Model Architecture

- **Base Architecture:** Qwen2.5 (Transformer-based)
- **Parameters:** 7B (weights quantized to 4-bit)
- **Context Length:** 8192 tokens (base model); calibration used sequences of up to 2048 tokens
- **Quantization:** GPTQ W4A16
|
### Compute Infrastructure

#### Hardware

- **Quantization:** NVIDIA RTX 3060 12GB (or equivalent)
- **Minimum Inference:** GPU with at least 8GB VRAM (7B parameters at ~0.5 bytes each is ~3.5 GB of weights, plus KV cache and activation overhead)

#### Software

- **Quantization Tool:** vLLM LLM Compressor
- **Framework:** PyTorch, Transformers
- **Python:** >=3.12
|
## Citation

If you use this model, please cite:

**Base Model:**

```bibtex
@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Yang, An and Zhang, Beichen and Hui, Binyuan and others},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}
```
|
**Quantization Method:**

```bibtex
@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}
```
|
## Model Card Contact

To report issues or ask questions, please open an issue on the repository.

## Acknowledgments

- Qwen Team for the original Qwen2.5-Math-7B-Instruct model
- vLLM team for the LLM Compressor tool
- Hugging Face for infrastructure and support