---
license: mit
language:
- en
tags:
- math
- llm
- 4bit
- quantize
- gptq
- qwen
- instruction
---

# Qwen2.5-Math-7B-Instruct-4bit

## Model Description

**Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct), produced with GPTQ quantization (W4A16: 4-bit weights, 16-bit activations).

This quantization aims to:

- Reduce model size by ~75% relative to the original model
- Lower GPU memory requirements during inference
- Increase inference speed
- Preserve accuracy on mathematical tasks

### Model Details

- **Developed by:** Community
- **Model type:** Causal language model (quantized)
- **Language(s):** English
- **License:** MIT
- **Quantized from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization method:** GPTQ (W4A16) via LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)

### Model Sources

- **Base model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)

## Uses

### Direct Use

This model is intended for mathematical and reasoning tasks, including:

- Solving arithmetic, algebra, and geometry problems
- Mathematical reasoning and proofs
- Analyzing and explaining mathematical concepts
- Educational mathematics support

### Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your-username/qwen2.5-math-7b-instruct-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # Important for compressed models
)

# Build a ChatML prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Downstream Use

This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.

### Out-of-Scope Use

This model is NOT designed for:

- Generating harmful or inappropriate content
- Applications requiring guaranteed numerical accuracy (e.g., critical financial calculations)
- Tasks unrelated to mathematics or reasoning

## Bias, Risks, and Limitations

### Limitations

- Quantization may slightly reduce accuracy compared to the original model
- The model may fail on some complex problems or edge cases
- The model was primarily trained on English data

### Recommendations

Users should:

- Verify results for important mathematical problems
- Use the original full-precision model if maximum accuracy is required
- Be aware that quantization may affect some tasks more than others

## How to Get Started with the Model

### Installation

```bash
pip install transformers torch accelerate
```

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-username/qwen2.5-math-7b-instruct-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)

# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Quantization Procedure

The model was quantized using:

- **Method:** GPTQ (W4A16)
- **Tool:** vLLM LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
- **Max sequence length:** 2048 tokens
- **Target layers:** all `Linear` layers except `lm_head`

### Quantization Hyperparameters

- **Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Block size:** 128
- **Dampening fraction:** 0.01
- **Calibration samples:** 256

## Evaluation

### Testing Data

The model was evaluated on the GSM8K test set.

### Metrics

- **Accuracy:** measured on the GSM8K test set
- **Model size:** ~3.5 GB (vs. ~14 GB for the original model)
- **Compression ratio:** ~75% size reduction
- **Memory usage:** significantly lower than the original model

### Results

The quantized model maintains high accuracy on mathematical tasks while significantly reducing size and memory requirements.

## Technical Specifications

### Model Architecture

- **Base architecture:** Qwen2.5 (Transformer-based)
- **Parameters:** 7B (weights quantized to 4-bit)
- **Context length:** 8192 tokens (calibration used sequences of up to 2048 tokens)
- **Quantization:** GPTQ W4A16

### Compute Infrastructure

#### Hardware

- **Quantization:** NVIDIA RTX 3060 12 GB (or equivalent)
- **Minimum inference:** GPU with at least 8 GB VRAM

#### Software

- **Quantization tool:** vLLM LLM Compressor
- **Framework:** PyTorch, Transformers
- **Python:** >=3.12

## Citation

If you use this model, please cite:

**Base model:**

```bibtex
@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}
```

**Quantization method:**

```bibtex
@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}
```

## Model Card Contact

To report issues or ask questions, please open an issue on the repository.
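## Appendix: Size Estimate

The size figures reported under Evaluation follow directly from the bit widths. The snippet below is back-of-the-envelope arithmetic, not a measurement: it assumes roughly 7e9 weight parameters, one 16-bit scale per 128-weight group (matching the block size above), and ignores the embeddings and `lm_head`, which remain in 16-bit.

```python
# Rough checkpoint size: fp16 vs. W4A16 (GPTQ, group size 128).
PARAMS = 7e9  # approximate weight count for a "7B" model
GB = 1e9      # decimal gigabytes, matching the "~14 GB" figure above

fp16_gb = PARAMS * 2 / GB                      # 2 bytes per fp16 weight
bits_per_weight = 4 + 16 / 128                 # 4-bit weight + one fp16 scale per 128 weights
w4a16_gb = PARAMS * bits_per_weight / 8 / GB

print(f"fp16:   {fp16_gb:.1f} GB")             # ~14.0 GB
print(f"W4A16:  {w4a16_gb:.1f} GB")            # ~3.6 GB
print(f"saving: {1 - w4a16_gb / fp16_gb:.0%}")
```

This lands in the same range as the reported ~3.5 GB vs. ~14 GB; the exact on-disk size also depends on the 16-bit layers and per-group metadata, which this estimate omits.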
## Acknowledgments

- Qwen Team for the original Qwen2.5-Math-7B-Instruct model
- The vLLM team for the LLM Compressor tool
- Hugging Face for infrastructure and support
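## Appendix: Reproducing the Quantization (Sketch)

The settings listed under Training Details map onto LLM Compressor's one-shot API. The script below is an untested configuration sketch assembled from those settings and llm-compressor's published W4A16 examples; the import paths, argument names, and the `"gsm8k"` dataset hook may differ between llm-compressor versions, and running it requires a GPU and substantial download time.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# GPTQ W4A16 recipe: quantize all Linear layers except lm_head,
# dampening fraction 0.01, as listed in the hyperparameters above.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

oneshot(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    dataset="gsm8k",                 # calibration data
    recipe=recipe,
    max_seq_length=2048,             # calibration sequence length
    num_calibration_samples=256,
    output_dir="qwen2.5-math-7b-instruct-4bit",
)
```

Treat this as a recipe/configuration sketch rather than a turnkey script; consult the llm-compressor documentation for the exact API of your installed version.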