---
license: mit
language:
- en
tags:
- math
- llm
- 4bit
- quantize
- gptq
- qwen
- instruction
---
# Qwen2.5-Math-7B-Instruct-4bit
## Model Description
**Qwen2.5-Math-7B-Instruct-4bit** is a 4-bit quantized version of the [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct) model using GPTQ quantization (W4A16 - 4-bit weights, 16-bit activations).
This model is optimized to:
- Reduce model size by ~75% compared to the original model
- Reduce GPU memory requirements during inference
- Increase inference speed
- Maintain high accuracy for mathematical tasks
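A back-of-the-envelope estimate shows where the ~75% figure comes from, assuming ~7B parameters stored at 2 bytes each in FP16 and 0.5 bytes each at 4-bit (per-group quantization scales and zero-points add a small overhead that is ignored here):

```python
# Rough weight-memory estimate for a 7B-parameter model.
# Assumption: all parameters are quantized; scale/zero-point
# overhead and embeddings are ignored for simplicity.
params = 7e9

fp16_gb = params * 2 / 1e9    # 16-bit weights: 2 bytes each
w4_gb = params * 0.5 / 1e9    # 4-bit weights: 0.5 bytes each
reduction = 1 - w4_gb / fp16_gb

print(f"FP16: ~{fp16_gb:.1f} GB, W4: ~{w4_gb:.1f} GB, reduction: {reduction:.0%}")
```

This matches the ~14 GB to ~3.5 GB figures quoted below.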
### Model Details
- **Developed by:** Community
- **Model type:** Causal Language Model (Quantized)
- **Language(s):** English (with mathematical notation)
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization method:** GPTQ (W4A16) via LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
### Model Sources
- **Base Model:** [Qwen/Qwen2.5-Math-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-7B-Instruct)
- **Quantization Tool:** [vLLM LLM Compressor](https://docs.vllm.ai/projects/llm-compressor/)
## Uses
### Direct Use
This model is designed for direct use in mathematical and reasoning tasks, including:
- Solving arithmetic, algebra, and geometry problems
- Mathematical reasoning and proofs
- Analyzing and explaining mathematical concepts
- Educational mathematics support
### Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype="float16",  # use torch_dtype="float16" on older Transformers releases
    trust_remote_code=True,
    low_cpu_mem_usage=False,  # important for compressed models
)
# Create prompt
prompt = "<|im_start|>user\nSolve for x: 3x + 5 = 14<|im_end|>\n<|im_start|>assistant\n"
# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
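The manually built prompt above follows Qwen's ChatML format. A tiny helper (hypothetical, for illustration) makes the structure explicit; in practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the more robust route, since it uses the template shipped with the tokenizer:

```python
def build_chatml_prompt(user_message: str) -> str:
    # Qwen instruct models expect ChatML turns:
    #   <|im_start|>role\ncontent<|im_end|>
    # The trailing assistant header cues the model to answer.
    return (
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt("Solve for x: 3x + 5 = 14")
```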
### Downstream Use
This model can be further fine-tuned for specific mathematical tasks or integrated into educational applications.
### Out-of-Scope Use
This model is NOT designed for:
- Generating harmful or inappropriate content
- Use in applications requiring absolute accuracy (such as critical financial calculations)
- Tasks unrelated to mathematics or reasoning
## Bias, Risks, and Limitations
### Limitations
- The model has been quantized and may have slightly lower accuracy compared to the original model
- May encounter errors with some complex problems or edge cases
- Model was primarily trained on English data
### Recommendations
Users should:
- Verify results for important mathematical problems
- Use the original model (full precision) if maximum accuracy is required
- Understand that quantization may affect some tasks
## How to Get Started with the Model
### Installation
```bash
pip install transformers torch accelerate
```
### Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "your-username/qwen2.5-math-7b-instruct-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    dtype="float16",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
)
# Use the model
prompt = "<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Quantization Procedure
The model was quantized using:
- **Method:** GPTQ (W4A16)
- **Tool:** vLLM LLM Compressor
- **Calibration dataset:** GSM8K (256 samples)
- **Max sequence length:** 2048 tokens
- **Target layers:** All Linear layers except `lm_head`
### Quantization Hyperparameters
- **Scheme:** W4A16 (4-bit weights, 16-bit activations)
- **Block size:** 128
- **Dampening fraction:** 0.01
- **Calibration samples:** 256
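The procedure above can be sketched with LLM Compressor's one-shot API. This is a hedged reconstruction from the listed hyperparameters, not the exact script used: the import paths follow recent `llmcompressor` releases, and the dataset name, split handling, and output path are assumptions.

```python
from llmcompressor import oneshot  # `llmcompressor.transformers` on older releases
from llmcompressor.modifiers.quantization import GPTQModifier

# GPTQ W4A16: 4-bit weights, 16-bit activations; lm_head left unquantized.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)

oneshot(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    dataset="gsm8k",  # calibration data; exact split/formatting assumed
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
    output_dir="qwen2.5-math-7b-instruct-4bit",  # hypothetical path
)
```

The W4A16 scheme with group size 128 matches the block size listed above; the saved checkpoint can then be loaded as shown in the usage examples.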
## Evaluation
### Testing Data
The model was evaluated on the GSM8K test set.
### Metrics
- **Accuracy:** Measured on GSM8K test set
- **Model size:** ~3.5 GB (vs. ~14 GB for the original FP16 model)
- **Compression ratio:** ~75% reduction
- **Memory usage:** Significantly reduced compared to the original model
### Results
The compressed model maintains high accuracy for mathematical tasks while significantly reducing size and memory requirements.
## Technical Specifications
### Model Architecture
- **Base Architecture:** Qwen2.5 (Transformer-based)
- **Parameters:** 7B (quantized to 4-bit)
- **Context Length:** 8192 tokens (original model), 2048 tokens (optimized for quantization)
- **Quantization:** GPTQ W4A16
### Compute Infrastructure
#### Hardware
- **Training/Quantization:** NVIDIA RTX 3060 12GB (or equivalent)
- **Minimum Inference:** GPU with at least 8GB VRAM
#### Software
- **Quantization Tool:** vLLM LLM Compressor
- **Framework:** PyTorch, Transformers
- **Python:** >=3.12
## Citation
If you use this model, please cite:
**Base Model:**
```bibtex
@article{qwen2.5-math,
  title={Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2409.12122},
  year={2024}
}
```
**Quantization Method:**
```bibtex
@article{gptq,
  title={GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers},
  author={Frantar, Elias and Ashkboos, Saleh and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2210.17323},
  year={2022}
}
```
## Model Card Contact
To report issues or ask questions, please open an issue on the repository.
## Acknowledgments
- Qwen Team for the original Qwen2.5-Math-7B-Instruct model
- vLLM team for the LLM Compressor tool
- Hugging Face for infrastructure and support