---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---
# 🧠 Fanar-Math-R1-GRPO
**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in DeepSeekMath, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset, and is designed for step-by-step mathematical problem solving with structured reasoning in both English and Arabic.
---
## 🚀 Model Highlights
- 🔍 Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
- 🧮 Specializes in **multi-step mathematical reasoning**
- 💬 Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- 🧠 Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
- 🏷️ Useful for both instruction-following and math-heavy dialogue generation
---
## 📦 Model Details
| Component | Description |
|------------------|-----------------------------------------------------------------------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (modules: `q_proj`, `v_proj`, rank=8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |
---
## 🧪 Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city of 1 million people, if 60% of the population are adults
#  and 40% of the adults work, how many workers are in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```
---
## ๐Ÿ› ๏ธ Training Setup
### Configuration Summary
- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
### Reward Functions
- **accuracy_reward**: validates correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
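These rewards could be implemented along the lines of the Open-R1 project's reward functions. The sketch below assumes TRL's conversational completion format (a list of message dicts per completion); it is not the exact training code.

```python
import re

def format_reward(completions, **kwargs):
    """Return 1.0 per completion wrapped in <think>...</think> <answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]

def accuracy_reward(completions, solution, **kwargs):
    """Return 1.0 when math_verify judges the parsed answer equal to the reference."""
    from math_verify import parse, verify  # imported lazily; optional dependency
    rewards = []
    for completion, sol in zip(completions, solution):
        answer = parse(completion[0]["content"])
        rewards.append(1.0 if verify(parse(sol), answer) else 0.0)
    return rewards
```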
### Libraries & Versions
```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```
---
## 📊 Training Metrics (Snapshot)
| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|-------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001| 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001| 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001| 0.00127 |
*Note: Training was run with a small config for notebook-friendly experimentation.*
---
## 📚 Output Format
The model is trained to follow a reasoning-first format:
```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```
(Translation: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000.")
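The tagged format makes the final answer easy to separate from the reasoning trace. A small stdlib-only parser (a hypothetical helper, not part of this card's code) might look like:

```python
import re

def split_reasoning(text):
    """Return (reasoning, answer) extracted from a tagged completion."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

completion = (
    "<think> 60% of 1,000,000 is 600,000; 40% of 600,000 is 240,000. </think> "
    "<answer> 240,000 </answer>"
)
reasoning, answer = split_reasoning(completion)
print(answer)  # 240,000
```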
---
## 🔬 Citations
### GRPO – DeepSeekMath
```bibtex
@article{zhihong2024deepseekmath,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}
```
### TRL Library
```bibtex
@misc{vonwerra2022trl,
title={TRL: Transformer Reinforcement Learning},
author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
year={2022},
howpublished={\url{https://github.com/huggingface/trl}}
}
```
---
## 🔗 Resources
- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)
---
Happy reasoning! ๐Ÿ”โœจ