---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-0.5B-GRPO-test
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---
# Fanar-Math-R1-GRPO
**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). This version is fine-tuned using **Group Relative Policy Optimization (GRPO)** from the DeepSeekMath framework on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.
---
## Model Highlights
- Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
- Specializes in **multi-step mathematical reasoning**
- Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
- Useful for both instruction-following and math-heavy dialogue generation
---
## Model Details
| Component | Description |
|------------------|-----------------------------------------------------------------------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (modules: `q_proj`, `v_proj`, rank=8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |
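The LoRA row in the table corresponds to a PEFT configuration roughly like the following (a sketch: `lora_alpha` and `lora_dropout` are illustrative assumptions, as the card only states the target modules and rank):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # rank, as listed in the table above
    target_modules=["q_proj", "v_proj"],  # adapted attention projections
    lora_alpha=16,                        # assumption: not stated in this card
    lora_dropout=0.05,                    # assumption: not stated in this card
    task_type="CAUSAL_LM",
)
```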
---
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=1024)
    duration = time.time() - start
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city with a population of 1 million, if 60% of the residents are
#  adults and 40% of the adults work, how many workers are in the city?"
prompt_text = "في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟"

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
print(f"Generated {tokens} tokens in {time_taken:.2f}s")
```
---
## Training Setup
### Configuration Summary
- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
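In TRL, these settings map onto a `GRPOConfig` roughly as follows (a sketch: `output_dir` is an assumed placeholder, and all unlisted arguments are left at their defaults):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",   # assumption: any local path works
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,
    num_generations=4,                 # completions sampled per prompt for GRPO
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```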
### Reward Functions
- **accuracy_reward**: validates correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
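The exact reward implementations are not included in this card; a minimal sketch of a format reward consistent with the description above (the function name and signature are illustrative) might look like:

```python
import re

def format_reward(completions):
    """Return 1.0 for each completion that follows the
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
    return [1.0 if pattern.match(c.strip()) else 0.0 for c in completions]

print(format_reward([
    "<think> 60% of 1M is 600k. </think> <answer> 240,000 </answer>",
    "The answer is 240,000.",
]))  # [1.0, 0.0]
```

The accuracy reward works analogously, but delegates answer checking to `math_verify` instead of a regex.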
### Libraries & Versions
```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```
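The pinned versions above can be installed in one step (assuming a Python environment with a compatible CUDA toolchain for `torch`):

```shell
pip install "transformers==4.47.1" "trl==0.14.0" "peft==0.14.0" \
            "datasets==2.21.0" "math_verify==0.3.3" "torch==2.4.1"
```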
---
## Training Metrics (Snapshot)
| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|-------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001| 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001| 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001| 0.00127 |
*Note: Training was run with a small config for notebook-friendly experimentation.*
---
## Output Format
The model is trained to follow a reasoning-first format:
```
<think> أولاً، نحسب 60% من مليون نسمة، وهي 600,000. ثم نحسب 40% من هذا العدد، وهي 240,000. </think>
<answer> 240,000 </answer>
```
*(English: "First, compute 60% of one million, which is 600,000. Then compute 40% of that number, which is 240,000.")*
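Downstream code can recover the two fields with a small regex parser (a sketch; the helper name `parse_response` is illustrative, not part of this repository):

```python
import re

def parse_response(text):
    """Extract the reasoning and final answer from a tagged response.

    Returns (reasoning, answer); either element is None if its tag is missing.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

sample = "<think> 60% of 1,000,000 is 600,000; 40% of that is 240,000. </think>\n<answer> 240,000 </answer>"
reasoning, answer = parse_response(sample)
print(answer)  # 240,000
```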
---
## Citations
### GRPO (DeepSeekMath)
```bibtex
@article{zhihong2024deepseekmath,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}
```
### TRL Library
```bibtex
@misc{vonwerra2022trl,
title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
year={2022},
howpublished={\url{https://github.com/huggingface/trl}}
}
```
---
## Resources
- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)
---
Happy reasoning!