---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-0.5B-GRPO-test
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# 🧠 Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). This version is fine-tuned using **Group Relative Policy Optimization (GRPO)** from the DeepSeekMath framework on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.

---

## 🚀 Model Highlights

- ๐Ÿ” Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
- ๐Ÿงฎ Specializes in **multi-step mathematical reasoning**
- ๐Ÿ’ฌ Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- ๐Ÿง  Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
- ๐Ÿท๏ธ Useful for both instruction-following and math-heavy dialogue generation

---

## 📦 Model Details

| Component        | Description                                                                 |
|------------------|-----------------------------------------------------------------------------|
| **Base Model**   | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning**  | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl)              |
| **Dataset**      | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format**       | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure     |
| **LoRA**         | Enabled (modules: `q_proj`, `v_proj`, rank=8)                               |
| **Epochs**       | 1 (lightweight test configuration)                                           |
| **Tokenizer**    | Same as base model                                                           |

---

## 🧪 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens

    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city of 1 million people, if 60% of the population are adults
#  and 40% of the adults work, how many workers are in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```

---

## ๐Ÿ› ๏ธ Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
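
Under `trl==0.14.0`, the hyperparameters above map onto a `GRPOConfig` along these lines (a sketch; argument names are from the TRL API, and any value not listed above is left at its default):

```python
from trl import GRPOConfig

# GRPO training arguments matching the configuration summary above.
training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",   # hypothetical output path
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,          # max tokens per sampled completion
    num_generations=4,                 # completions sampled per prompt (the GRPO "group")
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```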

### Reward Functions

- **accuracy_reward**: validates correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
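
As a rough illustration, the two reward functions can be sketched in plain Python as below. These are hypothetical re-implementations, not the training code: the real `accuracy_reward` uses `math_verify` for symbolic answer checking rather than string comparison.

```python
import re

# Matches a completion that is exactly a <think> block followed by an <answer> block.
THINK_ANSWER_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the <think>/<answer> template, else 0.0."""
    return 1.0 if THINK_ANSWER_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the text inside <answer> matches the gold answer.
    (Sketch only: the actual setup checks mathematical equivalence via math_verify.)"""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0
```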

### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## 📊 Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss  | KL Divergence |
|------|--------------|-----------------|---------------|-------|---------------|
| 10   | 0.029        | 0.029           | 0.0           | 0.0   | 0.00024       |
| 100  | 0.039        | 0.039           | 0.0           | 0.0001| 0.00188       |
| 200  | 0.033        | 0.033           | 0.0           | 0.0001| 0.00183       |
| 300  | 0.045        | 0.045           | 0.0           | 0.0001| 0.00127       |

*Note: Training was run with a small config for notebook-friendly experimentation.*

---

## 📚 Output Format

The model is trained to follow a reasoning-first format:

```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```

(In English: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000.")
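
Because the format is fixed, downstream code can separate the reasoning trace from the final answer with a small parser. A minimal sketch (the helper name is ours, not part of the model's API):

```python
import re

def split_reasoning(text: str):
    """Split a model completion into its <think> and <answer> parts.
    Returns (reasoning, answer); either element is None if its tag is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )
```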

---

## 🔬 Citations

### GRPO โ€“ DeepSeekMath

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## 🔗 Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

Happy reasoning! 🔍✨