---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# ๐Ÿง  Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in DeepSeekMath, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset.

It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.

---

## ๐Ÿš€ Model Highlights

- ๐Ÿ” Fine-tuned with **GRPO**, a reinforcement learning method that replaces the learned critic with group-relative reward baselines
- ๐Ÿงฎ Specializes in **multi-step mathematical reasoning**
- ๐Ÿ’ฌ Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- ๐Ÿง  Trained with **TRL**, alongside `transformers`, `peft`, and `math_verify`
- ๐Ÿท๏ธ Useful for both instruction-following and math-heavy dialogue generation

---

## ๐Ÿ“ฆ Model Details

| Component | Description |
|------------------|--------------------------------------------------------------------------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (modules: `q_proj`, `v_proj`; rank 8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |

---

## ๐Ÿงช Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    # Time the generation pass
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem
# (English: "In a city with a population of 1 million, if 60% of the population
#  are adults and 40% of the adults work, how many workers are in the city?")
prompt_text = '''ููŠ ู…ุฏูŠู†ุฉ ูŠุจู„ุบ ุนุฏุฏ ุณูƒุงู†ู‡ุง 1 ู…ู„ูŠูˆู† ู†ุณู…ุฉุŒ ุฅุฐุง ูƒุงู† 60% ู…ู† ุงู„ุณูƒุงู† ุจุงู„ุบูŠู†ุŒ ูˆ40% ู…ู† ุงู„ุจุงู„ุบูŠู† ูŠุนู…ู„ูˆู†ุŒ ููƒู… ุนุฏุฏ ุงู„ุนุงู…ู„ูŠู† ููŠ ุงู„ู…ุฏูŠู†ุฉุŸ'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```

---

## ๐Ÿ› ๏ธ Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10

### Reward Functions

- **accuracy_reward**: validates the correctness of the final answer using `math_verify`
- **format_reward**: checks for proper use of the `<think>` and `<answer>` tags

Sketches of both reward functions and the trainer wiring are shown below.
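The exact reward implementations are not included in this card. The following is a minimal sketch in the style of the Open-R1 GRPO recipe; the tag pattern, the `solution` column name, and the conversational completion format are assumptions, not confirmed details of this run.

```python
import re
from math_verify import parse, verify

# Assumption: completions arrive in conversational format, so each completion
# is a list of chat messages and the model's text is completion[0]["content"].

def format_reward(completions, **kwargs):
    """Return 1.0 when a completion wraps its output in <think>/<answer> tags."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]

def accuracy_reward(completions, solution, **kwargs):
    """Return 1.0 when math_verify judges the completion equivalent to the gold answer."""
    rewards = []
    contents = [completion[0]["content"] for completion in completions]
    for content, sol in zip(contents, solution):  # `solution` column assumed
        gold = parse(sol)       # default extraction handles LaTeX and plain expressions
        pred = parse(content)
        rewards.append(float(verify(gold, pred)))
    return rewards
```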
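Putting the pieces together, a minimal TRL `GRPOTrainer` run with the hyperparameters and LoRA settings above might look like the following sketch; the `output_dir` and the `train` split are illustrative choices, not values from the original run.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")

training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",  # illustrative output path
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,         # short completions for the lightweight test run
    num_generations=4,                # group size used for the relative baseline
    gradient_accumulation_steps=16,
    logging_steps=10,
)

peft_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="QCRI/Fanar-1-9B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```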
### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## ๐Ÿ“Š Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|--------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |

*Note: Training was run with a small configuration for notebook-friendly experimentation.*

---

## ๐Ÿ“š Output Format

The model is trained to follow a reasoning-first format:

```
<think>
ุฃูˆู„ุงู‹ุŒ ู†ุญุณุจ 60% ู…ู† ู…ู„ูŠูˆู† ู†ุณู…ุฉุŒ ูˆู‡ูˆ 600,000. ุซู… ู†ุญุณุจ 40% ู…ู† ู‡ุฐุง ุงู„ุนุฏุฏุŒ ูˆู‡ูˆ 240,000.
</think>
<answer>
240,000
</answer>
```

(English: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000." The final answer is 240,000.)

---

## ๐Ÿ”ฌ Citations

### GRPO โ€“ DeepSeekMath

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouรฉdec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## ๐Ÿ”— Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

Happy reasoning! ๐Ÿ”โœจ