---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---
# 🧠 Fanar-Math-R1-GRPO
**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in DeepSeekMath, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset, and is designed for step-by-step mathematical problem solving with structured reasoning in both English and Arabic.
---
## 🚀 Model Highlights
- 🔍 Fine-tuned with **GRPO**, a sample-efficient reinforcement learning method
- 🧮 Specializes in **multi-step mathematical reasoning**
- 💬 Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- 🧠 Trained using **TRL** (`transformers`, `peft`, and `math_verify`)
- 🏷️ Useful for both instruction-following and math-heavy dialogue generation
---
## 📦 Model Details
| Component | Description |
|------------------|-----------------------------------------------------------------------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (modules: `q_proj`, `v_proj`, rank=8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |
---
## 🧪 Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city of 1 million people, if 60% of the population are adults
#  and 40% of the adults work, how many workers are in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```
---
## ๐Ÿ› ๏ธ Training Setup
### Configuration Summary
- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
### Reward Functions
- **accuracy_reward**: validates correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of `<think>` and `<answer>` tags
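These rewards could be implemented along the lines of the Open-R1 project's reward functions. The sketch below assumes TRL's conversational completion format (a list of message dicts per completion); it is not the exact training code.

```python
import re

def format_reward(completions, **kwargs):
    """Return 1.0 per completion wrapped in <think>...</think> <answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]

def accuracy_reward(completions, solution, **kwargs):
    """Return 1.0 when math_verify judges the parsed answer equal to the reference."""
    from math_verify import parse, verify  # imported lazily; optional dependency
    rewards = []
    for completion, sol in zip(completions, solution):
        answer = parse(completion[0]["content"])
        rewards.append(1.0 if verify(parse(sol), answer) else 0.0)
    return rewards
```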
### Libraries & Versions
```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```
---
## 📊 Training Metrics (Snapshot)
| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|-------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001| 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001| 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001| 0.00127 |
*Note: Training was run with a small config for notebook-friendly experimentation.*
---
## 📚 Output Format
The model is trained to follow a reasoning-first format:
```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```
(Translation: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000.")
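The tagged format makes the final answer easy to separate from the reasoning trace. A small stdlib-only parser (a hypothetical helper, not part of this card's code) might look like:

```python
import re

def split_reasoning(text):
    """Return (reasoning, answer) extracted from a tagged completion."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

completion = (
    "<think> 60% of 1,000,000 is 600,000; 40% of 600,000 is 240,000. </think> "
    "<answer> 240,000 </answer>"
)
reasoning, answer = split_reasoning(completion)
print(answer)  # 240,000
```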
---
## 🔬 Citations
### GRPO – DeepSeekMath
```bibtex
@article{zhihong2024deepseekmath,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}
```
### TRL Library
```bibtex
@misc{vonwerra2022trl,
title={TRL: Transformer Reinforcement Learning},
author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
year={2022},
howpublished={\url{https://github.com/huggingface/trl}}
}
```
---
## 🔗 Resources
- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)
---
Happy reasoning! ๐Ÿ”โœจ