metadata
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-0.5B-GRPO-test
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
licence: license
license: apache-2.0
language:
- ar
- en
Fanar-Math-R1-GRPO
Fanar-Math-R1-GRPO is a reasoning-optimized language model built on QCRI/Fanar-1-9B-Instruct. This version is fine-tuned on the AI-MO/NuminaMath-TIR dataset using Group Relative Policy Optimization (GRPO), the reinforcement learning method introduced in the DeepSeekMath paper. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.
Model Highlights
- Fine-tuned with GRPO, a sample-efficient reinforcement learning method
- Specializes in multi-step mathematical reasoning
- Outputs responses in a structured conversational format using <think> and <answer> tags
- Trained using TRL (with transformers, peft, and math_verify)
- Useful for both instruction-following and math-heavy dialogue generation
Model Details
| Component | Description |
|---|---|
| Base Model | QCRI/Fanar-1-9B-Instruct |
| Fine-Tuning | GRPO via Hugging Face TRL |
| Dataset | AI-MO/NuminaMath-TIR |
| Format | <think> ... </think> <answer> ... </answer> tagged reasoning structure |
| LoRA | Enabled (modules: q_proj, v_proj, rank=8) |
| Epochs | 1 (lightweight test configuration) |
| Tokenizer | Same as base model |
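To give a sense of why the LoRA row above amounts to a lightweight configuration, here is a minimal sketch of the parameter count that rank-8 adapters on q_proj and v_proj add per transformer layer. The layer widths below are hypothetical values chosen for illustration and are not read from the Fanar configuration; only the rank and the module names come from the table.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer mapping d_in -> d_out.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer sizes for illustration only (NOT taken from the model config).
hidden_size = 4096
q_out = 4096  # assumed q_proj output width
v_out = 1024  # assumed v_proj output width (narrower under grouped-query attention)
rank = 8      # rank listed in the table above

# Adapter parameters added to one transformer layer (q_proj + v_proj).
per_layer = (
    lora_param_count(hidden_size, q_out, rank)
    + lora_param_count(hidden_size, v_out, rank)
)
# Size of the frozen q_proj and v_proj weights in the same layer.
full_per_layer = hidden_size * q_out + hidden_size * v_out

print(per_layer)
print(per_layer / full_per_layer)
```

Under these assumed widths the adapters are a small fraction of the frozen projection weights; the real ratio depends on the base model's actual dimensions.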
Inference Example
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    # Count only the newly generated tokens, excluding the prompt
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem. English translation: "In a city with a population
# of 1 million, if 60% of the residents are adults and 40% of the adults are
# employed, how many employed people are there in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
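The inference helper above returns a duration and a generated-token count that the example never uses. The sketch below shows one way to turn them into a throughput figure, and makes explicit that `max_length` in `generate` bounds prompt plus generated tokens, so a long prompt leaves less room for the answer. Both helper names are hypothetical and not part of the model card.

```python
def generation_budget(num_input_tokens: int, max_length: int) -> int:
    """Tokens left for generation when max_length covers prompt + completion."""
    return max(max_length - num_input_tokens, 0)


def tokens_per_second(num_generated_tokens: int, duration_seconds: float) -> float:
    """Decoding throughput; returns 0.0 for a non-positive duration."""
    if duration_seconds <= 0:
        return 0.0
    return num_generated_tokens / duration_seconds


# Illustrative numbers only, not measurements from this model.
print(generation_budget(num_input_tokens=100, max_length=1024))
print(tokens_per_second(num_generated_tokens=256, duration_seconds=8.0))
```

If the prompt length varies a lot, passing `max_new_tokens` to `generate` instead bounds only the completion.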
Training Setup
Configuration Summary
- learning_rate: 1e-5
- epochs: 1
- max_completion_length: 64
- num_generations: 4
- gradient_accumulation_steps: 16
- logging_steps: 10
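To illustrate what num_generations: 4 means for GRPO, the sketch below computes group-relative advantages for one prompt: four completions are sampled, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is a simplified, standalone illustration, not the TRL implementation; the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and spread.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, num_generations = 4: only the first completion was correct.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If all four completions score the same, every advantage is zero and the
# group contributes no learning signal.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why sparse rewards slow training: when no completion in a group earns a reward, the group's advantages vanish.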
Reward Functions
- accuracy_reward: validates correctness of the answer using math_verify
- format_reward: checks for proper usage of <think> and <answer> tags
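The two rewards can be sketched roughly as follows. This is a simplified stand-in under stated assumptions, not the training code: the regular expressions are guesses at what "proper usage" of the tags means, and the accuracy check compares normalized strings, whereas the card states that the real reward uses math_verify for correctness.

```python
import re

# Assumed format: a <think> block followed by an <answer> block, nothing else.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the tagged structure, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0


def _normalize(answer: str) -> str:
    # Crude normalization: drop surrounding whitespace, spaces, and
    # thousands separators.
    return answer.strip().replace(",", "").replace(" ", "")


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the tagged answer matches the reference after normalization.

    Stand-in for a math_verify-based check (string equality only).
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if _normalize(match.group(1)) == _normalize(reference) else 0.0


good = "<think> 60% of 1,000,000 is 600,000; 40% of that is 240,000. </think>\n<answer> 240,000 </answer>"
print(format_reward(good), accuracy_reward(good, "240000"))
print(format_reward("240000"), accuracy_reward("240000", "240000"))
```

A string comparison would miss equivalent forms such as 1/2 and 0.5, which is the gap a symbolic verifier is meant to close.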
Libraries & Versions
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
Training Metrics (Snapshot)
| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|---|---|---|---|---|---|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |
Note: Training was run with a small config for notebook-friendly experimentation.
Output Format
The model is trained to follow a reasoning-first format:
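<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
English translation of the reasoning: "First, we compute 60% of one million people, which is 600,000. Then we compute 40% of that number, which is 240,000."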
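Downstream code usually wants the reasoning and the final answer separately. The sketch below is one way to split a tagged response and check the sample's arithmetic (60% of 1,000,000, then 40% of that); `split_reasoning` is a hypothetical helper and the sample string is an English paraphrase of the output above.

```python
import re


def split_reasoning(text: str):
    """Return (reasoning, answer); either is None if its tag pair is missing."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )


sample = (
    "<think> First, 60% of 1,000,000 is 600,000. "
    "Then 40% of 600,000 is 240,000. </think>\n"
    "<answer> 240,000 </answer>"
)
reasoning, answer = split_reasoning(sample)

# Recompute the expected value with integer arithmetic to avoid float rounding.
adults = 1_000_000 * 60 // 100
employed = adults * 40 // 100

print(reasoning)
print(answer, int(answer.replace(",", "")) == employed)
```

Returning None for a missing tag lets callers distinguish a malformed response from an empty answer.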
Citations
GRPO: DeepSeekMath
@article{zhihong2024deepseekmath,
title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
journal={arXiv preprint arXiv:2402.03300},
year={2024}
}
TRL Library
@misc{vonwerra2022trl,
title={TRL: Transformer Reinforcement Learning},
author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
year={2022},
howpublished={\url{https://github.com/huggingface/trl}}
}
Happy reasoning!