---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# 🧠 Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in the DeepSeekMath paper, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It is designed for step-by-step mathematical problem solving with structured reasoning in both English and Arabic.

---

## Model Highlights

- Fine-tuned with **GRPO**, a critic-free reinforcement learning method that scores each sampled completion relative to the other completions in its group
- 🧮 Specializes in **multi-step mathematical reasoning**
- 💬 Produces responses in a structured format using `<think>` and `<answer>` tags
- 🔧 Trained with **TRL**, alongside `transformers`, `peft`, and `math_verify`
- 🏷️ Suited to both instruction following and math-heavy dialogue generation

---

## 📦 Model Details

| Component       | Description                                                                 |
|-----------------|-----------------------------------------------------------------------------|
| **Base Model**  | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl)             |
| **Dataset**     | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format**      | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure    |
| **LoRA**        | Enabled (modules: `q_proj`, `v_proj`; rank=8)                               |
| **Epochs**      | 1 (lightweight test configuration)                                          |
| **Tokenizer**   | Same as base model                                                          |
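To give a rough sense of what the LoRA row implies, the sketch below counts the trainable parameters that rank-8 adapters add to `q_proj` and `v_proj`. It relies on the standard LoRA factorization, in which a `d_out x d_in` weight gains two low-rank matrices with `r * (d_in + d_out)` parameters in total. The layer count and projection sizes are hypothetical placeholders, not values taken from the Fanar configuration; the real ones would come from the base model's `config.json`.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    A has shape (rank x d_in) and B has shape (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical dimensions, for illustration only (not from the Fanar config).
HIDDEN_SIZE = 4096
Q_PROJ_OUT = 4096
V_PROJ_OUT = 1024
NUM_LAYERS = 32
RANK = 8  # matches the rank listed in the table above

# Each transformer layer gets one adapter on q_proj and one on v_proj.
per_layer = (
    lora_param_count(HIDDEN_SIZE, Q_PROJ_OUT, RANK)
    + lora_param_count(HIDDEN_SIZE, V_PROJ_OUT, RANK)
)
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```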

---

## 🧪 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    """Generate a completion and report wall-clock time and new-token count."""
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        # max_length caps prompt tokens plus generated tokens combined
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    # The decoded text contains the prompt followed by the completion
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens

    return generated, duration, num_generated_tokens

# Example Arabic math problem. English translation: "In a city with a population
# of 1 million, if 60% of the residents are adults and 40% of the adults work,
# how many workers are there in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```
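`generate_with_reasoning` also returns the elapsed time and the number of newly generated tokens. A small helper like the following turns them into a throughput figure. The name `tokens_per_second` is illustrative and not part of any library, and the sample numbers are made up rather than measured.

```python
def tokens_per_second(num_generated_tokens: int, duration_s: float) -> float:
    """Decoding throughput; returns 0.0 when no time elapsed to avoid dividing by zero."""
    if duration_s <= 0:
        return 0.0
    return num_generated_tokens / duration_s


# Made-up values standing in for the token count and duration returned
# by generate_with_reasoning.
example_tokens, example_seconds = 256, 8.0
print(f"{tokens_per_second(example_tokens, example_seconds):.1f} tokens/s")
```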

---

## 🛠️ Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10
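As a rough illustration of what `num_generations` controls: GRPO samples a group of completions for each prompt and, as described in the DeepSeekMath paper, normalizes each completion's reward by the group mean and standard deviation to obtain its advantage. The sketch below reproduces only that normalization for one group of 4. The reward values are invented, and both the population standard deviation and the small `eps` term are simplifying assumptions made here, not documented settings of this run.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented rewards for num_generations = 4 completions of a single prompt:
# only the second completion earned the accuracy reward.
rewards = [0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Under this normalization, a group in which every completion receives the same reward yields all-zero advantages and therefore contributes no learning signal.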

### Reward Functions

- **accuracy_reward**: validates the correctness of the answer using `math_verify`
- **format_reward**: checks for proper usage of the `<think>` and `<answer>` tags
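The format check can be pictured as a regular-expression match over each completion. The following is a minimal sketch under that assumption: the function name mirrors the list above, but the exact pattern and the 1.0/0.0 scoring are illustrative rather than the code used in training. The accuracy reward is left out because it depends on `math_verify`.

```python
import re

# Assumed pattern: a <think> block followed by an <answer> block, nothing else.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(completions):
    """Return 1.0 for each completion matching the tagged layout, else 0.0."""
    return [1.0 if FORMAT_PATTERN.match(text) else 0.0 for text in completions]


samples = [
    "<think> 60% of 1,000,000 is 600,000; 40% of that is 240,000. </think> "
    "<answer> 240,000 </answer>",
    "The answer is 240,000.",
]
print(format_reward(samples))
```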

### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss   | KL Divergence |
|------|--------------|-----------------|---------------|--------|---------------|
| 10   | 0.029        | 0.029           | 0.0           | 0.0    | 0.00024       |
| 100  | 0.039        | 0.039           | 0.0           | 0.0001 | 0.00188       |
| 200  | 0.033        | 0.033           | 0.0           | 0.0001 | 0.00183       |
| 300  | 0.045        | 0.045           | 0.0           | 0.0001 | 0.00127       |

*Note: Training was run with a small config for notebook-friendly experimentation.*

---

## Output Format

The model is trained to follow a reasoning-first format. In the example below, the Arabic reasoning reads: "First, we compute 60% of one million people, which is 600,000. Then we compute 40% of that number, which is 240,000."

```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```
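For downstream use, the final answer can be pulled out of the tagged output with a small parser. The sketch below is one possible approach, not part of the model's tooling: `extract_answer` is a hypothetical helper, and the arithmetic check simply recomputes the example (60% of 1,000,000, then 40% of that result).

```python
import re
from typing import Optional


def extract_answer(text: str) -> Optional[str]:
    """Return the stripped contents of the first <answer>...</answer> pair, or None."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None


# Placeholder output in the tagged format; the reasoning text is elided.
output = "<think> ... </think>\n<answer> 240,000 </answer>"
answer = extract_answer(output)

# Recompute the worked example with integer arithmetic.
adults = 1_000_000 * 60 // 100
workers = adults * 40 // 100

print(answer)
print(int(answer.replace(",", "")) == workers)
```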

---

## Citations

### GRPO - DeepSeekMath

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

Happy reasoning! ✨