---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-0.5B-GRPO-test
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-focused language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in the DeepSeekMath paper, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset. It targets step-by-step mathematical problem solving with structured reasoning in both English and Arabic.

---

## Model Highlights

- Fine-tuned with **GRPO**, a reinforcement learning method that scores each sampled completion relative to the other completions for the same prompt, so no separate value model is needed
- Specializes in **multi-step mathematical reasoning**
- Produces responses in a structured format using `<think>` and `<answer>` tags
- Trained with **TRL**, together with `transformers`, `peft`, and `math_verify`
- Useful for both instruction following and math-heavy dialogue generation

---

## Model Details

| Component | Description |
|-----------|-------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (target modules: `q_proj`, `v_proj`; rank 8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |

---

## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    """Generate a completion and report wall-clock time and new-token count."""
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        # max_length caps the prompt and the generated tokens combined
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens

    return generated, duration, num_generated_tokens

# Example Arabic math problem. In English: "In a city with a population of
# 1 million, if 60% of the residents are adults and 40% of the adults work,
# how many workers are there in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```
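
The function above also returns the elapsed time and the number of generated tokens, which together give a rough throughput figure. The helper below is a minimal sketch of that calculation; `tokens_per_second` is a hypothetical name, and the numbers passed to it are placeholders rather than measurements from this model.

```python
def tokens_per_second(num_generated_tokens, duration_seconds):
    """Return generation throughput, or 0.0 when no time was measured."""
    if duration_seconds <= 0:
        return 0.0
    return num_generated_tokens / duration_seconds

# Placeholder values standing in for `tokens` and `time_taken` from the example.
print(f"{tokens_per_second(256, 8.0):.1f} tokens/s")
```

In practice the first call often includes warm-up overhead, so averaging over several prompts gives a steadier number.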

---

## Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10

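With `num_generations` set to 4, each prompt yields a group of four sampled completions, and GRPO scores every completion against the rest of its group. The sketch below illustrates that normalization as described in the DeepSeekMath paper: reward minus group mean, divided by group standard deviation. The function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the sample rewards are all illustrative assumptions, not code or values from this training run.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its group: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for the 4 completions of one prompt (num_generations=4):
# only the second completion produced a correct answer.
rewards = [0.0, 1.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

If every completion in a group receives the same reward, all advantages come out as zero, so that prompt provides no advantage signal for the update.
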
### Reward Functions

- **accuracy_reward**: checks the correctness of the final answer using `math_verify`
- **format_reward**: checks that the completion uses the `<think>` and `<answer>` tags correctly

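As a rough illustration of the format check, the sketch below scores a completion 1.0 when it consists of a `<think>` block followed by an `<answer>` block, and 0.0 otherwise. The regular expression and the plain list-of-strings signature are assumptions made for illustration; the reward function actually used in training is not shown in this card and may differ.

```python
import re

# Assumed pattern: a <think> block, optional whitespace, then an <answer> block.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(completions):
    """Return 1.0 for each completion matching the tagged format, else 0.0."""
    return [1.0 if FORMAT_PATTERN.match(c.strip()) else 0.0 for c in completions]

samples = [
    "<think> 60% of 1,000,000 is 600,000; 40% of that is 240,000. </think>\n"
    "<answer> 240,000 </answer>",
    "The answer is 240,000.",
]
print(format_reward(samples))  # [1.0, 0.0]
```

A binary reward like this gives no partial credit, so a completion that is cut off before its closing tag scores the same as one with no tags at all.
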
### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|--------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |

*Note: Training was run with a small configuration for notebook-friendly experimentation.*

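---

## Output Format

The model is trained to follow a reasoning-first format: the reasoning goes inside `<think>` tags and the final result inside `<answer>` tags. For the Arabic problem above, the reasoning reads, in English, "First, we compute 60% of one million people, which is 600,000. Then we compute 40% of that number, which is 240,000."

```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```
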
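To consume such output programmatically, the two tagged sections can be pulled apart with a regular expression. The sketch below is one way to do it; `split_reasoning` is a hypothetical helper, not part of the model or of TRL, and it assumes the completion contains at most one well-formed pair of each tag. It also rechecks the arithmetic of the example: 1,000,000 × 0.60 × 0.40 = 240,000.

```python
import re

def split_reasoning(text):
    """Extract the <think> and <answer> contents; a missing section is None."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return m.group(1).strip() if m else None
    return grab("think"), grab("answer")

completion = (
    "<think> 60% of 1,000,000 is 600,000. 40% of that is 240,000. </think>\n"
    "<answer> 240,000 </answer>"
)
reasoning, answer = split_reasoning(completion)

# Recompute the expected result with integer arithmetic to avoid float rounding.
expected = 1_000_000 * 60 // 100 * 40 // 100
assert int(answer.replace(",", "")) == expected
print(answer)
```

Returning `None` for a missing section lets the caller distinguish a malformed completion from an empty answer.
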
---

## Citations

### GRPO (DeepSeekMath)

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

Happy reasoning!