---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
  - generated_from_trainer
  - trl
  - grpo
  - math
  - reasoning
  - R1
license: apache-2.0
language:
  - ar
  - en
---

# 🧠 Fanar-Math-R1-GRPO

Fanar-Math-R1-GRPO is a reasoning-optimized language model built on `QCRI/Fanar-1-9B-Instruct`. It is fine-tuned with Group Relative Policy Optimization (GRPO), the reinforcement learning method introduced in DeepSeekMath, on the `AI-MO/NuminaMath-TIR` dataset, and is designed for step-by-step mathematical problem solving with structured reasoning in both English and Arabic.


## 🚀 Model Highlights

  • ๐Ÿ” Fine-tuned with GRPO, a sample-efficient reinforcement learning method
  • ๐Ÿงฎ Specializes in multi-step mathematical reasoning
  • ๐Ÿ’ฌ Outputs responses in a structured conversational format using <think> and <answer> tags
  • ๐Ÿง  Trained using TRL (transformers, peft, and math_verify)
  • ๐Ÿท๏ธ Useful for both instruction-following and math-heavy dialogue generation

## 📦 Model Details

| Component | Description |
|---|---|
| Base Model | `QCRI/Fanar-1-9B-Instruct` |
| Fine-Tuning | GRPO via Hugging Face TRL |
| Dataset | `AI-MO/NuminaMath-TIR` |
| Format | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| LoRA | Enabled (modules: `q_proj`, `v_proj`, rank = 8) |
| Epochs | 1 (lightweight test configuration) |
| Tokenizer | Same as base model |
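
For reference, the LoRA settings in the table roughly correspond to the PEFT configuration sketched below. This is a minimal sketch, not the exact config used for this model: `lora_alpha` and `lora_dropout` are assumptions, since only the rank and target modules are stated above.

```python
from peft import LoraConfig

# Sketch of a LoRA setup matching the table: rank 8 on q_proj/v_proj.
# Values not stated on this card (alpha, dropout) are assumptions.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,                      # assumption
    lora_dropout=0.05,                  # assumption
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```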

## 🧪 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    start = time.time()
    with torch.no_grad():
        # max_length counts prompt + completion tokens
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    # Decode the full sequence and collect simple generation statistics
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens

    return generated, duration, num_generated_tokens

# Example Arabic math problem:
# "In a city of 1 million people, if 60% of the population are adults and
#  40% of the adults work, how many workers are there in the city?"
prompt_text = '''في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```
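
Because the base model is instruction-tuned, you may also want to route the question through the tokenizer's chat template, as in the sketch below. This is optional and illustrative; whether a system prompt is needed to reliably elicit the `<think>`/`<answer>` format is an assumption, not something stated on this card.

```python
# Optional: query the model through its chat template instead of a raw prompt.
messages = [{"role": "user", "content": prompt_text}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    chat_output = model.generate(chat_inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the templated prompt.
print(tokenizer.decode(chat_output[0][chat_inputs.shape[1]:], skip_special_tokens=True))
```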

๐Ÿ› ๏ธ Training Setup

### Configuration Summary

- `learning_rate`: 1e-5
- `epochs`: 1
- `max_completion_length`: 64
- `num_generations`: 4
- `gradient_accumulation_steps`: 16
- `logging_steps`: 10
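
These hyperparameters map onto TRL's `GRPOConfig` roughly as in the sketch below; `output_dir` and any fields not listed above are assumptions rather than values taken from this card.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",   # assumed name, not stated on this card
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,
    num_generations=4,
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```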

### Reward Functions

- `accuracy_reward`: validates the correctness of the final answer using `math_verify`
- `format_reward`: checks for proper use of the `<think>` and `<answer>` tags
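
As an illustration, a format reward in the style expected by TRL's `GRPOTrainer` (one float per completion) could look like the sketch below. The exact functions used to train this model are not reproduced on this card, and the conversational completion structure is an assumption.

```python
import re

def format_reward(completions, **kwargs):
    """Reward 1.0 when a completion follows <think>...</think><answer>...</answer>."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    # Assumes conversational completions: a list of message lists.
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]
```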

### Libraries & Versions

```text
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```
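
Putting the pieces together, a minimal training sketch could look like the following. It reuses `lora_config`, `training_args`, and `format_reward` from the sketches above, omits the `math_verify`-based `accuracy_reward` and any prompt preprocessing, and is therefore illustrative rather than the exact recipe used for this model.

```python
from datasets import load_dataset
from trl import GRPOTrainer

# NuminaMath-TIR problems; mapping them into chat-style prompts is omitted here.
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")

trainer = GRPOTrainer(
    model="QCRI/Fanar-1-9B-Instruct",
    reward_funcs=[format_reward],      # accuracy_reward (math_verify) would be added here as well
    args=training_args,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```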

## 📊 Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|---|---|---|---|---|---|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |

Note: Training was run with a small config for notebook-friendly experimentation.


## 📚 Output Format

The model is trained to follow a reasoning-first format:

```
<think> أولاً، نحسب 60% من مليون نسمة، وهو 600,000. ثم نحسب 40% من هذا العدد، وهو 240,000. </think>
<answer> 240,000 </answer>
```

(In English: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000." Final answer: 240,000.)
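
If you need the final answer programmatically, a small helper like the one below (a hypothetical convenience, not part of the model's API) can extract the tagged spans from the decoded text:

```python
import re

def parse_reasoning(text):
    """Return (reasoning, answer) extracted from <think>/<answer> tags, or None if a tag is absent."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = parse_reasoning(result)  # `result` comes from the inference example above
print(answer)
```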

## 🔬 Citations

### GRPO – DeepSeekMath

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

## 🔗 Resources


Happy reasoning! 🔍✨