---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-0.5B-GRPO-test
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---
# Fanar-Math-R1-GRPO
Fanar-Math-R1-GRPO is a reasoning-optimized language model built on QCRI/Fanar-1-9B-Instruct. This version is fine-tuned with Group Relative Policy Optimization (GRPO), the reinforcement learning algorithm introduced in DeepSeekMath, on the AI-MO/NuminaMath-TIR dataset. It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.
## Model Highlights

- Fine-tuned with GRPO, a sample-efficient reinforcement learning method
- Specializes in multi-step mathematical reasoning
- Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- Trained with Hugging Face TRL, together with `transformers`, `peft`, and `math_verify`
- Useful for both instruction-following and math-heavy dialogue generation
## Model Details
| Component | Description |
|---|---|
| Base Model | QCRI/Fanar-1-9B-Instruct |
| Fine-Tuning | GRPO via Hugging Face TRL |
| Dataset | AI-MO/NuminaMath-TIR |
| Format | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| LoRA | Enabled (modules: `q_proj`, `v_proj`; rank = 8) |
| Epochs | 1 (lightweight test configuration) |
| Tokenizer | Same as base model |
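
As a rough illustration, the LoRA setup in the table above corresponds to a `peft` configuration along the following lines. This is a minimal sketch: the `lora_alpha` and `lora_dropout` values are illustrative placeholders, not values taken from the actual training run.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA configuration matching the table above; alpha/dropout are illustrative placeholders.
lora_config = LoraConfig(
    r=8,                                  # rank, as listed above
    target_modules=["q_proj", "v_proj"],  # attention projections adapted during fine-tuning
    lora_alpha=16,                        # assumption, not taken from the original run
    lora_dropout=0.05,                    # assumption, not taken from the original run
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("QCRI/Fanar-1-9B-Instruct", torch_dtype="auto")
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```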
## Inference Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    # Tokenize the prompt and move it to the model's device
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    # Generate the completion and time it
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    # Decode the full sequence (prompt + completion)
    generated = tokenizer.decode(output[0], skip_special_tokens=True)

    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens
```
```python
# Example Arabic math problem
prompt = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer either in Arabic or English based on user's language. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> في مدينة يبلغ عدد سكانها 1 مليون نسمة، إذا كان 60% من السكان بالغين، و40% من البالغين يعملون، فكم عدد العاملين في المدينة؟"""
# The Arabic question asks: "In a city of 1 million people, if 60% of the population are adults
# and 40% of the adults are employed, how many workers are there in the city?"

result, time_taken, tokens = generate_with_reasoning(prompt)
print(result)
```
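
Because the model wraps its reasoning and final answer in tags, the final answer can be pulled out of the decoded text with a short regex. The helper below is a minimal sketch and not part of the released code:

```python
import re

def extract_answer(generated_text):
    """Return the content of the first <answer> ... </answer> span, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", generated_text, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_answer(result))  # e.g. "240,000" for the example problem above
```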
## Training Setup

### Configuration Summary
- learning_rate: 1e-5
- epochs: 1
- max_completion_length: 64
- num_generations: 4
- gradient_accumulation_steps: 16
- logging_steps: 10
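
In TRL terms, the summary above maps roughly onto a `GRPOConfig`; the sketch below is illustrative, and the `output_dir` is a placeholder rather than a value from the original training script.

```python
from trl import GRPOConfig

# Mirrors the configuration summary above; output_dir is an illustrative placeholder.
training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,
    num_generations=4,
    gradient_accumulation_steps=16,
    logging_steps=10,
)
```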
### Reward Functions

- `accuracy_reward`: validates the correctness of the final answer using `math_verify`
- `format_reward`: checks for proper usage of the `<think>` and `<answer>` tags
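
Below is a minimal sketch of what these two reward functions and the trainer wiring might look like, assuming the plain-text completion format that TRL passes to reward functions; the actual implementations used for this model may differ in detail.

```python
import re

from datasets import load_dataset
from math_verify import parse, verify
from trl import GRPOTrainer

def format_reward(completions, **kwargs):
    """1.0 when a completion follows the <think>...</think><answer>...</answer> format, else 0.0."""
    # Assumes plain-string completions; conversational datasets pass lists of message dicts instead.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, completion, re.DOTALL) else 0.0 for completion in completions]

def accuracy_reward(completions, solution, **kwargs):
    """1.0 when math_verify judges a completion equivalent to the reference solution, else 0.0."""
    rewards = []
    for completion, sol in zip(completions, solution):
        gold = parse(sol)
        answer = parse(completion)
        rewards.append(float(verify(gold, answer)))
    return rewards

# Wire everything together. In practice the dataset is first mapped so that each row has a
# "prompt" column carrying the <think>/<answer> system instruction shown in the inference example.
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
trainer = GRPOTrainer(
    model="QCRI/Fanar-1-9B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,       # the GRPOConfig sketched above
    train_dataset=dataset,
    peft_config=lora_config,  # the LoRA setup sketched under "Model Details"
)
trainer.train()
```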
### Libraries & Versions

- `transformers==4.47.1`
- `trl==0.14.0`
- `peft==0.14.0`
- `datasets==2.21.0`
- `math_verify==0.3.3`
- `torch==2.4.1`
## Training Metrics (Snapshot)
| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|---|---|---|---|---|---|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |
Note: Training was run with a small config for notebook-friendly experimentation.
## Output Format
The model is trained to follow a reasoning-first format:
```
<think> First, we calculate 60% of 1 million, which is 600,000. Then, 40% of that is 240,000. </think>
<answer> 240,000 </answer>
```
## Citations

GRPO (DeepSeekMath)
```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```
TRL Library
```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
  year={2020},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```
## Resources

- Model: https://huggingface.co/Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO
- Base model: https://huggingface.co/QCRI/Fanar-1-9B-Instruct
- Dataset: https://huggingface.co/datasets/AI-MO/NuminaMath-TIR
- TRL: https://github.com/huggingface/trl
## Authors

Developed and trained by Omar Paniego, adapting the DeepSeek-R1 training recipe with Hugging Face's open tools and datasets.
## License

Released under the Apache 2.0 license; refer to the license file in the repository.
## Acknowledgements

Thanks to:

- The Hugging Face science team for `trl` and `math_verify`
- AI-MO for the NuminaMath-TIR dataset
- The DeepSeek team for releasing their methodology and insights

Happy reasoning!