---
base_model: QCRI/Fanar-1-9B-Instruct
datasets: AI-MO/NuminaMath-TIR
library_name: transformers
model_name: Fanar-Math-R1-GRPO
tags:
- generated_from_trainer
- trl
- grpo
- math
- reasoning
- R1
license: apache-2.0
language:
- ar
- en
---

# ๐Ÿง  Fanar-Math-R1-GRPO

**Fanar-Math-R1-GRPO** is a reasoning-optimized language model built on [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct). It is fine-tuned with **Group Relative Policy Optimization (GRPO)**, the reinforcement learning method introduced in DeepSeekMath, on the [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) dataset.

It is designed for step-by-step mathematical problem-solving with structured reasoning in both English and Arabic.

---

## ๐Ÿš€ Model Highlights

- ๐Ÿ” Fine-tuned with **GRPO**, a reinforcement learning method that replaces the learned critic with group-relative reward baselines
- ๐Ÿงฎ Specializes in **multi-step mathematical reasoning**
- ๐Ÿ’ฌ Outputs responses in a structured conversational format using `<think>` and `<answer>` tags
- ๐Ÿง  Trained with **TRL**, alongside `transformers`, `peft`, and `math_verify`
- ๐Ÿท๏ธ Useful for both instruction-following and math-heavy dialogue generation

---

## ๐Ÿ“ฆ Model Details

| Component | Description |
|------------------|--------------------------------------------------------------------------------|
| **Base Model** | [`QCRI/Fanar-1-9B-Instruct`](https://huggingface.co/QCRI/Fanar-1-9B-Instruct) |
| **Fine-Tuning** | GRPO via Hugging Face [TRL](https://github.com/huggingface/trl) |
| **Dataset** | [`AI-MO/NuminaMath-TIR`](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
| **Format** | `<think> ... </think> <answer> ... </answer>` tagged reasoning structure |
| **LoRA** | Enabled (modules: `q_proj`, `v_proj`; rank 8) |
| **Epochs** | 1 (lightweight test configuration) |
| **Tokenizer** | Same as base model |

---

## ๐Ÿงช Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_id = "Omartificial-Intelligence-Space/Fanar-Math-R1-GRPO"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def generate_with_reasoning(prompt_text):
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)

    # Time the generation pass
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_length=1024)
    end = time.time()

    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    duration = end - start
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output.shape[1] - num_input_tokens
    return generated, duration, num_generated_tokens

# Example Arabic math problem
# (English: "In a city with a population of 1 million, if 60% of the population
#  are adults and 40% of the adults work, how many workers are in the city?")
prompt_text = '''ููŠ ู…ุฏูŠู†ุฉ ูŠุจู„ุบ ุนุฏุฏ ุณูƒุงู†ู‡ุง 1 ู…ู„ูŠูˆู† ู†ุณู…ุฉุŒ ุฅุฐุง ูƒุงู† 60% ู…ู† ุงู„ุณูƒุงู† ุจุงู„ุบูŠู†ุŒ ูˆ40% ู…ู† ุงู„ุจุงู„ุบูŠู† ูŠุนู…ู„ูˆู†ุŒ ููƒู… ุนุฏุฏ ุงู„ุนุงู…ู„ูŠู† ููŠ ุงู„ู…ุฏูŠู†ุฉุŸ'''

result, time_taken, tokens = generate_with_reasoning(prompt_text)
print(result)
```

---

## ๐Ÿ› ๏ธ Training Setup

### Configuration Summary

- **learning_rate**: 1e-5
- **epochs**: 1
- **max_completion_length**: 64
- **num_generations**: 4
- **gradient_accumulation_steps**: 16
- **logging_steps**: 10

### Reward Functions

- **accuracy_reward**: validates the correctness of the final answer using `math_verify`
- **format_reward**: checks for proper use of the `<think>` and `<answer>` tags

Sketches of both reward functions and the trainer wiring are shown below.
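The exact reward implementations are not included in this card. The following is a minimal sketch in the style of the Open-R1 GRPO recipe; the tag pattern, the `solution` column name, and the conversational completion format are assumptions, not confirmed details of this run.

```python
import re
from math_verify import parse, verify

# Assumption: completions arrive in conversational format, so each completion
# is a list of chat messages and the model's text is completion[0]["content"].

def format_reward(completions, **kwargs):
    """Return 1.0 when a completion wraps its output in <think>/<answer> tags."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]

def accuracy_reward(completions, solution, **kwargs):
    """Return 1.0 when math_verify judges the completion equivalent to the gold answer."""
    rewards = []
    contents = [completion[0]["content"] for completion in completions]
    for content, sol in zip(contents, solution):  # `solution` column assumed
        gold = parse(sol)       # default extraction handles LaTeX and plain expressions
        pred = parse(content)
        rewards.append(float(verify(gold, pred)))
    return rewards
```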
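Putting the pieces together, a minimal TRL `GRPOTrainer` run with the hyperparameters and LoRA settings above might look like the following sketch; the `output_dir` and the `train` split are illustrative choices, not values from the original run.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")

training_args = GRPOConfig(
    output_dir="Fanar-Math-R1-GRPO",  # illustrative output path
    learning_rate=1e-5,
    num_train_epochs=1,
    max_completion_length=64,         # short completions for the lightweight test run
    num_generations=4,                # group size used for the relative baseline
    gradient_accumulation_steps=16,
    logging_steps=10,
)

peft_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model="QCRI/Fanar-1-9B-Instruct",
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```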
### Libraries & Versions

```
transformers==4.47.1
trl==0.14.0
peft==0.14.0
datasets==2.21.0
math_verify==0.3.3
torch==2.4.1
```

---

## ๐Ÿ“Š Training Metrics (Snapshot)

| Step | Reward (avg) | Accuracy Reward | Format Reward | Loss | KL Divergence |
|------|--------------|-----------------|---------------|--------|---------------|
| 10 | 0.029 | 0.029 | 0.0 | 0.0 | 0.00024 |
| 100 | 0.039 | 0.039 | 0.0 | 0.0001 | 0.00188 |
| 200 | 0.033 | 0.033 | 0.0 | 0.0001 | 0.00183 |
| 300 | 0.045 | 0.045 | 0.0 | 0.0001 | 0.00127 |

*Note: Training was run with a small configuration for notebook-friendly experimentation.*

---

## ๐Ÿ“š Output Format

The model is trained to follow a reasoning-first format:

```
<think>
ุฃูˆู„ุงู‹ุŒ ู†ุญุณุจ 60% ู…ู† ู…ู„ูŠูˆู† ู†ุณู…ุฉุŒ ูˆู‡ูˆ 600,000. ุซู… ู†ุญุณุจ 40% ู…ู† ู‡ุฐุง ุงู„ุนุฏุฏุŒ ูˆู‡ูˆ 240,000.
</think>
<answer>
240,000
</answer>
```

(English: "First, we compute 60% of one million, which is 600,000. Then we compute 40% of that number, which is 240,000." The final answer is 240,000.)

---

## ๐Ÿ”ฌ Citations

### GRPO โ€“ DeepSeekMath

```bibtex
@article{zhihong2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y.K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```

### TRL Library

```bibtex
@misc{vonwerra2022trl,
  title={TRL: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouรฉdec, Quentin},
  year={2022},
  howpublished={\url{https://github.com/huggingface/trl}}
}
```

---

## ๐Ÿ”— Resources

- [DeepSeekMath Paper](https://arxiv.org/abs/2402.03300)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [Open-R1 Project](https://github.com/huggingface/open-r1)

---

Happy reasoning! ๐Ÿ”โœจ