Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO


Metin committed on May 18, 2025

Commit bb87305 · verified · 1 Parent(s): 5bcefe6

Update README.md

Files changed (1)

README.md +25 -6


@@ -7,17 +7,36 @@ tags:
 - llama
 - trl
 - grpo
 license: apache-2.0
 language:
 - en
 ---
-# Uploaded  model
-- **Developed by:** Metin
-- **License:** apache-2.0
-- **Finetuned from model :** ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1
-This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 - llama
 - trl
 - grpo
+- test-time-reinforcement-learning
 license: apache-2.0
 language:
 - en
 ---
+# LLaMA-3-8B-Math-Majority-Vote-GRPO
+Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO is a version of ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1 trained with [Test-Time Reinforcement Learning (TTRL)](https://arxiv.org/abs/2504.16084). It was trained on Turkish math word problems using the GRPO method and a majority-vote reward function.
+## Training Info
+- **Base Model**: [Turkish-Llama-8b-DPO-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1)
+- **Training Data**: 2,000 open-ended math word problems. No proprietary data was included.
+- **Training Time**: 13 hours on a single L40S
+- **LoRA Configs**:
+  - lora_r: 16
+  - lora_alpha: 16
+  - lora_dropout: 0
+  - lora_target_linear: true
+The goal was to train a model that reasons before generating its answer, without using any labels or ground-truth answers. It uses the following template:
+```xml
+<mantık>
+...
+</mantık>
+<cevap>
+</cevap>
+```
+For more information, please see [my blog post](TO-DO) about this model.
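
The README mentions a majority-vote reward but does not show it. The following is a minimal sketch of how such a reward could work, not the author's actual implementation. It assumes the final answer is the text inside the `<cevap>` tags of the template above, that answers are compared as exact strings, and that one call receives all completions sampled for a single prompt. The names `extract_answer` and `majority_vote_rewards` are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumption: the answer is whatever sits between <cevap> and </cevap>.
ANSWER_RE = re.compile(r"<cevap>\s*(.*?)\s*</cevap>", re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text of the last <cevap> block, or None if missing or empty."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    answer = matches[-1].strip()
    return answer or None


def majority_vote_rewards(completions: List[str]) -> List[float]:
    """Score each completion 1.0 if its answer equals the group's majority answer.

    No ground-truth label is used: the most frequent answer among the
    sampled completions serves as a pseudo-label.
    """
    answers = [extract_answer(c) for c in completions]
    valid = [a for a in answers if a is not None]
    if not valid:
        # Nothing parseable, so no completion earns a reward.
        return [0.0] * len(completions)
    pseudo_label, _ = Counter(valid).most_common(1)[0]
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


if __name__ == "__main__":
    group = [
        "<mantık>\n3 * 4 = 12\n</mantık>\n<cevap>\n12\n</cevap>",
        "<mantık>\n...\n</mantık>\n<cevap>12</cevap>",
        "<mantık>\n...\n</mantık>\n<cevap>15</cevap>",
        "no tags at all",
    ]
    print(majority_vote_rewards(group))
```

Exact string matching is the simplest choice here; a real setup would likely normalize numbers first (for example, treating `12` and `12.0` as the same answer), which this sketch leaves out.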