---
base_model: ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1
language:
- en
- tr
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- grpo
- test-time-reinforcement-learning
---

<img src="https://huggingface.co/Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO/resolve/main/llama_clones.png"
     alt="A scene from a famous movie" width="800"/>

# LLaMA-3-8B-Math-Majority-Vote-GRPO

Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO is a [Test-Time Reinforcement Learning (TTRL)](https://arxiv.org/abs/2504.16084)-trained version of ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1. It was trained on Turkish math word problems with the GRPO method and a majority-vote reward function.

**Paper:** [TTRL: Test-Time Reinforcement Learning](https://huggingface.co/papers/2504.16084)

**Code:** [https://github.com/PRIME-RL/TTRL](https://github.com/PRIME-RL/TTRL)

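In the TTRL setting, the majority-vote reward stands in for ground-truth labels: several answers are sampled for the same problem, and each sample is rewarded by whether it agrees with the most common answer in the group. A minimal sketch of that idea (the function name and reward values are illustrative, not taken from the actual training code):

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward each sampled answer 1.0 if it matches the most frequent
    answer in the group, else 0.0. No ground truth is needed."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Eight samples for one problem; "26.620" wins the vote,
# so the two disagreeing samples receive zero reward.
rewards = majority_vote_rewards(
    ["26.620", "26.620", "24.200", "26.620",
     "26.620", "22.000", "26.620", "26.620"]
)
```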
## Training Info

- **Base Model**: [Turkish-Llama-8b-DPO-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1)
- **Training Data**: 2,000 open-ended math word problems. No proprietary data was included.
- **Training Time**: 13 hours on a single L40S
- **LoRA Configs**:
  - lora_r: 16
  - lora_alpha: 16
  - lora_dropout: 0
  - lora_target_linear: true

The goal was to train a model that can reason before generating its answer, without using any labels or ground-truth answers. The model uses the template below, where `<mantık>` ("reasoning") holds the chain of thought and `<cevap>` ("answer") holds the final answer:

```xml
<mantık>
...
</mantık>
<cevap>
...
</cevap>
```

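The reasoning and answer can be pulled out of a completion with ordinary regular expressions; a small sketch (the helper name is illustrative, not part of the model's code):

```python
import re

def parse_completion(text):
    """Extract the reasoning and answer from the tagged template.
    Returns (reasoning, answer); either may be None if a tag is missing."""
    reasoning = re.search(r"<mantık>(.*?)</mantık>", text, re.DOTALL)
    answer = re.search(r"<cevap>(.*?)</cevap>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )

completion = "<mantık>20.000 * 1,1^3 = 26.620</mantık>\n<cevap>26.620</cevap>"
reasoning, answer = parse_completion(completion)
```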
For more information about this model, please visit [my blog post](https://metinusta.github.io/post.html?slug=test-time-reinforcement-learning).

## How to use

1. Install vLLM:
```bash
pip install vllm
```
2. Run inference:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO")

# A generous token budget leaves room for the reasoning section.
sampling_params = SamplingParams(temperature=0.5, max_tokens=1024)

# System prompt (in Turkish): "Think about the given math problem and find
# the solution. Write your reasoning between <mantık> and </mantık>.
# Write the result between <cevap> and </cevap>, using only digits, the
# period, and the comma. Use the period as the thousands separator and the
# comma as the decimal separator. Example: <cevap>1.450,02</cevap>"
SYSTEM_PROMPT = """
Sana verilen matematik problemi hakkında düşün ve çözümü bul.
Düşüncelerini <mantık> ve </mantık> arasına yaz.
Sonucu ise <cevap> ve </cevap> arasına yaz. Sonucu yazarken sadece rakamları, noktayı ve virgülü kullan. Noktayı binlik ayracı, virgülü ise ondalık ayracı olarak kullanmalısın. Örnek: <cevap>1.450,02</cevap>
"""

conversation = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT
    },
    {
        # "The population is 20,000 and grows by 10% each year.
        #  What will the population be after three years?"
        "role": "user",
        "content": "Nüfus 20.000'dir. Nüfus her yıl %10 artmaktadır. Buna göre üç yıl sonra nüfus kaç olur?"
    }
]

outputs = llm.chat(
    conversation,
    sampling_params=sampling_params,
    use_tqdm=False
)

# The model returns plain text containing the <mantık>/<cevap> tags,
# not JSON, so print the raw completion directly.
print(outputs[0].outputs[0].text)
```
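Because the answer follows the Turkish convention (period as thousands separator, comma as decimal separator, e.g. `1.450,02`), converting it to a Python float just requires swapping the separators. A minimal sketch (the helper name is illustrative, not part of the model card's code):

```python
def turkish_number_to_float(s):
    """Convert a Turkish-formatted number such as '1.450,02' to a float:
    drop the thousands periods, then turn the decimal comma into a period."""
    return float(s.replace(".", "").replace(",", "."))

value = turkish_number_to_float("1.450,02")
```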

## Citation

```bibtex
@article{zuo2025ttrl,
  title={TTRL: Test-time reinforcement learning},
  author={Zuo, Yuxin and Zhang, Kaiyan and Qu, Shang and Sheng, Li and Zhu, Xuekai and Qi, Biqing and Sun, Youbang and Cui, Ganqu and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2504.16084},
  year={2025}
}
```