|
|
--- |
|
|
base_model: ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1 |
|
|
language: |
|
|
- en |
|
|
- tr |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
tags: |
|
|
- text-generation-inference |
|
|
- transformers |
|
|
- unsloth |
|
|
- llama |
|
|
- trl |
|
|
- grpo |
|
|
- test-time-reinforcement-learning |
|
|
--- |
|
|
|
|
|
<img src="https://huggingface.co/Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO/resolve/main/llama_clones.png" alt="A scene from a famous movie" width="800"/>
|
|
|
|
|
# LLaMA-3-8B-Math-Majority-Vote-GRPO |
|
|
|
|
|
Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO is a [Test-Time Reinforcement Learning (TTRL)](https://arxiv.org/abs/2504.16084) trained version of ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1. It was trained on Turkish math word problems using the GRPO method with a majority-vote reward function.
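At a high level, TTRL samples several answers per question, treats the majority answer as a pseudo-label, and rewards the rollouts that agree with it. A minimal sketch of such a majority-vote reward function (illustrative only, not the exact implementation used for training):

```python
from collections import Counter

def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Reward each sampled answer with 1.0 if it matches the most
    common answer in the group (the pseudo-label), else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

print(majority_vote_rewards(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```

No ground-truth labels appear anywhere in this computation, which is what makes the method usable at test time.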
|
|
|
|
|
**Paper:** [TTRL: Test-Time Reinforcement Learning](https://huggingface.co/papers/2504.16084) |
|
|
**Code:** [https://github.com/PRIME-RL/TTRL](https://github.com/PRIME-RL/TTRL) |
|
|
|
|
|
## Training Info |
|
|
|
|
|
- **Base Model**: [Turkish-Llama-8b-DPO-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-Llama-8b-DPO-v0.1) |
|
|
- **Training Data**: 2,000 open-ended math word problems. No proprietary data was included.
|
|
- **Training Time**: 13 hours on a single L40S GPU
|
|
|
|
|
- **LoRA Configs**: |
|
|
- lora_r: 16 |
|
|
- lora_alpha: 16 |
|
|
- lora_dropout: 0 |
|
|
- lora_target_linear: true |
|
|
|
|
|
The goal was to train a model that reasons before generating its answer, without using any labels or ground-truth answers. It uses the template below:
|
|
|
|
|
```xml
<mantık>
...
</mantık>
<cevap>
...
</cevap>
```
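Because the answer uses a period as the thousands separator and a comma as the decimal separator, extracting and parsing it takes a small helper. A minimal sketch (`parse_answer` is an illustrative name, not part of the model's tooling):

```python
import re

def parse_answer(completion: str) -> float:
    """Extract the value between <cevap> tags and convert the
    Turkish-formatted number (period = thousands separator,
    comma = decimal separator) to a float."""
    match = re.search(r"<cevap>(.*?)</cevap>", completion, re.DOTALL)
    if match is None:
        raise ValueError("no <cevap> tag found in completion")
    raw = match.group(1).strip()
    return float(raw.replace(".", "").replace(",", "."))

print(parse_answer("<mantık>...</mantık><cevap>1.450,02</cevap>"))  # 1450.02
```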
|
|
|
|
|
For more information about this model, please visit [my blog post](https://metinusta.github.io/post.html?slug=test-time-reinforcement-learning).
|
|
|
|
|
## How to use |
|
|
|
|
|
1. Install vLLM |
|
|
```bash
pip install vllm
```
|
|
2. Run inference
|
|
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Metin/LLaMA-3-8B-Math-Majority-Vote-GRPO")

sampling_params = SamplingParams(temperature=0.5)

# The system prompt (in Turkish) asks the model to reason between
# <mantık> and </mantık> and to write the result between <cevap> and
# </cevap>, using only digits, a period as the thousands separator, and
# a comma as the decimal separator, e.g. <cevap>1.450,02</cevap>.
SYSTEM_PROMPT = """
Sana verilen matematik problemi hakkında düşün ve çözümü bul.
Düşüncelerini <mantık> ve </mantık> arasına yaz.
Sonucu ise <cevap> ve </cevap> arasına yaz. Sonucu yazarken sadece rakamları, noktayı ve virgülü kullan. Noktayı binlik ayracı, virgülü ise ondalık ayracı olarak kullanmalısın. Örnek: <cevap>1.450,02</cevap>
"""

conversation = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT
    },
    {
        "role": "user",
        # "The population is 20,000 and grows by 10% each year.
        # What will the population be after three years?"
        "content": "Nüfus 20.000'dir. Nüfus her yıl %10 artmaktadır. Buna göre üç yıl sonra nüfus kaç olur?"
    }
]

outputs = llm.chat(
    conversation,
    sampling_params=sampling_params,
    use_tqdm=False
)

# The model emits plain text wrapped in <mantık>/<cevap> tags, not JSON,
# so print the raw completion rather than parsing it with json.loads.
print(outputs[0].outputs[0].text)
```
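For reference, the sample question compounds 10% annual growth over three years, so the expected answer can be checked directly:

```python
population = 20_000
for _ in range(3):
    population *= 1.1  # 10% growth per year
print(round(population))  # 26620
```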
|
|
|
|
|
## Citation
|
|
```bibtex
@article{zuo2025ttrl,
  title={{TTRL}: Test-Time Reinforcement Learning},
  author={Zuo, Yuxin and Zhang, Kaiyan and Qu, Shang and Sheng, Li and Zhu, Xuekai and Qi, Biqing and Sun, Youbang and Cui, Ganqu and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2504.16084},
  year={2025}
}
```