# Llama-3-8b-grpo
The model was trained for the LM Playschool Challenge (beta).
It is designed to play games in ClemBench.
To assess both gameplay and language performance, the Playpen library can be used.
## Model description
- Model type: A causal language model fine-tuned on publicly available clembench game data, combined with manually crafted scoring functions.
- Language(s) (NLP): Primarily English
- License: Llama 3.1 Community License Agreement
- Finetuned from model: meta-llama/Llama-3.1-8B-Instruct
## Model Sources
- Training Repository: https://github.com/paulutsch/playpen
- Eval Repository: https://github.com/lm-playpen/playpen
## Training Data
The model was trained on a processed and filtered version of the clembench DPO Turn dataset, using additionally created scoring functions to provide automatically verifiable rewards.
## Model Family
| Stage | Llama 3.1 8B |
|---|---|
| Base Model | meta-llama/Llama-3.1-8B-Instruct |
| SFT_initial | pm-25/llama3-8b-sft-initial |
| SFT_final | pm-25/llama3-8b-sft |
| DPO | pm-25/llama3-8b-dpo_clean |
| SFT + DPO | pm-25/llama3-8b-sft-dpo |
| SFT + DPO_tulu_data_only | pm-25/llama3-8b-sft-dpo-tulu-only |
| GRPO | pm-25/llama3-8b-grpo |
| SFT + GRPO | pm-25/llama3-8b-sft-grpo |
## Using the model

### Loading with HuggingFace
To load the model with HuggingFace Transformers, use the following snippet:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Attach the GRPO-trained LoRA adapter
model = PeftModel.from_pretrained(model, "pm-25/llama3-8b-grpo")
```
### Via Playpen
To evaluate the model’s gameplay performance, run the following command:

```shell
playpen eval <model-name>
```
Before evaluation, the model must be registered in the `model_registry.json` file located in the playpen folder:
```json
{
  "model_name": "llama3-8b-grpo",
  "backend": "huggingface_local",
  "huggingface_id": "meta-llama/Llama-3.1-8B-Instruct",
  "release_date": "2025-08-22",
  "open_weight": true,
  "parameters": "8B",
  "languages": ["en", "de", "fr", "it", "pt", "hi", "es", "th"],
  "context_size": "128k",
  "license": {
    "name": "Meta",
    "url": "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE"
  },
  "model_config": {
    "peft_model": "pm-25/llama3-8b-grpo",
    "requires_api_key": true,
    "premade_chat_template": true,
    "eos_to_cull": "<\\|eot_id\\|>"
  }
}
```
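As a minimal sketch (assuming the registry is a JSON array of entry objects, and using a hypothetical `register` helper plus an abbreviated entry for illustration), the registration above can also be done programmatically:

```python
import json
from pathlib import Path

# Abbreviated registry entry (see the full JSON entry above)
entry = {
    "model_name": "llama3-8b-grpo",
    "backend": "huggingface_local",
    "huggingface_id": "meta-llama/Llama-3.1-8B-Instruct",
    "open_weight": True,
    "parameters": "8B",
    "model_config": {"peft_model": "pm-25/llama3-8b-grpo"},
}

def register(registry_path: str, new_entry: dict) -> list:
    """Append an entry to the registry, assumed to be a JSON array."""
    path = Path(registry_path)
    registry = json.loads(path.read_text()) if path.exists() else []
    # Skip duplicate registrations under the same model_name
    if not any(e.get("model_name") == new_entry["model_name"] for e in registry):
        registry.append(new_entry)
    path.write_text(json.dumps(registry, indent=2))
    return registry
```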
## Performance
| Model | ClemScore | StatScore |
|---|---|---|
| Llama-3-8b-sft | 42.68 | 53.25 |
| Llama-3-8b-sft-initial | 33.86 | 55.62 |
| Llama-3-8b-grpo | 32.82 | 57.86 |
| Llama-3.1-8B-Instruct (base) | 29.05 | 55.45 |
| Llama-3-8b-sft-dpo | 28.32 | 55.58 |
| Llama-3-8b-sft-grpo | 26.68 | 57.74 |
| Llama-3-8b-sft-dpo_tulu_only | 23.68 | 58.04 |
| Llama-3-8b-dpo_clean | 17.57 | 52.83 |
| Tulu3-8b-SFT | 4.77 | 55.51 |
| Tulu3-8b-DPO | 3.66 | 56.16 |
| Tulu3-8b | 2.41 | 57.43 |
## Hyperparameters
### GRPO
- Learning Rate: 5e-6
- Effective Batch Size: 16
- Max. Sequence Length: 4096
- Loss Accumulation: Sum
- Learning Rate Schedule: Linear
- LR Warmup Ratio: 0.03
- Num. Epochs: 2
- bf16: True
- Seed: 7331
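Under the assumption that training used a TRL-style GRPO trainer, the settings above map roughly onto keyword arguments like the following (the per-device / gradient-accumulation split behind the effective batch size of 16 is an assumption, as the argument names are not confirmed by the training repository):

```python
# Hypothetical mapping of the hyperparameters above onto TRL-style
# trainer keyword arguments; names and batch split are assumptions.
grpo_kwargs = dict(
    learning_rate=5e-6,
    per_device_train_batch_size=4,   # assumed split of the effective batch
    gradient_accumulation_steps=4,   # 4 * 4 = 16 effective batch size
    max_completion_length=4096,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    num_train_epochs=2,
    bf16=True,
    seed=7331,
)

effective_batch = (
    grpo_kwargs["per_device_train_batch_size"]
    * grpo_kwargs["gradient_accumulation_steps"]
)
```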
### LoRA Config
- r: 16
- lora_alpha: 32
- lora_dropout: 0.05
- Target Modules: All Linear
- Modules to Save: lm_head, embed_tokens
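A minimal sketch of the corresponding adapter setup, assuming peft's `LoraConfig` with its `"all-linear"` shorthand for targeting all linear layers:

```python
from peft import LoraConfig

# Sketch of the LoRA configuration listed above (peft argument names)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",                  # adapt all linear layers
    modules_to_save=["lm_head", "embed_tokens"],  # train these modules fully
    task_type="CAUSAL_LM",
)
```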
## License and use
All Llama 3.1 models are released under Meta's Llama 3.1 Community License Agreement, Copyright © Meta Platforms, Inc. This model is intended for research and educational use. For more information, please see our Responsible Use Guidelines.