GRPO-trained from tinyllms/qwen2.5-7b-instruct-sft-loo-domain-knowledge (itself SFT'd from Qwen/Qwen2.5-7B-Instruct) using QLoRA (4-bit NF4 quantization + LoRA adapters).
This is a capped variant (max 12 steps) of the GRPO stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.
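A GRPO stage needs one scalar reward per sampled completion, which is then compared against the other completions drawn for the same prompt. The sketch below is an illustration under stated assumptions, not the training code behind this model: the card only mentions `<answer>` tag formatting and a penalty on excessive length, so the function names (`shaped_reward`, `group_relative_advantages`), the bonus and penalty weights, the character-based length cap, and the use of the population standard deviation are all hypothetical.

```python
import re
from statistics import mean, pstdev
from typing import List

# Matches the first <answer>...</answer> span in a completion.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def shaped_reward(completion: str, gold: str, max_chars: int = 2000) -> float:
    """Hypothetical reward: correctness + format bonus - length penalty.

    All weights here are illustrative assumptions, not values from the card.
    """
    match = ANSWER_RE.search(completion)
    # Small bonus for using the <answer> tags at all.
    format_bonus = 0.2 if match else 0.0
    # Exact-match correctness, only credited when the tags are present.
    correct = 1.0 if match and match.group(1).strip() == gold.strip() else 0.0
    # Linear penalty on characters beyond the cap, saturating at 0.5.
    overflow = max(0, len(completion) - max_chars)
    length_penalty = min(0.5, 0.5 * overflow / max_chars)
    return correct + format_bonus - length_penalty


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "6 * 4 = 24 <answer>24</answer>",  # correct and tagged
        "I think it is 24",                # correct value but no tags
        "<answer>25</answer>",             # tagged but wrong
    ]
    rewards = [shaped_reward(c, "24") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group mean, a group in which every completion earns the same reward contributes no learning signal, which is one reason shaped terms such as a format bonus are often added alongside plain correctness.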
`<answer>` tag formatting, penalises excessive length

Trained on two datasets (domain knowledge held out):
| Dataset | Domain |
|---|---|
| tinyllms/game24-trajectories | Game of 24 (arithmetic reasoning) |
| tinyllms/aime-1983-2023-trajectories | AIME (competition math) |
| Domain | Role |
|---|---|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | Held out |
Transfer is measured by evaluating on GPQA Diamond.
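As a rough illustration of how such a transfer score could be computed, the sketch below grades four-option multiple-choice completions by reading the letter inside the `<answer>` tags and reporting accuracy. The helper names (`extract_choice`, `accuracy`), the assumption that the model emits a single letter A to D inside the tags, and the choice to count untagged completions as wrong are all hypothetical; the card does not describe the actual evaluation harness.

```python
import re
from typing import List, Optional

# Accepts forms such as <answer>B</answer> or <answer>(b)</answer>.
CHOICE_RE = re.compile(r"<answer>\s*\(?([A-D])\)?\s*</answer>", re.IGNORECASE)


def extract_choice(completion: str) -> Optional[str]:
    """Return the last tagged choice letter, or None if no valid tag exists."""
    matches = CHOICE_RE.findall(completion)
    return matches[-1].upper() if matches else None


def accuracy(completions: List[str], gold_letters: List[str]) -> float:
    """Fraction of completions whose tagged letter equals the gold letter.

    Completions without a parsable <answer> tag are counted as incorrect.
    """
    if len(completions) != len(gold_letters):
        raise ValueError("completions and gold_letters must have equal length")
    if not completions:
        return 0.0
    hits = sum(
        extract_choice(c) == g.upper()
        for c, g in zip(completions, gold_letters)
    )
    return hits / len(completions)


if __name__ == "__main__":
    outputs = [
        "Reasoning... <answer>B</answer>",
        "Reasoning... <answer>(c)</answer>",
        "The answer is A",  # no tags, so graded as wrong
    ]
    print(accuracy(outputs, ["B", "C", "A"]))
```

Counting untagged outputs as wrong keeps the metric strict about the output format the model was trained to follow; a more lenient grader would report higher numbers for the same completions.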
pocket-sheet-grpo)

Base model
Qwen/Qwen2.5-7B