openai/gsm8k
GRPO is a causal language model fine-tuned with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm built on PPO. For each prompt, GRPO samples a group of completions, scores them with a reward function, and updates the model according to how each completion's reward compares with the rest of its group. This groupwise comparison aligns model outputs with the reward function and makes the method well suited to structured generation tasks such as Chain-of-Thought (CoT) reasoning.
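The groupwise comparison can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each reward is normalized by the mean and standard deviation of its group; the exact normalization used to train this model is not stated here. The function bodies, the `target_len` parameter, and the sample completions are hypothetical, chosen only to show how a length-based reward turns into relative advantages.

```python
from statistics import mean, pstdev


def reward_len(completions, target_len=20):
    """Hypothetical length-based reward: completions closer to
    target_len characters score higher (0 is the best possible score)."""
    return [-abs(target_len - len(c)) for c in completions]


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is judged only relative to its siblings.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, three sampled completions (illustrative data only).
completions = [
    "4",
    "The answer is 4.",
    "Two plus two equals four, so the answer is 4.",
]
rewards = reward_len(completions)
advantages = group_relative_advantages(rewards)

for text, r, a in zip(completions, rewards, advantages):
    print(f"reward={r:>4}  advantage={a:+.2f}  {text!r}")
```

Completions with above-average reward in their group receive a positive advantage and are reinforced, while below-average ones are discouraged. Because the baseline is the group mean, no separate value model is required.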
Base model: HuggingFaceTB/SmolLM-135M-Instruct
Reward function: reward_len (length-based)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("HuangXinBa/GRPO")
tokenizer = AutoTokenizer.from_pretrained("HuangXinBa/GRPO")
prompt = "<your prompt here>"
# Place the inputs on the same device as the model to avoid a device mismatch.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
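# max_new_tokens caps the completion length; 256 is an illustrative value, adjust as needed.
outputs = model.generate(**inputs, max_new_tokens=256)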
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
@misc{grpo2025,
title={GRPO-1: Finetuning a Language Model with Generalized Reinforcement Policy Optimization},
author={Huang Jinting},
year={2025},
note={https://wandb.ai/ggg7334-the-school-of-the-new-york-times/GRPO}
}