# GRPO Fine-tuned Model on GSM8K
This model was fine-tuned using GRPO (Group Relative Policy Optimization) on the GSM8K dataset.
## Training Details
- Base Model: Qwen/Qwen2.5-1.5B-Instruct
- Dataset: openai/gsm8k
- Algorithm: GRPO
- Training Steps: see the training logs for details
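For context, GRPO replaces a learned value network with a group-relative baseline: for each prompt, it samples a group of completions, scores each with a reward (e.g. whether the final GSM8K answer is correct), and normalizes rewards within the group to get advantages. The sketch below illustrates that normalization step only; it is not the actual training code used for this model, and `group_relative_advantages` is a hypothetical helper.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled completions.

    GRPO-style advantage: (reward - group mean) / (group std + eps).
    This is an illustrative sketch, not this repo's training code.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 completions for one prompt, binary correctness reward.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's average get positive advantages and are reinforced; the rest are pushed down, with no critic model needed.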
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("LLMAligned/grpo_gsm8k_model")
tokenizer = AutoTokenizer.from_pretrained("LLMAligned/grpo_gsm8k_model")

prompt = "Solve: What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt")
# Cap new tokens rather than total length so long prompts are not truncated.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
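If you want to score outputs against GSM8K references, note that the dataset's reference answers end with a `#### <number>` line. A minimal parser for that convention (a sketch; `extract_answer` is a hypothetical helper, not part of this repo):

```python
import re

def extract_answer(text):
    """Return the last '#### <number>' value in text, or None.

    GSM8K reference solutions end with '#### <answer>'; commas in
    numbers (e.g. '1,000') are stripped for comparison.
    """
    matches = re.findall(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    return matches[-1].replace(",", "") if matches else None

print(extract_answer("25 * 4 = 100\n#### 100"))  # prints "100"
```

Comparing the extracted strings (or their float values) from the model output and the reference gives a simple exact-match accuracy metric.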