metadata
title: RLM Arithmetic Training
emoji: π’
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
GRPO + RLVR Arithmetic Training
Training Qwen3-0.6B-Base on simple arithmetic (2-digit addition/subtraction) using GRPO + RLVR.
Task
Solve arithmetic problems like:
- 47 + 35 = 82
- 92 - 17 = 75
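A minimal sketch of how such problems could be generated. The operand range (10 to 99), the prompt format, and the name `make_problem` are assumptions for illustration, not taken from the training code; note that under these assumptions subtraction can produce negative answers.

```python
import random

def make_problem(rng):
    """Return (prompt, answer) for one 2-digit addition or subtraction problem.

    Assumption: both operands are drawn uniformly from 10..99.
    """
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-"])
    answer = a + b if op == "+" else a - b
    return f"{a} {op} {b} =", answer

# Seeded generator so the sampled problems are reproducible.
rng = random.Random(0)
problems = [make_problem(rng) for _ in range(5)]
for prompt, answer in problems:
    print(prompt, answer)
```

Because the answer is computed alongside the prompt, every sample comes with a ground truth that a reward function can check automatically.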
Approach
- Model: Qwen/Qwen3-0.6B-Base
- Method: GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- Reward: Exact match on answer
- Steps: 50
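The reward and the group-relative step can be sketched as follows. This is an illustration under assumptions, not the Space's actual code: the reward values (1.0 / 0.0), the rule of taking the first integer in the completion, and the function names are hypothetical, and the completion is assumed to contain only the model's continuation, not the echoed prompt. The advantage formula is the commonly described GRPO normalization, where each reward is standardized within its group of samples for the same prompt.

```python
import re
import statistics

def exact_match_reward(completion, answer):
    """Return 1.0 if the first integer in the completion equals the answer, else 0.0."""
    match = re.search(r"-?\d+", completion)
    return 1.0 if match and int(match.group()) == answer else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled completions for the prompt "47 + 35 =" (correct answer: 82).
completions = ["82", "81", " 82.", "I think 72"]
rewards = [exact_match_reward(c, 82) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.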
Expected Results
The base model (no math training) should perform poorly (<10% accuracy); the trained model should improve significantly.
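One way to measure this is exact-match accuracy over a held-out set of problems. The sketch below is hypothetical: `predict` stands in for whatever function turns a prompt into the model's text completion, and the answer-extraction rule (first integer in the completion) is an assumption.

```python
import re

def accuracy(predict, problems):
    """Fraction of problems whose first integer in the completion matches the answer."""
    correct = 0
    for prompt, answer in problems:
        match = re.search(r"-?\d+", predict(prompt))
        if match and int(match.group()) == answer:
            correct += 1
    return correct / len(problems)

problems = [("47 + 35 =", 82), ("92 - 17 =", 75)]

# Stand-in for a model: always answers 82, so it gets one of the two problems right.
def always_82(prompt):
    return " 82"

print(accuracy(always_82, problems))  # 0.5
```

Running the same function on the base and trained checkpoints with an identical problem set keeps the before/after comparison fair.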