---
title: RLM Arithmetic Training
emoji: 🔢
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
---

# GRPO + RLVR Arithmetic Training

Trains Qwen3-0.6B-Base on simple arithmetic (2-digit addition and subtraction) using GRPO + RLVR.

## Task

Solve arithmetic problems like:

- 47 + 35 = 82
- 92 - 17 = 75

## Approach

- **Model:** Qwen/Qwen3-0.6B-Base
- **Method:** GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- **Reward:** Exact match on the answer
- **Steps:** 50

## Expected Results

The base model (no math training) should perform poorly (<10% accuracy); the trained model should improve significantly.
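
As an illustration of the exact-match reward and the group-relative scoring that GRPO relies on, here is a minimal sketch in plain Python. It is not this Space's training code: the function names (`make_problem`, `exact_match_reward`, `group_relative_advantages`), the prompt format, the operand range of 10 to 99, and the choice to read the last integer in a completion as the answer are all assumptions made for the example.

```python
import random
import re
import statistics


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Build one 2-digit addition or subtraction prompt and its integer answer.

    Assumption: operands are drawn uniformly from 10..99.
    """
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    if rng.random() < 0.5:
        return f"{a} + {b} = ", a + b
    return f"{a} - {b} = ", a - b


def exact_match_reward(completion: str, target: int) -> float:
    """Verifiable reward: 1.0 if the last integer in the text equals the target.

    Assumption: the answer is the last integer that appears in the completion.
    """
    numbers = re.findall(r"-?\d+", completion)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == target else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the same prompt.

    Each completion is scored relative to the group mean and scaled by the
    group standard deviation, so no separate value model is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    prompt, target = make_problem(rng)
    # Hypothetical group of sampled completions for one prompt.
    group = [f"{prompt}{target}", f"{prompt}{target + 1}", "I am not sure", f"{prompt}{target}"]
    rewards = [exact_match_reward(c, target) for c in group]
    print(prompt, rewards, group_relative_advantages(rewards))
```

Reading the last integer keeps the check simple, but a base model that keeps generating after the answer would be scored on whatever number comes last, so a real setup may need a stricter answer format. When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.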