---
title: RLM Arithmetic Training
emoji: π’
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
---

# GRPO + RLVR Arithmetic Training

This Space trains Qwen3-0.6B-Base on simple arithmetic (two-digit addition and subtraction) using GRPO with RLVR.

## Task

The model is asked to solve problems such as:

- 47 + 35 = 82
- 92 - 17 = 75

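The README does not show how problems are produced. The following is a minimal sketch of one way to generate them; the function name `make_problem`, the prompt format `"a + b ="`, and the choice to keep subtraction results non-negative are assumptions inferred only from the two examples above, not this Space's actual data pipeline.

```python
import random


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Hypothetical helper: return a (prompt, answer) pair for a 2-digit +/- problem."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    if rng.random() < 0.5:
        return f"{a} + {b} =", a + b
    # Assumption: order the operands so subtraction never goes negative,
    # matching the README example 92 - 17 = 75.
    a, b = max(a, b), min(a, b)
    return f"{a} - {b} =", a - b


# Seeded RNG so the generated dataset is reproducible.
rng = random.Random(0)
dataset = [make_problem(rng) for _ in range(5)]
for prompt, answer in dataset:
    print(prompt, answer)
```

Seeding the generator makes train and evaluation sets reproducible; using a different seed for a held-out set keeps evaluation problems from simply repeating training ones (with only about 8,100 addition pairs, some overlap is still possible).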
## Approach

- **Model:** Qwen/Qwen3-0.6B-Base
- **Method:** GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- **Reward:** Exact match on the final answer
- **Training steps:** 50

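As a rough illustration of how the reward and GRPO fit together: GRPO samples a group of completions for each prompt, scores each one with a verifiable reward, and uses each reward's deviation from the group mean (scaled by the group's standard deviation) as its advantage, so no learned value model is needed. The sketch below assumes the completion excludes the prompt and that the answer is the first integer the model emits; `extract_answer`, `exact_match_reward`, and `group_relative_advantages` are hypothetical helpers, not this Space's code, and the population standard deviation and `eps` value are illustrative choices.

```python
import re
import statistics
from typing import List, Optional


def extract_answer(completion: str) -> Optional[int]:
    """Return the first (optionally negative) integer in the completion, or None."""
    match = re.search(r"-?\d+", completion)
    return int(match.group()) if match else None


def exact_match_reward(completion: str, target: int) -> float:
    """Verifiable reward: 1.0 if the extracted answer equals the target, else 0.0."""
    return 1.0 if extract_answer(completion) == target else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """GRPO-style advantages: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # eps guards against a zero std
    return [(r - mean) / (std + eps) for r in rewards]


# One group of sampled completions for the prompt "47 + 35 =".
completions = [" 82", " 81", "82 is the answer", " 28"]
rewards = [exact_match_reward(c, 82) for c in completions]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

A group in which every completion receives the same reward (all correct or all wrong) yields all-zero advantages, so it contributes no policy-gradient signal for that prompt.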
## Expected Results

The base model (no arithmetic-specific training) is expected to score poorly (<10% accuracy), while the trained model should improve significantly.
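Checking these expectations requires an accuracy measurement on held-out problems. A minimal sketch follows; `evaluate`, `parse_first_int`, and the two toy models are hypothetical stand-ins (a real run would wrap the base or trained checkpoint's text generation in the `generate` callable), and scoring the first integer in the output is an assumed answer-extraction rule.

```python
import re
from typing import Callable, Optional, Sequence, Tuple


def parse_first_int(text: str) -> Optional[int]:
    """Return the first (optionally negative) integer in text, or None."""
    match = re.search(r"-?\d+", text)
    return int(match.group()) if match else None


def evaluate(generate: Callable[[str], str],
             problems: Sequence[Tuple[str, int]]) -> float:
    """Exact-match accuracy of `generate` over (prompt, answer) pairs."""
    if not problems:
        raise ValueError("need at least one problem")
    correct = sum(parse_first_int(generate(prompt)) == answer
                  for prompt, answer in problems)
    return correct / len(problems)


# Hypothetical toy models, for illustration only.
problems = [("47 + 35 =", 82), ("92 - 17 =", 75)]


def echo_model(prompt: str) -> str:
    # Restates the problem, so the first integer is an operand, not the answer.
    return " " + prompt


def oracle_model(prompt: str) -> str:
    # Computes the correct answer directly.
    a, op, b = prompt.rstrip(" =").split()
    return f" {int(a) + int(b) if op == '+' else int(a) - int(b)}"


print(evaluate(echo_model, problems))    # 0.0
print(evaluate(oracle_model, problems))  # 1.0
```

The `echo_model` case shows why the extraction rule matters: a model that restates the problem is scored on its first operand, so the parsing choice directly affects the measured accuracy of the base model.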