---
title: RLM Arithmetic Training
emoji: 🔢
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
---
# GRPO + RLVR Arithmetic Training
This Space trains Qwen3-0.6B-Base on simple arithmetic (2-digit addition and subtraction) using GRPO + RLVR.
## Task
Solve arithmetic problems like:
- 47 + 35 = 82
- 92 - 17 = 75
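Problems of this shape could be generated along the following lines. This is a minimal sketch, not the Space's actual data pipeline: the `make_problem` helper, the 10-99 operand range, the `"a op b ="` prompt format, and allowing negative results are all assumptions made for illustration.

```python
import random


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Return a (prompt, answer) pair for one 2-digit addition or subtraction.

    Hypothetical helper: operand range and prompt format are assumptions.
    """
    a = rng.randint(10, 99)  # assumed: both operands have exactly two digits
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-"])
    answer = a + b if op == "+" else a - b  # subtraction may go negative
    return f"{a} {op} {b} =", answer


# A fixed seed keeps the sampled problems reproducible across runs.
rng = random.Random(0)
for prompt, answer in (make_problem(rng) for _ in range(5)):
    print(prompt, answer)
```

Because each answer is computed alongside its prompt, every generated example carries a ground truth that can be checked automatically.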
## Approach
- **Model:** Qwen/Qwen3-0.6B-Base
- **Method:** GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- **Reward:** Exact match on answer
- **Steps:** 50
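The reward and the group-relative part of GRPO can be sketched in plain Python as below. This is an illustration under stated assumptions, not the training code used here: `exact_match_reward` and `group_relative_advantages` are hypothetical names, the reward is assumed to compare the first integer in the completion against the ground truth, and the population standard deviation with a small `eps` is one possible normalization choice.

```python
import re
import statistics


def exact_match_reward(completion: str, answer: int) -> float:
    """Return 1.0 if the first integer in the completion equals the answer.

    Assumption: the model is expected to emit the answer first, so any
    later numbers in the completion are ignored.
    """
    match = re.search(r"-?\d+", completion)
    return 1.0 if match and int(match.group()) == answer else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population std; implementations may differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group: four sampled completions for the prompt "47 + 35 =".
completions = [" 82", "82", "72", "I think 81"]
rewards = [exact_match_reward(c, 82) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.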
## Expected Results
The base model (no math training) is expected to perform poorly (<10% of answers correct), while the trained model should improve significantly.