---
title: RLM Arithmetic Training
emoji: 🔢
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
---

# GRPO + RLVR Arithmetic Training

This project trains Qwen3-0.6B-Base on simple arithmetic (2-digit addition and subtraction) using GRPO with RLVR.

## Task

Solve arithmetic problems like:
- 47 + 35 = 82
- 92 - 17 = 75
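
A minimal sketch of how such problems could be generated. The function name `make_problem`, the operand range (10 to 99), the prompt format, and the choice to keep differences non-negative are illustrative assumptions, not details taken from this repository.

```python
import random


def make_problem(rng: random.Random) -> tuple[str, int]:
    """Return (prompt, answer) for one 2-digit addition or subtraction problem."""
    # Assumption: "2-digit" means both operands lie in 10..99.
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    if rng.random() < 0.5:
        return f"{a} + {b} =", a + b
    # Assumption: put the larger operand first so the difference is never negative.
    a, b = max(a, b), min(a, b)
    return f"{a} - {b} =", a - b


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    for _ in range(3):
        prompt, answer = make_problem(rng)
        print(prompt, answer)
```

Passing an explicit `random.Random` instance keeps training and evaluation sets reproducible and easy to separate by seed.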

## Approach

- **Model:** Qwen/Qwen3-0.6B-Base
- **Method:** GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- **Reward:** Exact match on answer
- **Steps:** 50
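
The reward and the "group relative" part of GRPO can be sketched as follows. This is an illustration under stated assumptions, not the training code used here: `exact_match_reward` and `group_advantages` are hypothetical names, the answer is assumed to be the first integer in the completion, and the advantage is the commonly used form (reward minus group mean, divided by group standard deviation).

```python
import re
import statistics


def exact_match_reward(completion: str, answer: int) -> float:
    """Verifiable reward: 1.0 if the parsed answer equals the target, else 0.0."""
    # Assumption: the model's answer is the first integer in the completion.
    match = re.search(r"-?\d+", completion)
    if match is None:
        return 0.0
    return 1.0 if int(match.group()) == answer else 0.0


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # If every completion gets the same reward, all advantages are zero,
    # so that group contributes no learning signal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    completions = [" 82", " 81", " 82", " seventy"]
    rewards = [exact_match_reward(c, 82) for c in completions]
    print(rewards)
    print(group_advantages(rewards))
```

Because the reward is binary, a group only produces a useful gradient when it contains both correct and incorrect completions.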

## Expected Results

The base model (no math training) should perform poorly (<10% accuracy); the trained model should improve significantly.
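
One way to measure this is exact-match accuracy over a held-out set of problems. The sketch below is an assumption about how evaluation might look: `accuracy` is a hypothetical helper, `generate` stands in for whatever function maps a prompt to model output, and the answer is again taken to be the first integer in the completion.

```python
import re
from typing import Callable, Iterable


def accuracy(
    generate: Callable[[str], str],
    problems: Iterable[tuple[str, int]],
) -> float:
    """Fraction of problems whose first generated integer equals the answer."""
    correct = 0
    total = 0
    for prompt, answer in problems:
        match = re.search(r"-?\d+", generate(prompt))
        if match is not None and int(match.group()) == answer:
            correct += 1
        total += 1
    # Return 0.0 for an empty problem set instead of dividing by zero.
    return correct / total if total else 0.0


if __name__ == "__main__":
    problems = [("47 + 35 =", 82), ("92 - 17 =", 75)]
    # Stand-in "model" that always answers 82, for demonstration only.
    print(accuracy(lambda prompt: " 82", problems))
```

Running the same function on the base and trained checkpoints with an identical problem set gives a like-for-like comparison.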