---
title: RLM Arithmetic Training
emoji: 🔒
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small
---

# GRPO + RLVR Arithmetic Training

This Space trains Qwen3-0.6B-Base on simple arithmetic (2-digit addition and subtraction) using GRPO + RLVR.

## Task

Solve arithmetic problems like:
- 47 + 35 = 82
- 92 - 17 = 75
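
Problems of this shape could be generated along the following lines. This is a minimal sketch, not the Space's actual data pipeline: the names `make_problem` and `make_dataset`, the prompt format `"a op b ="`, and the choice to keep subtraction results non-negative are all assumptions made for illustration.

```python
import random


def make_problem(rng):
    """Return (prompt, answer) for one random 2-digit addition or subtraction."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-"])
    if op == "-" and b > a:
        # Assumption: swap operands so the result is never negative.
        a, b = b, a
    answer = a + b if op == "+" else a - b
    return f"{a} {op} {b} =", str(answer)


def make_dataset(n, seed=0):
    """Build n problems from a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [make_problem(rng) for _ in range(n)]
```

Seeding the generator keeps the training and evaluation sets reproducible across runs.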

## Approach

- **Model:** Qwen/Qwen3-0.6B-Base
- **Method:** GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
- **Reward:** Exact match on answer
- **Steps:** 50
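
The reward and the group-relative part of GRPO can be sketched in a few lines. The snippet below is an illustration under stated assumptions, not the training code used here: `exact_match_reward` assumes the model's answer is the last integer in its completion, and `group_relative_advantages` uses the population standard deviation plus a small hypothetical `eps` to avoid division by zero when every completion in a group gets the same reward.

```python
import re
import statistics


def exact_match_reward(completion, answer):
    """Return 1.0 if the last integer in the completion equals the answer, else 0.0."""
    numbers = re.findall(r"-?\d+", completion)
    if not numbers:
        return 0.0
    # Assumption: the final integer in the text is the model's answer.
    return 1.0 if int(numbers[-1]) == int(answer) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (assumed value) keeps the division finite when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of four completions with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct completions receive positive advantages and the incorrect ones negative advantages of equal size. A group in which every completion is wrong (or every one is right) yields all-zero advantages and so contributes no learning signal.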

## Expected Results

The base model (no math training) should perform poorly (<10% exact-match accuracy); the trained model should improve significantly.
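
One way to measure this comparison is exact-match accuracy over a fixed set of problems, computed once for the base model and once for the trained model. The helper below is a hypothetical sketch: the name `accuracy` and the whitespace stripping are assumptions, and it expects predictions that have already been reduced to the answer string.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if not answers:
        raise ValueError("answers must not be empty")
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    # Assumption: surrounding whitespace is not significant.
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Evaluating both models on the same problem set keeps the before and after numbers directly comparable.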