metadata
title: RLM Arithmetic Training
emoji: 🔒
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
hardware: t4-small

GRPO + RLVR Arithmetic Training

This Space trains Qwen3-0.6B-Base on simple arithmetic (two-digit addition and subtraction) using GRPO + RLVR.

Task

Solve arithmetic problems like:

  • 47 + 35 = 82
  • 92 - 17 = 75
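Problems of this shape could be generated along the following lines. This is a minimal sketch, not the Space's actual data pipeline: the function name `make_problem`, the operand range 10 to 99, the `"a op b ="` prompt format, and the choice to order subtraction operands so that answers are never negative are all assumptions made for illustration.

```python
import random


def make_problem(rng):
    """Return (prompt, answer) for one two-digit addition or subtraction problem."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    op = rng.choice(["+", "-"])
    # Assumption: keep subtraction results non-negative by ordering the operands.
    if op == "-" and b > a:
        a, b = b, a
    answer = a + b if op == "+" else a - b
    return f"{a} {op} {b} =", answer


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    for _ in range(3):
        prompt, answer = make_problem(rng)
        print(prompt, answer)
```

Passing an explicit `random.Random` instance keeps the generator reproducible, which makes it easier to hold out a fixed evaluation set.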

Approach

  • Model: Qwen/Qwen3-0.6B-Base
  • Method: GRPO (Group Relative Policy Optimization) with RLVR (Reinforcement Learning with Verifiable Rewards)
  • Reward: Exact match on answer
  • Steps: 50
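The two core pieces of this setup, a verifiable reward and a group-relative advantage, can be sketched in plain Python. This is an illustrative sketch rather than the training code used here: `exact_match_reward` and `group_relative_advantages` are hypothetical names, the reward assumes the completion starts with the answer (so the first integer found is the prediction), and the advantage uses the population standard deviation with a small `eps`, which may differ from what a given GRPO implementation does.

```python
import re
from statistics import mean, pstdev


def exact_match_reward(completion, answer):
    """Return 1.0 if the first integer in the completion equals the answer, else 0.0."""
    match = re.search(r"-?\d+", completion)
    return 1.0 if match and int(match.group()) == answer else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group.

    GRPO samples several completions per prompt and scores each one
    relative to the others, so no separate value model is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    completions = [" 82", " 81", " 82", " eighty-two"]  # made-up samples for "47 + 35 ="
    rewards = [exact_match_reward(c, 82) for c in completions]
    print(rewards)
    print(group_relative_advantages(rewards))
```

When every completion in a group receives the same reward, all advantages come out as zero, so that prompt contributes no learning signal for the step.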

Expected Results

The base model, which has had no math-specific training, is expected to perform poorly (<10% accuracy), while the trained model should improve significantly.
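One way to put a number on this comparison is exact-match accuracy over a fixed set of held-out problems. The sketch below is an assumption about how that could be measured, not a reported evaluation procedure: `accuracy` is a hypothetical helper, and `predict` stands in for any callable that maps a prompt string to the model's text output.

```python
def accuracy(predict, problems):
    """Fraction of problems where the stripped prediction equals the answer exactly."""
    correct = sum(
        1 for prompt, answer in problems if predict(prompt).strip() == str(answer)
    )
    return correct / len(problems)


if __name__ == "__main__":
    # The two example problems from the Task section.
    problems = [("47 + 35 =", 82), ("92 - 17 =", 75)]

    def always_82(prompt):
        # Placeholder predictor for demonstration; a real run would call the model.
        return " 82"

    print(accuracy(always_82, problems))
```

Running the same function on the base and trained checkpoints with an identical problem set keeps the before and after figures comparable.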