# GRPO Countdown Problem

A project for training language models to solve arithmetic countdown problems using Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO).

## Overview

This project implements a two-stage training pipeline:

1. **SFT (Supervised Fine-Tuning)**: Train the model on arithmetic problems with correct solutions
2. **GRPO (Group Relative Policy Optimization)**: Further optimize the model using reward-based learning

The goal is to train a language model to solve arithmetic countdown problems: given four numbers, build an expression that uses each number exactly once with the basic arithmetic operators (+, -, *, /) and evaluates to a target value.



## Project Structure



```
grpo-countdown-problem/
├── data/                           # Training and test datasets
├── models/                         # Saved model checkpoints
│   ├── sft/                        # SFT model outputs
│   └── grpo/                       # GRPO model outputs
├── src/
│   ├── config/                     # Configuration files
│   │   ├── grpo/                   # GRPO training configs
│   │   └── sft/                    # SFT training configs
│   ├── dataset/                    # Dataset loading and processing
│   ├── examples/                   # Example scripts for inference
│   ├── scripts/                    # Data generation and processing
│   ├── training/                   # Training scripts
│   │   ├── grpo/                   # GRPO training
│   │   └── sft/                    # SFT training
│   └── utils/                      # Utility functions
├── main.py                         # Main entry point
├── pyproject.toml                  # Project dependencies
└── README.md                       # This file
```



## Requirements



- Python 3.12+

- CUDA-capable GPU (recommended)

- At least 8 GB of GPU memory for the Qwen2.5-Math-1.5B model



## Installation



1. **Clone the repository:**

   ```bash
   git clone <repository-url>
   cd grpo-countdown-problem
   ```



2. **Install dependencies using uv (recommended):**

   ```bash
   # Install uv if you haven't already
   curl -LsSf https://astral.sh/uv/install.sh | sh

   # Install project dependencies
   uv sync
   ```



   **Or using pip:**

   ```bash
   pip install -e .
   ```



3. **Set up environment variables (if using OpenAI for data generation):**

   ```bash
   cp .env.example .env
   # Edit .env and add your OpenAI API key
   ```



## Data Preparation



### Generate Training Data



1. **Generate SFT training data:**

   ```bash
   python src/scripts/generate_training_dataset_sft.py \
     --output_path data/sft/train.csv \
     --num_problems 10000 \
     --num_workers 4
   ```



2. **Generate GRPO training data:**

   ```bash
   python src/scripts/generate_training_dataset_grpo.py \
     --output_path data/grpo/train.csv \
     --num_problems 10000 \
     --num_workers 4
   ```



3. **Generate test data:**

   ```bash
   python src/scripts/generate_training_dataset_grpo.py \
     --output_path data/grpo/test.csv \
     --num_problems 1000 \
     --num_workers 4
   ```



### Data Format



The CSV files contain the following columns:

- `id`: Unique problem identifier

- `problem_description`: Natural language description of the problem

- `correct_answer`: The target arithmetic expression

- `num1`, `num2`, `num3`, `num4`: The four numbers to use

- `reasoning` (SFT only): Step-by-step solution explanation
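
As a minimal sketch of how a row in this format could be read, the snippet below parses an in-memory CSV with the columns listed above. The sample row reuses the example problem from this README; the `load_problems` helper and the keys of the dictionaries it returns are illustrative names, not part of the project's code.

```python
import csv
import io

# Illustrative one-row CSV using the documented columns (GRPO variant, no `reasoning`).
SAMPLE_CSV = """id,problem_description,correct_answer,num1,num2,num3,num4
0,"Use 53, 3, 47, and 36 exactly once each with only +, -, *, and / operators to create an expression equal to 133.",53 + 47 + 36 - 3,53,3,47,36
"""


def load_problems(handle):
    """Read problems from an open CSV file object into plain dictionaries."""
    problems = []
    for row in csv.DictReader(handle):
        problems.append(
            {
                "id": row["id"],
                "prompt": row["problem_description"],
                "answer": row["correct_answer"],
                # Collect num1..num4 into a single list of integers.
                "numbers": [int(row[f"num{i}"]) for i in range(1, 5)],
            }
        )
    return problems


problems = load_problems(io.StringIO(SAMPLE_CSV))
print(problems[0]["numbers"])  # [53, 3, 47, 36]
```

To read a real file, pass `open("data/grpo/train.csv", newline="")` instead of the `io.StringIO` object.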



## Training



### Stage 1: Supervised Fine-Tuning (SFT)



Train the base model on arithmetic problems with supervised learning:



```bash
python src/training/sft/train_sft_hydra.py
```



**Configuration:** The training uses Hydra configuration files in `src/config/sft/`:

- `config.yaml`: Main configuration

- `dataset/default.yaml`: Dataset settings

- `model/qwen2.5-3b.yaml`: Model and LoRA settings

- `training/default.yaml`: Training hyperparameters



**Key parameters:**

- Base model: `Qwen/Qwen2.5-Math-1.5B`

- LoRA rank: 64

- Learning rate: 2e-5

- Batch size: 4 (per device)

- Epochs: 2



**Output:** Trained SFT model saved to `models/sft/`



### Stage 2: Group Relative Policy Optimization (GRPO)



Further optimize the SFT model using reward-based learning:



```bash
python src/training/grpo/train_grpo_hydra.py
```



**Configuration:** Uses Hydra configuration files in `src/config/grpo/`:

- `config.yaml`: Main configuration (includes SFT model path)

- `dataset/default.yaml`: Dataset settings

- `model/qwen2.5-3b.yaml`: Model and LoRA settings

- `training/default.yaml`: Training hyperparameters



**Key parameters:**

- Builds on SFT model from `models/sft/`

- Learning rate: 1e-5

- Batch size: 2 (per device)

- Epochs: 1

- Generations per prompt: 8

- Reward function: Mathematical correctness



**Output:** Trained GRPO model saved to `models/grpo/`
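
This README describes the reward only as "mathematical correctness", so the following is a hedged sketch of what such a reward and the group-relative step of GRPO might look like, not the project's actual implementation. The names `countdown_reward` and `group_relative_advantages`, the binary 1.0/0.0 reward, and the use of the population standard deviation are all assumptions made for illustration.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from statistics import mean, pstdev

# Only the four operators allowed by the task.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression exactly, recording every integer literal."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported syntax")


def countdown_reward(expression, numbers, target):
    """Return 1.0 if the expression uses each number once and hits the target."""
    used = []
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) != Counter(numbers):
        return 0.0
    return 1.0 if value == target else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled generations."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight generations for one prompt, two of them correct.
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Correct generations receive a positive advantage and incorrect ones a negative advantage. When every generation in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.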



### Custom Configuration



Override any configuration parameter from the command line using Hydra's `key=value` syntax:



```bash
# Override dataset size
python src/training/sft/train_sft_hydra.py dataset.max_rows=5000

# Override learning rate and batch size
python src/training/grpo/train_grpo_hydra.py \
  training.learning_rate=5e-6 \
  training.per_device_train_batch_size=1

# Use a different output directory
python src/training/sft/train_sft_hydra.py output_dir=models/sft_experiment
```
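
To illustrate what a dotted override such as `training.learning_rate=5e-6` means, the sketch below applies overrides to a nested dictionary. It only mimics the semantics in plain Python and is not how Hydra is implemented; `apply_overrides` is a hypothetical helper, and the starting values are the GRPO defaults listed in this README.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict (modifies it in place)."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        try:
            # Interpret numbers and other Python literals; keep the rest as strings.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


config = {
    "training": {"learning_rate": 1e-5, "per_device_train_batch_size": 2},
    "output_dir": "models/grpo",
}
apply_overrides(
    config,
    ["training.learning_rate=5e-6", "training.per_device_train_batch_size=1"],
)
print(config["training"]["per_device_train_batch_size"])  # 1
```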



## Inference



### Interactive Problem Solving



Use the trained model to solve individual problems:



```bash
python src/examples/run_model.py
```



This loads both the SFT and GRPO models and solves a sample problem.



### Batch Evaluation



Evaluate model accuracy on a test dataset:



```bash
python src/examples/calculate_accuracy.py \
  --csv_path data/grpo/test.csv \
  --sft_model_path models/sft/ \
  --grpo_model_path models/grpo/ \
  --max_samples 100 \
  --output_path results/evaluation_results.csv
```



**Parameters:**

- `--csv_path`: Path to test CSV file

- `--sft_model_path`: Path to SFT model directory

- `--grpo_model_path`: Path to GRPO model directory

- `--max_samples`: Limit number of test samples (optional)

- `--output_path`: Save detailed results to CSV (optional)

- `--temperature`: Sampling temperature (default: 1.0)

- `--max_new_tokens`: Maximum tokens to generate (default: 4096)



**Evaluation Metrics:**

- **Accuracy**: Percentage of problems solved correctly

- **Valid Format Rate**: Percentage of responses in valid arithmetic format

- **Uses All Numbers Rate**: Percentage of responses using all four numbers
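
The README does not spell out how `calculate_accuracy.py` computes these metrics, so the following is only a sketch of one plausible reading. It assumes every rate is a percentage over all evaluated samples, that "valid format" means the response contains nothing but digits, whitespace, parentheses, and the four operators, and that correctness has already been judged elsewhere. All function names are illustrative.

```python
import re
from collections import Counter

# Character-level check only: digits, whitespace, parentheses, and + - * /.
ALLOWED = re.compile(r"[\d\s+\-*/()]+")


def has_valid_format(response):
    text = response.strip()
    return bool(ALLOWED.fullmatch(text)) and any(ch.isdigit() for ch in text)


def uses_all_numbers(response, numbers):
    """True if the integers in the response match the given numbers exactly."""
    found = Counter(int(tok) for tok in re.findall(r"\d+", response))
    return found == Counter(numbers)


def summarize(records):
    """records: iterable of (response, numbers, is_correct) tuples."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {"accuracy": 0.0, "valid_format_rate": 0.0, "uses_all_numbers_rate": 0.0}
    correct = sum(1 for _, _, ok in records if ok)
    valid = sum(1 for resp, _, _ in records if has_valid_format(resp))
    complete = sum(1 for resp, nums, _ in records if uses_all_numbers(resp, nums))
    return {
        "accuracy": 100.0 * correct / total,
        "valid_format_rate": 100.0 * valid / total,
        "uses_all_numbers_rate": 100.0 * complete / total,
    }


numbers = [53, 3, 47, 36]
summary = summarize(
    [
        ("53 + 47 + 36 - 3", numbers, True),    # correct and well formed
        ("53 + 47", numbers, False),            # well formed but incomplete
        ("The answer is 133", numbers, False),  # not an arithmetic expression
    ]
)
```

In this toy run one of three responses is correct, two are in a valid format, and one uses all four numbers.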



### Model-only Evaluation



Evaluate specific model stages:



```bash
# Evaluate only SFT model (no GRPO)
python src/examples/calculate_accuracy.py \
  --csv_path data/grpo/test.csv \
  --sft_model_path models/sft/ \
  --no_grpo

# Evaluate only base model (no SFT or GRPO)
python src/examples/calculate_accuracy.py \
  --csv_path data/grpo/test.csv \
  --no_sft --no_grpo
```



## Configuration Details



### Model Configuration



The project uses **Qwen2.5-Math-1.5B** as the base model with LoRA (Low-Rank Adaptation) for efficient fine-tuning:



- **LoRA rank**: 64

- **LoRA alpha**: 128

- **Target modules**: All attention and MLP layers

- **LoRA dropout**: 0.1
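
To give a feel for what rank 64 and alpha 128 imply, the sketch below computes the standard LoRA scaling factor (`alpha / rank`) and the number of adapter parameters added to a single linear layer. The 1536 x 1536 layer size is a hypothetical example chosen for illustration and is not taken from this project's configuration; `lora_stats` is likewise an illustrative helper.

```python
def lora_stats(d_in, d_out, rank, alpha):
    """Parameter counts and scaling for one LoRA-adapted linear layer."""
    full_params = d_in * d_out
    # LoRA adds two low-rank matrices: A (rank x d_in) and B (d_out x rank).
    adapter_params = rank * (d_in + d_out)
    return {
        "scaling": alpha / rank,
        "adapter_params": adapter_params,
        "full_params": full_params,
        "trainable_fraction": adapter_params / full_params,
    }


# Hypothetical square projection; the real layer shapes depend on the model.
stats = lora_stats(d_in=1536, d_out=1536, rank=64, alpha=128)
print(stats["scaling"])         # 2.0
print(stats["adapter_params"])  # 196608
```

For this example layer the adapter trains roughly one twelfth as many parameters as the frozen weight matrix holds.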



### Training Configuration



**SFT Training:**

- **Optimizer**: AdamW 8-bit

- **Learning rate**: 2e-5 with linear scheduler

- **Warmup ratio**: 0.1

- **Weight decay**: 0.01

- **Max sequence length**: 4096



**GRPO Training:**

- **Optimizer**: AdamW 8-bit

- **Learning rate**: 1e-5 with cosine scheduler

- **Warmup ratio**: 0.1

- **Weight decay**: 0.0

- **Temperature**: 1.0

- **Generations per prompt**: 8
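
The two learning-rate schedules named above (linear for SFT, cosine for GRPO, both with a warmup ratio of 0.1) can be sketched as follows. This follows the common "linear warmup, then decay to zero" form; the exact rounding of warmup steps in the training library may differ, and `lr_at_step` is an illustrative helper rather than project code.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio=0.1, schedule="linear"):
    """Learning rate at a given optimizer step for a warmup + decay schedule."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if schedule == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")


# SFT-style settings: peak 2e-5, linear decay.
sft_peak = lr_at_step(100, 1000, 2e-5, schedule="linear")
# GRPO-style settings: peak 1e-5, cosine decay, sampled halfway through decay.
grpo_mid = lr_at_step(550, 1000, 1e-5, schedule="cosine")
```

Both schedules reach the peak rate at the end of warmup and fall to zero at the final step; the cosine schedule stays near the peak longer before dropping.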



## Monitoring Training



Both training scripts log to TensorBoard:



```bash
# View training logs
tensorboard --logdir models/sft/runs    # For SFT training
tensorboard --logdir models/grpo/runs   # For GRPO training
```



## Example Problem



**Input:** "Use 53, 3, 47, and 36 exactly once each with only +, -, *, and / operators to create an expression equal to 133."

**Expected Output:** A valid arithmetic expression like `53 + 47 + 36 - 3`
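
For reference, a problem like this can also be solved by exhaustive search. The sketch below is a plain brute-force solver, included only to show that the example is solvable and how an answer can be checked; it is unrelated to how the trained model produces its output, and `solve` is an illustrative name.

```python
from fractions import Fraction
from itertools import combinations


def solve(numbers, target):
    """Return an expression using every number once that equals target, or None."""
    items = [(Fraction(n), str(n)) for n in numbers]
    return _search(items, Fraction(target))


def _search(items, target):
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    # Pick any two remaining values, combine them, and recurse on the rest.
    for i, j in combinations(range(len(items)), 2):
        (a, ea), (b, eb) = items[i], items[j]
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea} + {eb})"),
            (a * b, f"({ea} * {eb})"),
            (a - b, f"({ea} - {eb})"),
            (b - a, f"({eb} - {ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea} / {eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb} / {ea})"))
        for value, expr in candidates:
            found = _search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


expression = solve([53, 3, 47, 36], 133)
print(expression)
```

The solver uses exact fractions so that intermediate divisions do not introduce rounding errors. It may return a different, fully parenthesized expression than the one shown above, since several valid answers can exist.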