---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized using **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.

## Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (the RLHF-style component), followed by a GRPO phase driven by multi-candidate feedback (RLAIF) that performs the main policy optimization. The trained model is served via a FastAPI endpoint for real-time inference.

**Key Features:**

- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad candidates for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment

## Project Structure

```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py             # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py # Training with game history context
│   ├── eval_battleground_rlaif.py              # Evaluation scripts
│   ├── infer_battleground_cloud.py             # Cloud inference utilities
│   ├── battleground_nl_utils.py                # Game state to natural language conversion
│   └── datasets/                               # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

**Requirements:**

- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)

### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

The server loads:

- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`

### API Usage

**POST `/generate_actions`**

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```

**Response:**

```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```

## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`, with one JSON object per line:

```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```

Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
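During the SFT warmup, only the `expert` candidate is used as the supervision target. The sketch below shows one way to pull (state, expert action) pairs out of these records; the field names follow the format above, but the `load_sft_pairs` helper itself is hypothetical and not part of the training scripts.

```python
# Hypothetical helper (not part of the repo): extract SFT supervision pairs
# from the multi-candidate JSONL records described above.
import json

def load_sft_pairs(path: str) -> list[tuple[dict, dict]]:
    """Return (state, expert_action) pairs, one per record with an expert candidate."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            expert = next(
                (c for c in record["candidates"] if c["role"] == "expert"),
                None,
            )
            if expert is not None:
                pairs.append((record["state"], expert["action"]))
    return pairs

# Dataset file name taken from the training command below.
pairs = load_sft_pairs("RL/datasets/battleground_rlaif_multicandidate.jsonl")
print(f"{len(pairs)} expert decision steps available for SFT")
```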
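In the GRPO phase, each candidate is scored relative to the other candidates in its group: the rewards within a record are normalized (to zero mean, and commonly unit variance) to form advantages. The following is a minimal sketch of that normalization using the example rewards above; it mirrors the general GRPO formulation rather than the exact code in `RL/train_battleground_rlaif.py`.

```python
# Illustrative sketch of GRPO's group-relative advantage computation for one
# record; not necessarily the exact implementation used by the training script.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize candidate rewards within one group to zero mean / unit std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example record above: expert=1.0, medium=0.5, bad=-0.5
print(group_relative_advantages([1.0, 0.5, -0.5]))
# -> roughly [1.07, 0.27, -1.34]: the expert action is pushed up and the
#    bad action pushed down, relative to the group mean.
```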
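Both phases train LoRA adapters via PEFT rather than the full model weights. Below is a minimal sketch of how such an adapter is typically attached; the rank, alpha, and target modules are illustrative assumptions, not values read from the training scripts.

```python
# Illustrative PEFT LoRA setup; hyperparameters here are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
lora_config = LoraConfig(
    r=16,                 # adapter rank (assumption)
    lora_alpha=32,        # scaling factor (assumption)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```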
### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
    --model Qwen/Qwen3-4B-Instruct \
    --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
    --output ./battleground_rlaif_qwen \
    --sft_epochs 3 \
    --grpo_epochs 3
```

**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
    --model Qwen/Qwen3-4B-Instruct \
    --output ./battleground_rlaif_qwen_gamehistory
```

### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip the SFT phase |
| `--skip_grpo` | False | Skip the GRPO phase |

## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

## Action Types

The model outputs JSON action sequences built from these action types (a validation sketch appears in the appendix at the end of this README):

| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate the hero power |
| `ROLL` | Refresh the tavern offerings |
| `UPGRADE_TAVERN` | Upgrade the tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |

## Related Components

- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).

## License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.

---
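## Appendix: Validating Generated Actions

Because `raw_completion` is free-form model output, a client should verify that it parses to a list of known action types before executing anything in-game. A minimal sketch follows, assuming the action vocabulary from the table above; the `parse_actions` helper is hypothetical and not part of the repository.

```python
# Hypothetical helper (not part of the repo): validate that a model
# completion parses to a list of known action types before execution.
import json

# The eight action types listed in the Action Types table above.
KNOWN_ACTION_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse a raw completion into action dicts, dropping unknown types."""
    try:
        actions = json.loads(raw_completion)
    except json.JSONDecodeError:
        return []  # unparseable completion -> no actions
    if not isinstance(actions, list):
        return []
    return [
        a for a in actions
        if isinstance(a, dict) and a.get("type") in KNOWN_ACTION_TYPES
    ]

raw = '[{"type": "ROLL"}, {"type": "DANCE"}, {"type": "END_TURN"}]'
print(parse_actions(raw))  # [{'type': 'ROLL'}, {'type': 'END_TURN'}]
```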