---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized using **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.

## Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that performs the main optimization using multi-candidate feedback (RLAIF). The trained model is served via a FastAPI endpoint for real-time inference.

**Key Features:**
- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Efficient parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment

## Project Structure

```
DeepBattler-RL/
├── RL/                                    # Core RL training & evaluation
│   ├── train_battleground_rlaif.py        # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py         # Evaluation scripts
│   ├── infer_battleground_cloud.py        # Cloud inference utilities
│   ├── battleground_nl_utils.py           # Game state to natural language conversion
│   └── datasets/                          # Training data (JSONL format)
├── app.py                                 # FastAPI inference server
├── Dockerfile                             # Docker deployment config
├── requirements.txt                       # Python dependencies
├── Agent/                                 # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                     # HDT plugin for game state extraction
```

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)

### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

The server loads:
- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
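
To reproduce this loading step outside the server, here is a minimal sketch assuming standard `transformers` + `peft` loading (the actual `app.py` may differ in dtype and device handling; only the model IDs are taken from this README):

```python
# Sketch: load the base model and attach the GRPO-trained LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
# LoRA weights sit on top of the frozen base parameters.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()
```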

### API Usage

**POST `/generate_actions`**

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```

**Response:**
```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```
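
A minimal client sketch in Python (assumes the server is running locally on port 7860; the empty `state` fields are illustrative placeholders, not a real game state):

```python
import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {
        "game_state": {},  # placeholders; in practice this is the
        "tavern": [],      # JSON extracted by the HDT plugin
        "hand": [],
        "board": [],
    },
    "max_new_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(
    "http://localhost:7860/generate_actions", json=payload, timeout=120
)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action["type"], action)
```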

## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`:

```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```

Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
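
To make the "group relative" part concrete, here is an illustrative computation (not code from the training script): GRPO normalizes rewards within each step's candidate group, so the expert action is reinforced relative to its medium/bad siblings rather than against an absolute baseline.

```python
import json
import statistics

def group_relative_advantages(candidates):
    """Normalize rewards within one step's candidate group (GRPO-style)."""
    rewards = [c["reward"] for c in candidates]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(c["role"], (c["reward"] - mean) / std) for c in candidates]

# Inspect the first training step of the multi-candidate dataset.
with open("RL/datasets/battleground_rlaif_multicandidate.jsonl") as f:
    step = json.loads(next(f))
    print(step["step_id"], group_relative_advantages(step["candidates"]))
```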

### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3
```

**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --output ./battleground_rlaif_qwen_gamehistory
```

### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct-2507` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |

## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

## Action Types

The model outputs JSON action sequences with these action types:

| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
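
Clients that parse `raw_completion` themselves may want to validate actions against this table. A small sketch (the type set comes from the table above; any per-type field requirements are inferred from the API examples in this README, not a published schema):

```python
import json

# Action types listed in the table above.
VALID_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse a model completion into a validated action list."""
    actions = json.loads(raw_completion)
    for action in actions:
        if action.get("type") not in VALID_TYPES:
            raise ValueError(f"Unknown action type: {action.get('type')}")
    return actions
```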

## Related Components

- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).

## License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.

---