---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized using **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.

## Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that performs the main optimization using multi-candidate feedback (RLAIF). The trained model is served via a FastAPI endpoint for real-time inference.

**Key Features:**
- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Efficient parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment

## Project Structure

```
DeepBattler-RL/
├── RL/                                    # Core RL training & evaluation
│   ├── train_battleground_rlaif.py        # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py         # Evaluation scripts
│   ├── infer_battleground_cloud.py        # Cloud inference utilities
│   ├── battleground_nl_utils.py           # Game state to natural language conversion
│   └── datasets/                          # Training data (JSONL format)
├── app.py                                 # FastAPI inference server
├── Dockerfile                             # Docker deployment config
├── requirements.txt                       # Python dependencies
├── Agent/                                 # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                     # HDT plugin for game state extraction
```

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)

### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

The server loads:
- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
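
To reproduce this loading step outside the server, here is a minimal sketch assuming standard `transformers` + `peft` loading (the actual `app.py` may differ in dtype and device handling; only the model IDs are taken from this README):

```python
# Sketch: load the base model and attach the GRPO-trained LoRA adapter.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
# LoRA weights sit on top of the frozen base parameters.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()
```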

### API Usage

**POST `/generate_actions`**

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```

**Response:**
```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```
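
A minimal client sketch in Python (assumes the server is running locally on port 7860; the empty `state` fields are illustrative placeholders, not a real game state):

```python
import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {
        "game_state": {},  # placeholders; in practice this is the
        "tavern": [],      # JSON extracted by the HDT plugin
        "hand": [],
        "board": [],
    },
    "max_new_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(
    "http://localhost:7860/generate_actions", json=payload, timeout=120
)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action["type"], action)
```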

## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`:

```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```

Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
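
To make the "group relative" part concrete, here is an illustrative computation (not code from the training script): GRPO normalizes rewards within each step's candidate group, so the expert action is reinforced relative to its medium/bad siblings rather than against an absolute baseline.

```python
import json
import statistics

def group_relative_advantages(candidates):
    """Normalize rewards within one step's candidate group (GRPO-style)."""
    rewards = [c["reward"] for c in candidates]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(c["role"], (c["reward"] - mean) / std) for c in candidates]

# Inspect the first training step of the multi-candidate dataset.
with open("RL/datasets/battleground_rlaif_multicandidate.jsonl") as f:
    step = json.loads(next(f))
    print(step["step_id"], group_relative_advantages(step["candidates"]))
```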

### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3
```

**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --output ./battleground_rlaif_qwen_gamehistory
```

### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct-2507` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |

## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

## Action Types

The model outputs JSON action sequences with these action types:

| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
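
Clients that parse `raw_completion` themselves may want to validate actions against this table. A small sketch (the type set comes from the table above; any per-type field requirements are inferred from the API examples in this README, not a published schema):

```python
import json

# Action types listed in the table above.
VALID_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse a model completion into a validated action list."""
    actions = json.loads(raw_completion)
    for action in actions:
        if action.get("type") not in VALID_TYPES:
            raise ValueError(f"Unknown action type: {action.get('type')}")
    return actions
```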

## Related Components

- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).

## License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.

---