---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds
This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized via **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.
## Overview
DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (the RLHF component), followed by a GRPO phase that performs the main optimization using multi-candidate AI feedback (RLAIF). The trained model is served via a FastAPI endpoint for real-time inference.
**Key Features:**
- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT (a minimal configuration sketch follows this list)
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment
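The authoritative adapter settings live in the training scripts. Purely as an illustration, a typical PEFT LoRA setup for a causal LM looks like the sketch below; the `r`, `lora_alpha`, and `target_modules` values here are assumptions, not the repository's actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA hyperparameters -- check the training scripts for the real values.
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```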
## Project Structure
```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py              # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py               # Evaluation scripts
│   ├── infer_battleground_cloud.py              # Cloud inference utilities
│   ├── battleground_nl_utils.py                 # Game state to natural language conversion
│   └── datasets/                                # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)
### Running the Inference Server
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
The server loads:
- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
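If you need the same checkpoint outside the server (for scripting or debugging), the standard transformers + peft loading pattern looks roughly like this; `app.py` is the authoritative version, and the dtype and `device_map` choices below are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,  # assumed dtype; app.py may differ
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)  # attach the LoRA adapter
model.eval()
```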
### API Usage
**POST `/generate_actions`**
```json
{
"phase": "PlayerTurn",
"turn": 5,
"state": {
"game_state": { ... },
"tavern": [ ... ],
"hand": [ ... ],
"board": [ ... ]
},
"max_new_tokens": 256,
"temperature": 0.2
}
```
**Response:**
```json
{
"actions": [
{"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
{"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
{"type": "END_TURN"}
],
"raw_completion": "..."
}
```
## Training
### Dataset Format
Training data is stored in JSONL format under `RL/datasets/`:
```json
{
"game_id": "...",
"step_id": 0,
"turn": 3,
"phase": "PlayerTurn",
"state": { ... },
"candidates": [
{"role": "expert", "action": {...}, "reward": 1.0},
{"role": "medium", "action": {...}, "reward": 0.5},
{"role": "bad", "action": {...}, "reward": -0.5}
]
}
```
Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
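To make the GRPO step concrete: each example's candidates form one group, and the policy update is weighted by group-relative advantages (each reward minus the group mean, normalized by the group standard deviation). Below is a minimal sketch of loading the JSONL and computing those advantages; it illustrates the idea only and is not the repository's implementation:

```python
import json
import statistics

def group_advantages(candidates, eps=1e-6):
    """Group-relative advantages as used by GRPO: (r_i - mean) / std."""
    rewards = [c["reward"] for c in candidates]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

with open("RL/datasets/battleground_rlaif_multicandidate.jsonl") as f:
    for line in f:
        example = json.loads(line)
        advs = group_advantages(example["candidates"])
        # For rewards [1.0, 0.5, -0.5] this yields roughly [1.07, 0.27, -1.34]:
        # the expert action is pushed up, the bad action pushed down.
```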
### Running Training
**SFT + GRPO Pipeline:**
```bash
python RL/train_battleground_rlaif.py \
--model Qwen/Qwen3-4B-Instruct \
--data RL/datasets/battleground_rlaif_multicandidate.jsonl \
--output ./battleground_rlaif_qwen \
--sft_epochs 3 \
--grpo_epochs 3
```
**With Game History Context:**
```bash
python RL/train_battleground_rlaif_gamehistory.py \
--model Qwen/Qwen3-4B-Instruct \
--output ./battleground_rlaif_qwen_gamehistory
```
### Training Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |
## Docker Deployment
```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```
For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.
## Action Types
The model outputs JSON action sequences with these action types:
| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
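Downstream consumers should validate the model's output before acting on it. A small, hypothetical validator is sketched below; the allowed types mirror the table above, and the assumption that `raw_completion` parses as a JSON array is mine, not a guarantee of the model's output format:

```python
import json

ALLOWED_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse and validate an action list from the model's raw JSON output."""
    actions = json.loads(raw_completion)
    if not isinstance(actions, list):
        raise ValueError("expected a JSON array of actions")
    for action in actions:
        if action.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"unknown action type: {action.get('type')!r}")
    return actions

# Example
print(parse_actions('[{"type": "ROLL"}, {"type": "END_TURN"}]'))
```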
## Related Components
- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)
For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).
## License
This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.
---