---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the RL training and inference pipeline for DeepBattler. It combines RLHF (Reinforcement Learning from Human Feedback) on human expert actions with RLAIF (Reinforcement Learning from AI Feedback) optimized via GRPO (Group Relative Policy Optimization) to train a policy model for Hearthstone Battlegrounds decision-making.

Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that uses multi-candidate feedback (RLAIF) as the main optimization step. The trained model is served via a FastAPI endpoint for real-time inference.
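
GRPO's central step is simple to state: for each game state, collect a group of candidate actions, score them, and baseline each candidate's reward against the group's mean. A minimal, illustrative sketch of that advantage computation (not the repository's exact implementation):

import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: baseline each candidate's reward against
    the mean of its group and scale by the group's std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# One decision step with expert / medium / bad candidates:
print(group_relative_advantages([1.0, 0.5, -0.5]))  # the expert gets the largest positive advantage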

Key Features:

  • SFT + GRPO Training Pipeline - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
  • RLHF + Multi-Candidate RLAIF - Human expert actions as expert candidates plus additional medium/bad actions for preference-based GRPO
  • LoRA Fine-tuning - Parameter-efficient training with PEFT (see the sketch after this list)
  • FastAPI Inference Server - Production-ready API for action generation
  • Docker Deployment - Ready for HuggingFace Spaces or self-hosted deployment
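
LoRA trains small low-rank adapter matrices while the 4B base model stays frozen. A minimal PEFT sketch; the rank, alpha, and target modules below are illustrative assumptions, not the repository's exact configuration:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
lora = LoraConfig(
    r=16,                 # adapter rank (assumed value)
    lora_alpha=32,        # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable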

Project Structure

DeepBattler-RL/
├── RL/                                    # Core RL training & evaluation
│   ├── train_battleground_rlaif.py        # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py         # Evaluation scripts
│   ├── infer_battleground_cloud.py        # Cloud inference utilities
│   ├── battleground_nl_utils.py           # Game state to natural language conversion
│   └── datasets/                          # Training data (JSONL format)
├── app.py                                 # FastAPI inference server
├── Dockerfile                             # Docker deployment config
├── requirements.txt                       # Python dependencies
├── Agent/                                 # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                     # HDT plugin for game state extraction

Quick Start

Installation

pip install -r requirements.txt

Requirements:

  • Python 3.10+
  • PyTorch >= 2.1.0
  • CUDA (recommended for training)

Running the Inference Server

uvicorn app:app --host 0.0.0.0 --port 7860

The server loads:

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • LoRA Adapter: iteratehack/battleground-rlaif-qwen-gamehistory-grpo
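
Conceptually, the startup path looks like the following sketch (the exact loading code lives in app.py and may differ, e.g. in dtype or device placement):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the trained LoRA weights
model.eval()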

API Usage

POST /generate_actions

{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}

Response:

{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
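
A minimal Python client for the endpoint, assuming the server runs locally on port 7860:

import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {"game_state": {}, "tavern": [], "hand": [], "board": []},  # fill with real HDT-extracted state
    "max_new_tokens": 256,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:7860/generate_actions", json=payload, timeout=120)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action["type"], action)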

Training

Dataset Format

Training data is stored in JSONL format under RL/datasets/:

{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}

Here the expert role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
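
For reference, a small sketch of how such records can be read and grouped for training (illustrative; see the training scripts for the actual data pipeline):

import json

def load_steps(path):
    """Yield one decision step (state + scored candidates) per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for step in load_steps("RL/datasets/battleground_rlaif_multicandidate.jsonl"):
    rewards = [c["reward"] for c in step["candidates"]]  # one reward group per state, as GRPO expects
    expert = next(c for c in step["candidates"] if c["role"] == "expert")  # the human expert (RLHF) signal
    print(step["turn"], rewards, expert["action"])
    break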

Running Training

SFT + GRPO Pipeline:

python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3

With Game History Context:

python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct \
  --output ./battleground_rlaif_qwen_gamehistory

Training Configuration

| Parameter | Default | Description |
|---|---|---|
| --model | Qwen/Qwen3-4B-Instruct | Base model path |
| --sft_epochs | 3 | SFT training epochs |
| --grpo_epochs | 3 | GRPO training epochs |
| --per_device_batch_size | 4 | Batch size per GPU |
| --sft_learning_rate | 1e-5 | SFT learning rate |
| --grpo_learning_rate | 5e-6 | GRPO learning rate |
| --max_seq_length | 1024 | Maximum sequence length |
| --skip_sft | False | Skip SFT phase |
| --skip_grpo | False | Skip GRPO phase |
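
The --skip_sft and --skip_grpo flags run each stage in isolation. For example, to run only the GRPO phase:

python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --skip_sft \
  --grpo_epochs 3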

Docker Deployment

docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

Action Types

The model outputs JSON action sequences with these action types:

| Action Type | Description |
|---|---|
| BUY_FROM_TAVERN | Purchase a minion from the tavern |
| PLAY_FROM_HAND | Play a minion from hand to board |
| SELL_FROM_BOARD | Sell a minion from the board |
| HERO_POWER | Activate hero power |
| ROLL | Refresh the tavern |
| UPGRADE_TAVERN | Upgrade tavern tier |
| FREEZE | Freeze the current tavern |
| END_TURN | End the current turn |
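
An illustrative sequence for a single turn, combining several of these types (values are hypothetical):

[
  {"type": "ROLL"},
  {"type": "BUY_FROM_TAVERN", "tavern_index": 1, "card_name": "Sellemental"},
  {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 1},
  {"type": "UPGRADE_TAVERN"},
  {"type": "END_TURN"}
]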

Related Components

  • DeepBattlerPlugin/ - C# HDT plugin that extracts game state to JSON
  • Agent/ - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main DeepBattler repository.

License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.