---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the RL training and inference pipeline for DeepBattler. It combines RLHF (Reinforcement Learning from Human Feedback) on human expert actions with RLAIF (Reinforcement Learning from AI Feedback) optimized via GRPO (Group Relative Policy Optimization) to train a policy model for Hearthstone Battlegrounds decision-making.

Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that uses multi-candidate feedback (RLAIF) as the main optimization step. The trained model is served via a FastAPI endpoint for real-time inference.
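
GRPO's central step is simple to state: for each game state, collect a group of candidate actions, score them, and baseline each candidate's reward against the group's mean. A minimal, illustrative sketch of that advantage computation (not the repository's exact implementation):

import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantage: baseline each candidate's reward against
    the mean of its group and scale by the group's std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# One decision step with expert / medium / bad candidates:
print(group_relative_advantages([1.0, 0.5, -0.5]))  # the expert gets the largest positive advantage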

Key Features:

  • SFT + GRPO Training Pipeline - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
  • RLHF + Multi-Candidate RLAIF - Human expert actions as expert candidates plus additional medium/bad actions for preference-based GRPO
  • LoRA Fine-tuning - Parameter-efficient training with PEFT (see the sketch after this list)
  • FastAPI Inference Server - Production-ready API for action generation
  • Docker Deployment - Ready for HuggingFace Spaces or self-hosted deployment
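
LoRA trains small low-rank adapter matrices while the 4B base model stays frozen. A minimal PEFT sketch; the rank, alpha, and target modules below are illustrative assumptions, not the repository's exact configuration:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
lora = LoraConfig(
    r=16,                 # adapter rank (assumed value)
    lora_alpha=32,        # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable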

Project Structure

DeepBattler-RL/
├── RL/                                    # Core RL training & evaluation
│   ├── train_battleground_rlaif.py        # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py         # Evaluation scripts
│   ├── infer_battleground_cloud.py        # Cloud inference utilities
│   ├── battleground_nl_utils.py           # Game state to natural language conversion
│   └── datasets/                          # Training data (JSONL format)
├── app.py                                 # FastAPI inference server
├── Dockerfile                             # Docker deployment config
├── requirements.txt                       # Python dependencies
├── Agent/                                 # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                     # HDT plugin for game state extraction

Quick Start

Installation

pip install -r requirements.txt

Requirements:

  • Python 3.10+
  • PyTorch >= 2.1.0
  • CUDA (recommended for training)

Running the Inference Server

uvicorn app:app --host 0.0.0.0 --port 7860

The server loads:

  • Base Model: Qwen/Qwen3-4B-Instruct-2507
  • LoRA Adapter: iteratehack/battleground-rlaif-qwen-gamehistory-grpo
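
Conceptually, the startup path looks like the following sketch (the exact loading code lives in app.py and may differ, e.g. in dtype or device placement):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the trained LoRA weights
model.eval()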

API Usage

POST /generate_actions

{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}

Response:

{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
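
A minimal Python client for the endpoint, assuming the server runs locally on port 7860:

import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {"game_state": {}, "tavern": [], "hand": [], "board": []},  # fill with real HDT-extracted state
    "max_new_tokens": 256,
    "temperature": 0.2,
}
resp = requests.post("http://localhost:7860/generate_actions", json=payload, timeout=120)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action["type"], action)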

Training

Dataset Format

Training data is stored in JSONL format under RL/datasets/:

{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}

Here the expert role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
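
For reference, a small sketch of how such records can be read and grouped for training (illustrative; see the training scripts for the actual data pipeline):

import json

def load_steps(path):
    """Yield one decision step (state + scored candidates) per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for step in load_steps("RL/datasets/battleground_rlaif_multicandidate.jsonl"):
    rewards = [c["reward"] for c in step["candidates"]]  # one reward group per state, as GRPO expects
    expert = next(c for c in step["candidates"] if c["role"] == "expert")  # the human expert (RLHF) signal
    print(step["turn"], rewards, expert["action"])
    break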

Running Training

SFT + GRPO Pipeline:

python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3

With Game History Context:

python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct \
  --output ./battleground_rlaif_qwen_gamehistory

Training Configuration

| Parameter | Default | Description |
|---|---|---|
| --model | Qwen/Qwen3-4B-Instruct | Base model path |
| --sft_epochs | 3 | SFT training epochs |
| --grpo_epochs | 3 | GRPO training epochs |
| --per_device_batch_size | 4 | Batch size per GPU |
| --sft_learning_rate | 1e-5 | SFT learning rate |
| --grpo_learning_rate | 5e-6 | GRPO learning rate |
| --max_seq_length | 1024 | Maximum sequence length |
| --skip_sft | False | Skip SFT phase |
| --skip_grpo | False | Skip GRPO phase |
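
The --skip_sft and --skip_grpo flags run each stage in isolation. For example, to run only the GRPO phase:

python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --skip_sft \
  --grpo_epochs 3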

Docker Deployment

docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

Action Types

The model outputs JSON action sequences with these action types:

| Action Type | Description |
|---|---|
| BUY_FROM_TAVERN | Purchase a minion from the tavern |
| PLAY_FROM_HAND | Play a minion from hand to board |
| SELL_FROM_BOARD | Sell a minion from the board |
| HERO_POWER | Activate hero power |
| ROLL | Refresh the tavern |
| UPGRADE_TAVERN | Upgrade tavern tier |
| FREEZE | Freeze the current tavern |
| END_TURN | End the current turn |
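
An illustrative sequence for a single turn, combining several of these types (values are hypothetical):

[
  {"type": "ROLL"},
  {"type": "BUY_FROM_TAVERN", "tavern_index": 1, "card_name": "Sellemental"},
  {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 1},
  {"type": "UPGRADE_TAVERN"},
  {"type": "END_TURN"}
]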

Related Components

  • DeepBattlerPlugin/ - C# HDT plugin that extracts game state to JSON
  • Agent/ - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main DeepBattler repository.

License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.