---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds
This repository contains the RL training and inference pipeline for DeepBattler, combining RLHF (Reinforcement Learning from Human Feedback) on human expert actions with RLAIF (Reinforcement Learning from AI Feedback) optimized via GRPO (Group Relative Policy Optimization) to train a policy model for Hearthstone Battlegrounds decision-making.
## Overview
DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (RLHF-style), followed by a GRPO phase that uses multi-candidate feedback (RLAIF) as the main optimization stage. The trained model is served via a FastAPI endpoint for real-time inference.
**Key Features:**
- SFT + GRPO Training Pipeline - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- RLHF + Multi-Candidate RLAIF - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- LoRA Fine-tuning - Parameter-efficient training with PEFT (see the sketch after this list)
- FastAPI Inference Server - Production-ready API for action generation
- Docker Deployment - Ready for HuggingFace Spaces or self-hosted deployment
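
To make the LoRA setup concrete, here is a minimal PEFT sketch; the rank, alpha, dropout, and target modules below are illustrative assumptions, not the repository's actual hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA hyperparameters -- the training scripts may use different values.
lora_config = LoraConfig(
    r=16,                      # low-rank dimension of the adapter matrices
    lora_alpha=32,             # scaling factor applied to the adapter output
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```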
## Project Structure
```
DeepBattler-RL/
├── RL/                                         # Core RL training & evaluation
│   ├── train_battleground_rlaif.py             # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py # Training with game history context
│   ├── eval_battleground_rlaif.py              # Evaluation scripts
│   ├── infer_battleground_cloud.py             # Cloud inference utilities
│   ├── battleground_nl_utils.py                # Game state to natural language conversion
│   └── datasets/                               # Training data (JSONL format)
├── app.py                                      # FastAPI inference server
├── Dockerfile                                  # Docker deployment config
├── requirements.txt                            # Python dependencies
├── Agent/                                      # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                          # HDT plugin for game state extraction
```
## Quick Start

### Installation

```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)
### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
The server loads:
- Base Model: `Qwen/Qwen3-4B-Instruct-2507`
- LoRA Adapter: `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
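
For reference, loading the base model with the LoRA adapter typically follows the standard transformers + PEFT pattern shown below; this is a sketch of that pattern, not a copy of `app.py`.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the GRPO-trained LoRA weights
model.eval()
```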
## API Usage

### POST /generate_actions

Request body:

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```
Response:

```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```
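
A minimal Python client for this endpoint might look like the following (assuming the server is running locally on port 7860; the payload mirrors the request schema above):

```python
import requests

payload = {
    "phase": "PlayerTurn",
    "turn": 5,
    "state": {
        "game_state": {},  # full game state as extracted by the HDT plugin
        "tavern": [],
        "hand": [],
        "board": [],
    },
    "max_new_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post("http://localhost:7860/generate_actions", json=payload, timeout=60)
resp.raise_for_status()
for action in resp.json()["actions"]:
    print(action)
```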
## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`:
```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```
Here the `expert` role corresponds to human expert actions (the RLHF component), while the `medium` and `bad` roles provide additional candidates used for RLAIF with GRPO.
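
To make the GRPO step concrete: group-relative optimization scores each candidate against the statistics of its own group, so the expert action receives a positive advantage and the bad action a negative one. A minimal sketch of that normalization (illustrative only; the training script's exact computation may differ):

```python
from statistics import mean, pstdev

def group_relative_advantages(candidates: list[dict], eps: float = 1e-6) -> list[float]:
    """Normalize each candidate's reward against its group mean and std (GRPO-style)."""
    rewards = [c["reward"] for c in candidates]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

candidates = [
    {"role": "expert", "reward": 1.0},
    {"role": "medium", "reward": 0.5},
    {"role": "bad", "reward": -0.5},
]
print(group_relative_advantages(candidates))
# ~[1.07, 0.27, -1.34]: the policy is pushed toward the expert action, away from the bad one
```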
### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
  --model Qwen/Qwen3-4B-Instruct \
  --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
  --output ./battleground_rlaif_qwen \
  --sft_epochs 3 \
  --grpo_epochs 3
```
**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
  --model Qwen/Qwen3-4B-Instruct \
  --output ./battleground_rlaif_qwen_gamehistory
```
### Training Configuration

| Parameter | Default | Description |
|---|---|---|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |
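
These flags suggest a standard argparse interface; a hypothetical sketch matching the table (the real script may define additional options):

```python
import argparse

parser = argparse.ArgumentParser(description="SFT + GRPO training for DeepBattler-RL")
parser.add_argument("--model", default="Qwen/Qwen3-4B-Instruct", help="Base model path")
parser.add_argument("--sft_epochs", type=int, default=3, help="SFT training epochs")
parser.add_argument("--grpo_epochs", type=int, default=3, help="GRPO training epochs")
parser.add_argument("--per_device_batch_size", type=int, default=4, help="Batch size per GPU")
parser.add_argument("--sft_learning_rate", type=float, default=1e-5, help="SFT learning rate")
parser.add_argument("--grpo_learning_rate", type=float, default=5e-6, help="GRPO learning rate")
parser.add_argument("--max_seq_length", type=int, default=1024, help="Maximum sequence length")
parser.add_argument("--skip_sft", action="store_true", help="Skip the SFT warmup phase")
parser.add_argument("--skip_grpo", action="store_true", help="Skip the GRPO phase")
args = parser.parse_args()
```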
## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```
For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.
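
A Dockerfile along these lines is typical for a FastAPI Space (a sketch under the assumption of a plain pip-based image, not necessarily the repository's actual Dockerfile; Spaces expects the app to listen on port 7860):

```dockerfile
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 7860
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```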
## Action Types
The model outputs JSON action sequences with these action types:
| Action Type | Description |
|---|---|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
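
Because the model emits free-form text that must parse into this schema, a consumer typically validates the completion before acting on it. A hypothetical validator (not the repository's own parsing logic):

```python
import json

VALID_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse a model completion into a validated action list (hypothetical helper)."""
    try:
        actions = json.loads(raw_completion)
    except json.JSONDecodeError:
        return [{"type": "END_TURN"}]  # safe fallback on malformed output
    return [a for a in actions if isinstance(a, dict) and a.get("type") in VALID_TYPES]

print(parse_actions('[{"type": "ROLL"}, {"type": "END_TURN"}]'))
# [{'type': 'ROLL'}, {'type': 'END_TURN'}]
```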
## Related Components

- `DeepBattlerPlugin/` - C# HDT plugin that extracts game state to JSON
- `Agent/` - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)
For the full DeepBattler experience with HDT integration, see the main DeepBattler repository.
## License
This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.