---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds

This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized using **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.

## Overview

DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (the RLHF-style component), followed by a GRPO phase driven by multi-candidate feedback (RLAIF) that performs the main policy optimization. The trained model is served via a FastAPI endpoint for real-time inference.

**Key Features:**

- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad candidates for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment

## Project Structure

```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py             # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py # Training with game history context
│   ├── eval_battleground_rlaif.py              # Evaluation scripts
│   ├── infer_battleground_cloud.py             # Cloud inference utilities
│   ├── battleground_nl_utils.py                # Game state to natural language conversion
│   └── datasets/                               # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```

## Quick Start

### Installation

```bash
pip install -r requirements.txt
```

**Requirements:**

- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)

### Running the Inference Server

```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```

The server loads:

- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`

### API Usage

**POST `/generate_actions`**

```json
{
  "phase": "PlayerTurn",
  "turn": 5,
  "state": {
    "game_state": { ... },
    "tavern": [ ... ],
    "hand": [ ... ],
    "board": [ ... ]
  },
  "max_new_tokens": 256,
  "temperature": 0.2
}
```

**Response:**

```json
{
  "actions": [
    {"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
    {"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
    {"type": "END_TURN"}
  ],
  "raw_completion": "..."
}
```

## Training

### Dataset Format

Training data is stored in JSONL format under `RL/datasets/`, with one JSON object per line:

```json
{
  "game_id": "...",
  "step_id": 0,
  "turn": 3,
  "phase": "PlayerTurn",
  "state": { ... },
  "candidates": [
    {"role": "expert", "action": {...}, "reward": 1.0},
    {"role": "medium", "action": {...}, "reward": 0.5},
    {"role": "bad", "action": {...}, "reward": -0.5}
  ]
}
```

Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
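During the SFT warmup, only the `expert` candidate is used as the supervision target. The sketch below shows one way to pull (state, expert action) pairs out of these records; the field names follow the format above, but the `load_sft_pairs` helper itself is hypothetical and not part of the training scripts.

```python
# Hypothetical helper (not part of the repo): extract SFT supervision pairs
# from the multi-candidate JSONL records described above.
import json

def load_sft_pairs(path: str) -> list[tuple[dict, dict]]:
    """Return (state, expert_action) pairs, one per record with an expert candidate."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            expert = next(
                (c for c in record["candidates"] if c["role"] == "expert"),
                None,
            )
            if expert is not None:
                pairs.append((record["state"], expert["action"]))
    return pairs

# Dataset file name taken from the training command below.
pairs = load_sft_pairs("RL/datasets/battleground_rlaif_multicandidate.jsonl")
print(f"{len(pairs)} expert decision steps available for SFT")
```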
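In the GRPO phase, each candidate is scored relative to the other candidates in its group: the rewards within a record are normalized (to zero mean, and commonly unit variance) to form advantages. The following is a minimal sketch of that normalization using the example rewards above; it mirrors the general GRPO formulation rather than the exact code in `RL/train_battleground_rlaif.py`.

```python
# Illustrative sketch of GRPO's group-relative advantage computation for one
# record; not necessarily the exact implementation used by the training script.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize candidate rewards within one group to zero mean / unit std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example record above: expert=1.0, medium=0.5, bad=-0.5
print(group_relative_advantages([1.0, 0.5, -0.5]))
# -> roughly [1.07, 0.27, -1.34]: the expert action is pushed up and the
#    bad action pushed down, relative to the group mean.
```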
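Both phases train LoRA adapters via PEFT rather than the full model weights. Below is a minimal sketch of how such an adapter is typically attached; the rank, alpha, and target modules are illustrative assumptions, not values read from the training scripts.

```python
# Illustrative PEFT LoRA setup; hyperparameters here are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
lora_config = LoraConfig(
    r=16,                 # adapter rank (assumption)
    lora_alpha=32,        # scaling factor (assumption)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```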
### Running Training

**SFT + GRPO Pipeline:**

```bash
python RL/train_battleground_rlaif.py \
    --model Qwen/Qwen3-4B-Instruct \
    --data RL/datasets/battleground_rlaif_multicandidate.jsonl \
    --output ./battleground_rlaif_qwen \
    --sft_epochs 3 \
    --grpo_epochs 3
```

**With Game History Context:**

```bash
python RL/train_battleground_rlaif_gamehistory.py \
    --model Qwen/Qwen3-4B-Instruct \
    --output ./battleground_rlaif_qwen_gamehistory
```

### Training Configuration

| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip the SFT phase |
| `--skip_grpo` | False | Skip the GRPO phase |

## Docker Deployment

```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```

For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.

## Action Types

The model outputs JSON action sequences built from these action types (a validation sketch appears in the appendix at the end of this README):

| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate the hero power |
| `ROLL` | Refresh the tavern offerings |
| `UPGRADE_TAVERN` | Upgrade the tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |

## Related Components

- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)

For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).

## License

This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.

---
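## Appendix: Validating Generated Actions

Because `raw_completion` is free-form model output, a client should verify that it parses to a list of known action types before executing anything in-game. A minimal sketch follows, assuming the action vocabulary from the table above; the `parse_actions` helper is hypothetical and not part of the repository.

```python
# Hypothetical helper (not part of the repo): validate that a model
# completion parses to a list of known action types before execution.
import json

# The eight action types listed in the Action Types table above.
KNOWN_ACTION_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse a raw completion into action dicts, dropping unknown types."""
    try:
        actions = json.loads(raw_completion)
    except json.JSONDecodeError:
        return []  # unparseable completion -> no actions
    if not isinstance(actions, list):
        return []
    return [
        a for a in actions
        if isinstance(a, dict) and a.get("type") in KNOWN_ACTION_TYPES
    ]

raw = '[{"type": "ROLL"}, {"type": "DANCE"}, {"type": "END_TURN"}]'
print(parse_actions(raw))  # [{'type': 'ROLL'}, {'type': 'END_TURN'}]
```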