---
title: DeepBattler-RL
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---
# DeepBattler-RL: Reinforcement Learning Agents for Hearthstone Battlegrounds
This repository contains the **RL training and inference pipeline** for DeepBattler, combining **RLHF (Reinforcement Learning from Human Feedback)** on human expert actions with **RLAIF (Reinforcement Learning from AI Feedback)** optimized via **GRPO (Group Relative Policy Optimization)** to train a policy model for Hearthstone Battlegrounds decision-making.
## Overview
DeepBattler-RL fine-tunes a Qwen3-4B-Instruct model in two stages: an SFT warmup on human expert trajectories (the RLHF component), followed by a GRPO phase that performs the main optimization using multi-candidate AI feedback (RLAIF). The trained model is served via a FastAPI endpoint for real-time inference.
**Key Features:**
- **SFT + GRPO Training Pipeline** - SFT warmup on human expert (RLHF-style) data, then GRPO as the main optimization step
- **RLHF + Multi-Candidate RLAIF** - Human expert actions as `expert` candidates plus additional medium/bad actions for preference-based GRPO
- **LoRA Fine-tuning** - Parameter-efficient training with PEFT (a minimal configuration sketch follows this list)
- **FastAPI Inference Server** - Production-ready API for action generation
- **Docker Deployment** - Ready for HuggingFace Spaces or self-hosted deployment
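The authoritative adapter settings live in the training scripts. Purely as an illustration, a typical PEFT LoRA setup for a causal LM looks like the sketch below; the `r`, `lora_alpha`, and `target_modules` values here are assumptions, not the repository's actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA hyperparameters -- check the training scripts for the real values.
lora_config = LoraConfig(
    r=16,                                  # low-rank dimension
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B-Instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the adapter weights are trainable
```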
## Project Structure
```
DeepBattler-RL/
├── RL/                                          # Core RL training & evaluation
│   ├── train_battleground_rlaif.py              # SFT + GRPO training pipeline
│   ├── train_battleground_rlaif_gamehistory.py  # Training with game history context
│   ├── eval_battleground_rlaif.py               # Evaluation scripts
│   ├── infer_battleground_cloud.py              # Cloud inference utilities
│   ├── battleground_nl_utils.py                 # Game state to natural language conversion
│   └── datasets/                                # Training data (JSONL format)
├── app.py                                       # FastAPI inference server
├── Dockerfile                                   # Docker deployment config
├── requirements.txt                             # Python dependencies
├── Agent/                                       # LLM agent callers (OpenAI, Gemma)
└── DeepBattlerPlugin/                           # HDT plugin for game state extraction
```
## Quick Start
### Installation
```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.10+
- PyTorch >= 2.1.0
- CUDA (recommended for training)
### Running the Inference Server
```bash
uvicorn app:app --host 0.0.0.0 --port 7860
```
The server loads:
- **Base Model:** `Qwen/Qwen3-4B-Instruct-2507`
- **LoRA Adapter:** `iteratehack/battleground-rlaif-qwen-gamehistory-grpo`
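If you need the same checkpoint outside the server (for scripting or debugging), the standard transformers + peft loading pattern looks roughly like this; `app.py` is the authoritative version, and the dtype and `device_map` choices below are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen3-4B-Instruct-2507"
ADAPTER = "iteratehack/battleground-rlaif-qwen-gamehistory-grpo"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,  # assumed dtype; app.py may differ
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)  # attach the LoRA adapter
model.eval()
```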
### API Usage
**POST `/generate_actions`**
```json
{
"phase": "PlayerTurn",
"turn": 5,
"state": {
"game_state": { ... },
"tavern": [ ... ],
"hand": [ ... ],
"board": [ ... ]
},
"max_new_tokens": 256,
"temperature": 0.2
}
```
**Response:**
```json
{
"actions": [
{"type": "BUY_FROM_TAVERN", "tavern_index": 2, "card_name": "Sellemental"},
{"type": "PLAY_FROM_HAND", "hand_index": 0, "board_index": 0},
{"type": "END_TURN"}
],
"raw_completion": "..."
}
```
## Training
### Dataset Format
Training data is stored in JSONL format under `RL/datasets/`:
```json
{
"game_id": "...",
"step_id": 0,
"turn": 3,
"phase": "PlayerTurn",
"state": { ... },
"candidates": [
{"role": "expert", "action": {...}, "reward": 1.0},
{"role": "medium", "action": {...}, "reward": 0.5},
{"role": "bad", "action": {...}, "reward": -0.5}
]
}
```
Here the `expert` role corresponds to human expert actions (the RLHF component), while the other roles provide additional candidates used for RLAIF with GRPO.
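To make the GRPO step concrete: each example's candidates form one group, and the policy update is weighted by group-relative advantages (each reward minus the group mean, normalized by the group standard deviation). Below is a minimal sketch of loading the JSONL and computing those advantages; it illustrates the idea only and is not the repository's implementation:

```python
import json
import statistics

def group_advantages(candidates, eps=1e-6):
    """Group-relative advantages as used by GRPO: (r_i - mean) / std."""
    rewards = [c["reward"] for c in candidates]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

with open("RL/datasets/battleground_rlaif_multicandidate.jsonl") as f:
    for line in f:
        example = json.loads(line)
        advs = group_advantages(example["candidates"])
        # For rewards [1.0, 0.5, -0.5] this yields roughly [1.07, 0.27, -1.34]:
        # the expert action is pushed up, the bad action pushed down.
```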
### Running Training
**SFT + GRPO Pipeline:**
```bash
python RL/train_battleground_rlaif.py \
--model Qwen/Qwen3-4B-Instruct \
--data RL/datasets/battleground_rlaif_multicandidate.jsonl \
--output ./battleground_rlaif_qwen \
--sft_epochs 3 \
--grpo_epochs 3
```
**With Game History Context:**
```bash
python RL/train_battleground_rlaif_gamehistory.py \
--model Qwen/Qwen3-4B-Instruct \
--output ./battleground_rlaif_qwen_gamehistory
```
### Training Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | `Qwen/Qwen3-4B-Instruct` | Base model path |
| `--sft_epochs` | 3 | SFT training epochs |
| `--grpo_epochs` | 3 | GRPO training epochs |
| `--per_device_batch_size` | 4 | Batch size per GPU |
| `--sft_learning_rate` | 1e-5 | SFT learning rate |
| `--grpo_learning_rate` | 5e-6 | GRPO learning rate |
| `--max_seq_length` | 1024 | Maximum sequence length |
| `--skip_sft` | False | Skip SFT phase |
| `--skip_grpo` | False | Skip GRPO phase |
## Docker Deployment
```bash
docker build -t deepbattler-rl .
docker run -p 7860:7860 --gpus all deepbattler-rl
```
For HuggingFace Spaces, the Dockerfile is pre-configured for automatic deployment.
## Action Types
The model outputs JSON action sequences with these action types:
| Action Type | Description |
|-------------|-------------|
| `BUY_FROM_TAVERN` | Purchase a minion from the tavern |
| `PLAY_FROM_HAND` | Play a minion from hand to board |
| `SELL_FROM_BOARD` | Sell a minion from the board |
| `HERO_POWER` | Activate hero power |
| `ROLL` | Refresh the tavern |
| `UPGRADE_TAVERN` | Upgrade tavern tier |
| `FREEZE` | Freeze the current tavern |
| `END_TURN` | End the current turn |
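Downstream consumers should validate the model's output before acting on it. A small, hypothetical validator is sketched below; the allowed types mirror the table above, and the assumption that `raw_completion` parses as a JSON array is mine, not a guarantee of the model's output format:

```python
import json

ALLOWED_TYPES = {
    "BUY_FROM_TAVERN", "PLAY_FROM_HAND", "SELL_FROM_BOARD", "HERO_POWER",
    "ROLL", "UPGRADE_TAVERN", "FREEZE", "END_TURN",
}

def parse_actions(raw_completion: str) -> list[dict]:
    """Parse and validate an action list from the model's raw JSON output."""
    actions = json.loads(raw_completion)
    if not isinstance(actions, list):
        raise ValueError("expected a JSON array of actions")
    for action in actions:
        if action.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"unknown action type: {action.get('type')!r}")
    return actions

# Example
print(parse_actions('[{"type": "ROLL"}, {"type": "END_TURN"}]'))
```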
## Related Components
- **DeepBattlerPlugin/** - C# HDT plugin that extracts game state to JSON
- **Agent/** - Python agents for real-time voice-assisted gameplay (OpenAI/Gemma)
For the full DeepBattler experience with HDT integration, see the main [DeepBattler repository](https://github.com/William-Dic/DeepBattler).
## License
This software is available for personal, educational, and non-commercial use. See the main DeepBattler repository for full license terms.
---