WolfeClick / README.md

Atharva

Fix colorTo in README frontmatter for HF compatibility

fcf731c about 1 month ago


	---
	title: OpenEnv-WolfeClick Environment
	emoji: 🎮
	colorFrom: blue
	colorTo: gray
	sdk: docker
	app_port: 7860
	tags:
	- openenv
	- pokemon
	- rl
	- multi-agent
	---

	# OpenEnv-WolfeClick

	[![HF Space](https://img.shields.io/badge/HF%20Space-Live%20Demo-blue)](https://huggingface.co/spaces/Atharva2099/OpenEnv-WolfeClick)
	[![Model](https://img.shields.io/badge/HF%20Model-Weights-orange)](https://huggingface.co/Atharva2099/openenv-smogon-rl)
	[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://python.org)

	An [OpenEnv](https://github.com/openenv)-compatible environment for training LLMs to play competitive Pokemon Showdown battles using GRPO.

	Competitive Pokemon has hidden information, constrained legal actions, long-term resource tradeoffs, and an active opponent. This repo turns that setting into a trainable RL environment with a `reset()` / `step()` loop, shaped rewards, an OpenEnv server wrapper, and a GRPO training pipeline.

	> [Try the live demo](https://huggingface.co/spaces/Atharva2099/OpenEnv-WolfeClick) to watch a GRPO-trained model play a full battle turn by turn.

	## Quick Start

	```bash
	git clone https://github.com/Atharva2099/OpenEnv-WolfeClick.git
	cd OpenEnv-WolfeClick
	pip install -e .

	# Run a battle with random actions (needs local Pokemon Showdown on port 8000)
	python examples/run_single_episode.py

	# Watch a trained model battle
	python examples/watch_model_battle.py --revision grpo-qwen3-4b-run3
	```

	## Project Structure

	```
	src/smogon_rl/       Core environment: state formatting, action validation,
	                     reward shaping, poke-env client
	env/                 OpenEnv server package (env.server.app:app)
	examples/            Runnable scripts for local battles
	trainer.ipynb        Colab: rollout collection + GRPO training
	watch_battle.ipynb   Colab: run one live watched battle
	benchmarks/          Checkpoint comparison notebook + results
	record_battle.py     Record a battle to JSON for replay
	space_app.py         Gradio HF Space battle viewer
	openenv.yaml         OpenEnv deployment config
	Dockerfile           HF Spaces Docker deployment
	```

	## Environment Design

	Each turn the model receives a structured markdown state:

	| Section | Contents |
	|---|---|
	| Part A: Active Field | Active Pokemon for both sides: HP, status, ability, item, stat modifiers, opponent speed range |
	| Part B: Full Self Roster | All 6 team Pokemon with HP, status, item, and known moves (type + base power) |
	| Part C: Opponent History | Every revealed opponent Pokemon: last known HP, status, moves, items, abilities |
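As a rough illustration of the three-part layout, the sketch below renders a toy state. The `Mon` dataclass, the `format_state` function, and the exact headings and bullet layout are hypothetical; they are not the output of the repo's `state_formatter`, which carries more fields (ability, item, stat modifiers, speed range).

```python
from dataclasses import dataclass, field


@dataclass
class Mon:
    """Hypothetical minimal Pokemon record; the real state carries more fields."""
    name: str
    hp_pct: int
    status: str = "none"
    moves: list[str] = field(default_factory=list)


def describe(mon: Mon) -> str:
    """Render one Pokemon as a markdown bullet."""
    moves = ", ".join(mon.moves) if mon.moves else "unknown"
    return f"- {mon.name}: {mon.hp_pct}% HP, status {mon.status}, moves: {moves}"


def format_state(active: Mon, opp_active: Mon, roster: list[Mon], revealed: list[Mon]) -> str:
    """Assemble the three parts in the order the table above lists them."""
    lines = ["## Part A: Active Field", describe(active), describe(opp_active), ""]
    lines += ["## Part B: Full Self Roster"] + [describe(m) for m in roster] + [""]
    lines += ["## Part C: Opponent History"] + [describe(m) for m in revealed]
    return "\n".join(lines)


if __name__ == "__main__":
    me = Mon("Garchomp", 100, moves=["Earthquake", "Dragon Claw"])
    foe = Mon("Rotom-Wash", 64, status="par")
    print(format_state(me, foe, roster=[me], revealed=[foe]))
```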

	The model outputs one JSON action:

	```json
	{"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
	```

	Up to 4 moves and 5 switches are available per turn. The environment validates the action, executes it in a real Showdown battle, and returns the next state and a shaped reward.
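The validation step can be sketched roughly as follows. `parse_action` and its arguments are hypothetical names, and the case-insensitive name matching is an assumption; the repo's `action_space` module may match names differently.

```python
import json
from typing import Optional


def parse_action(raw: str, legal_moves: list[str], legal_switches: list[str]) -> Optional[dict]:
    """Return a normalized action dict, or None if the model output is illegal."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not JSON at all
    if not isinstance(data, dict):
        return None
    kind = data.get("action")
    choice = data.get("choice")
    if kind not in ("move", "switch") or not isinstance(choice, str):
        return None
    options = legal_moves if kind == "move" else legal_switches
    # Assumption: compare names case-insensitively, then return the canonical spelling.
    by_key = {name.strip().lower(): name for name in options}
    matched = by_key.get(choice.strip().lower())
    if matched is None:
        return None  # hallucinated move or Pokemon
    return {"action": kind, "choice": matched}


if __name__ == "__main__":
    moves, switches = ["Thunderbolt", "Surf"], ["Garchomp"]
    print(parse_action('{"action": "move", "choice": "thunderbolt"}', moves, switches))
    print(parse_action('{"action": "switch", "choice": "Pikachu"}', moves, switches))
```

Returning `None` rather than raising keeps the caller free to decide how an illegal action is penalized.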

	## Reward Shaping

	The reward is dense and tied to battle progress:

	| Component | Signal |
	|---|---|
	| Damage dealt | +1.0 per 10% opponent HP reduced |
	| Damage taken | -1.0 per 10% self HP lost |
	| Knockouts | +3.0 per opponent faint, -3.0 per self faint |
	| Healing | +1.0 per 10% healed (capped 3.0/battle) |
	| Setup | +0.5 per stat stage gained (capped 2.0/mon) |
	| Type effectiveness | +0.5 super effective, -1.0 immune |
	| Illegal action | -10.0 for hallucinated moves/Pokemon |
	| Step penalty | -0.05 per turn (anti-stall) |
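To make the arithmetic concrete, here is a minimal sketch of how these components could combine into one per-turn reward. `TurnDelta` and `RewardTracker` are hypothetical names. The sketch assumes the HP terms scale linearly (so 25% damage gives 2.5), that all components are simply summed, and it omits the setup bonus for brevity; the repo's reward calculator may differ on each point.

```python
from dataclasses import dataclass


@dataclass
class TurnDelta:
    """Hypothetical summary of what changed this turn (HP values are fractions 0.0-1.0)."""
    opp_hp_lost: float = 0.0
    self_hp_lost: float = 0.0
    self_hp_healed: float = 0.0
    opp_faints: int = 0
    self_faints: int = 0
    super_effective: bool = False
    hit_immune: bool = False
    illegal: bool = False


class RewardTracker:
    """Keeps the per-battle healing cap; create one instance per battle."""
    HEALING_CAP = 3.0

    def __init__(self) -> None:
        self.healing_paid = 0.0

    def step(self, d: TurnDelta) -> float:
        reward = -0.05                              # step penalty (anti-stall)
        reward += 10.0 * d.opp_hp_lost              # +1.0 per 10% opponent HP
        reward -= 10.0 * d.self_hp_lost             # -1.0 per 10% self HP
        reward += 3.0 * d.opp_faints - 3.0 * d.self_faints
        # Healing is paid only until the per-battle cap is used up.
        heal = min(10.0 * d.self_hp_healed, self.HEALING_CAP - self.healing_paid)
        self.healing_paid += heal
        reward += heal
        if d.super_effective:
            reward += 0.5
        if d.hit_immune:
            reward -= 1.0
        if d.illegal:
            reward -= 10.0
        return reward


if __name__ == "__main__":
    tracker = RewardTracker()
    # 30% damage dealt, 10% taken, one knockout, super effective hit
    print(round(tracker.step(TurnDelta(0.30, 0.10, opp_faints=1, super_effective=True)), 2))
```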

	## Training Pipeline

	```
	Base Model (Qwen3-4B-Instruct)
	        |
	[JSON Warm-up SFT]      establish legal action baseline
	        |
	[Rollout Collection]    live Pokemon Showdown battles
	        |
	[GRPO Training]         optimize policy on real trajectories
	        |
	LoRA Checkpoint ---> Hugging Face Hub
	```

	1. Start local Pokemon Showdown in Colab
	2. Collect rollout trajectories from live battles
	3. Store prompt, chosen action, and environment reward
	4. Train a LoRA adapter with GRPO on real trajectories
	5. Benchmark checkpoints against each other
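GRPO scores each sampled action relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization on stored records. The `Rollout` record and `group_advantages` function are hypothetical, grouping by identical prompt text is an assumption about how the trajectories are batched, and the population standard deviation is one common choice (implementations vary). In practice the trainer library computes this internally.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One stored step: the prompt shown, the action chosen, the reward received."""
    prompt: str
    action: str
    reward: float


def group_advantages(rollouts: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other rollouts that share its prompt."""
    groups: dict[str, list[int]] = defaultdict(list)
    for i, r in enumerate(rollouts):
        groups[r.prompt].append(i)

    advantages = [0.0] * len(rollouts)
    for indices in groups.values():
        rewards = [rollouts[i].reward for i in indices]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in indices:
            # eps avoids division by zero when every sample got the same reward
            advantages[i] = (rollouts[i].reward - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    data = [
        Rollout("state-1", '{"action": "move", "choice": "Surf"}', 3.0),
        Rollout("state-1", '{"action": "switch", "choice": "Garchomp"}', 1.0),
    ]
    print(group_advantages(data))
```

A group whose samples all received the same reward yields zero advantage for each of them, so it contributes no gradient signal.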

	## Architecture

	```
	Pokemon Showdown (Node.js, port 8000)
	    | WebSocket
	PokeEnvClient (async background loop)
	    |-- RLPlayer (queue-driven)
	    |-- RandomPlayer (opponent)
	    |
	PokemonShowdownEnv (sync wrapper: reset/step)
	    |-- state_formatter   -> markdown state for LLM
	    |-- action_space      -> JSON validation + matching
	    |-- reward calculator -> shaped multi-component reward
	    |
	OpenEnv Server (FastAPI on port 8001)
```
	```
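The sync-over-async layering in the diagram can be illustrated with a small standalone sketch. All names here (`LoopThread`, `ToyAsyncBattle`, `SyncEnv`) are hypothetical, the battle is a stub with no Showdown connection, and the sketch hands coroutines to the background loop with `asyncio.run_coroutine_threadsafe` rather than the queue-driven handoff the repo's `RLPlayer` uses.

```python
import asyncio
import threading


class LoopThread:
    """Runs an asyncio event loop in a daemon thread and accepts blocking calls."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout: float = 30.0):
        # Schedule the coroutine on the background loop and block until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


class ToyAsyncBattle:
    """Stub standing in for the async poke-env side."""

    def __init__(self) -> None:
        self.turn = 0

    async def start(self) -> str:
        self.turn = 0
        return "turn 0: battle started"

    async def apply(self, action: str) -> str:
        await asyncio.sleep(0)  # placeholder for waiting on the server
        self.turn += 1
        return f"turn {self.turn}: applied {action}"


class SyncEnv:
    """Blocking reset()/step() facade over the async battle."""

    def __init__(self) -> None:
        self._bridge = LoopThread()
        self._battle = ToyAsyncBattle()

    def reset(self) -> str:
        return self._bridge.call(self._battle.start())

    def step(self, action: str) -> str:
        return self._bridge.call(self._battle.apply(action))

    def close(self) -> None:
        self._bridge.close()


if __name__ == "__main__":
    env = SyncEnv()
    print(env.reset())
    print(env.step("Surf"))
    env.close()
```

Keeping the event loop on its own thread lets a training notebook or a FastAPI handler call `step()` as an ordinary blocking function.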

	## Trained Checkpoints

	Model repo: [`Atharva2099/openenv-smogon-rl`](https://huggingface.co/Atharva2099/openenv-smogon-rl)

	| Checkpoint | Description |
	|---|---|
	| `grpo-qwen3-4b-run1` | First GRPO training run |
	| `grpo-qwen3-4b-run2` | Second run, tuned reward shaping |
	| `grpo-qwen3-4b-run3` | Third run, best performing |

	## Notebooks

	| Notebook | Purpose |
	|---|---|
	| `trainer.ipynb` | Rollout collection + GRPO training (Colab GPU) |
	| `watch_battle.ipynb` | Run one live watched battle |
	| `benchmarks/benchmark.ipynb` | Compare checkpoint performance |

	## OpenEnv Server

	The environment follows the OpenEnv standard. Config:

	```yaml
	# openenv.yaml
	spec_version: 1
	name: openenv-wolfeclick
	type: space
	runtime: fastapi
	app: env.server.app:app
	port: 8001
	```

	Server package: `env/server/app.py`, `env/server/environment.py`, `env/models.py`

	## HF Spaces Deployment

	The Dockerfile builds a lightweight Gradio app that replays pre-recorded model battles:

	```bash
	docker build -t wolfeclick . && docker run -p 7860:7860 wolfeclick
	```

	## License

	MIT