Atharva committed
Commit · 846472c · 1 Parent(s): ca70d21
Track OpenEnv package and refresh docs

Changed files:

- .gitignore +0 -1
- README.md +85 -137
- env/__init__.py +13 -0
- env/models.py +40 -0
- env/server/__init__.py +2 -0
- env/server/app.py +24 -0
- env/server/environment.py +49 -0
.gitignore
CHANGED

```diff
@@ -14,7 +14,6 @@ build/
 # Virtual environments
 .venv/
 venv/
-env/

 # IDE / OS
 .idea/
```
README.md
CHANGED

````diff
@@ -2,207 +2,155 @@
 title: OpenEnv-WolfeClick Environment
 emoji: 🎮
 colorFrom: blue
-colorTo:
+colorTo: slate
 sdk: docker
 app_port: 8001
 tags:
 - openenv
 - pokemon
 - rl
+- multi-agent
 ---

 # OpenEnv-WolfeClick

-OpenEnv-WolfeClick is
-
-The
-
-
-
-
-
-
-
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl): environment, state formatting, action space, reward shaping, and client code
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb): main Colab notebook for warm-up SFT, rollout collection, and GRPO training
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples): small local examples
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml): package metadata
+OpenEnv-WolfeClick is an OpenEnv-compatible environment for training LLMs in competitive Pokemon Showdown battles.
+
+The core idea is simple: rock-paper-scissors already shows that cyclic matchups create nontrivial reasoning. Competitive Pokemon scales that into a much richer world with hidden information, constrained legal actions, long-term resource tradeoffs, and an active opponent. This repo turns that setting into a trainable environment with a clean `reset()` / `step()` loop, an OpenEnv server wrapper, and a Colab GRPO training workflow.
+
+## What is here
+
+- `src/smogon_rl/`
+  - core environment logic, state formatting, action validation, reward shaping, and the poke-env client
+- `env/`
+  - OpenEnv server package exposing the environment through `env.server.app:app`
+- `trainer.ipynb`
+  - Colab notebook for rollout collection and GRPO training
+- `watch_battle.ipynb`
+  - Colab notebook for running one live watched battle
+- `benckmarks/benchmark.ipynb`
+  - quick checkpoint-vs-checkpoint benchmark notebook
+- `openenv.yaml`
+  - OpenEnv entrypoint config
+- `Dockerfile`
+  - HF Spaces / Docker deployment path for the OpenEnv server

 ## Environment design

-The state is not a raw simulator dump. It is a structured markdown representation designed to preserve strategic information while remaining readable to an LLM.
-
-Each prompt includes:
+Each turn, the model receives a structured markdown state containing:

 - active self Pokemon
 - active opponent Pokemon
-- HP, status,
-- full self
-- opponent history
-- exact legal actions
+- HP, status, item, ability, and stat modifiers
+- full self roster and currently known moves
+- revealed opponent history
+- the exact legal actions for the turn

-This is implemented through the environment wrapper and state formatter:
-
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py)
-
-My design goal was to expose enough information for strategic decisions without giving the model shortcuts that bypass the game structure.
-
-### Action design
-
-The action space is deliberately constrained.
-
-The model must emit exactly one JSON object:
+The model must output exactly one JSON action:

 ```json
 {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
 ```

-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py)
-
-This module does three important things:
-
-- enumerates legal moves and switches for the turn
-- builds the action instruction block shown to the model
-- validates model outputs against the legal action set
+This keeps the interface concrete and legally grounded. The environment validates the action, executes it in a real Showdown battle, and returns the next state, reward, and episode metadata.

-Reward computation lives in:
-
-- [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py)
-
-The reward includes:
-
-- damage dealt to the opponent
-- damage taken by the agent
+## Reward
+
+Rewards are shaped, but still tied to battle progress. The signal includes:
+
+- damage dealt and damage taken
 - knockouts and faint penalties
 - healing value
 - setup value and opponent setup penalties
-- passive damage
-- status penalties
-
-The environment wrapper in [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py) adds practical rollout constraints:
-
-- illegal action fallback handling
+- passive damage and status effects
 - illegal action penalties
-- anti-stall
-- battle length caps
-- no-progress termination penalties
-
-This separation is intentional:
-
-- `reward.py` captures battle-quality shaping
-- the env wrapper handles rollout hygiene and training throughput
+- small anti-stall / truncation penalties

-##
-
-The
-
-- `collected_action`
-- `collected_reward`
-
-The
-
-- valid but different actions are penalized lightly
-- the action matching the executed rollout action receives the collected environment reward plus a positive margin
-
-4. Enumerate legal actions.
-5. Generate one JSON action from the model.
-6. Execute the action in the environment.
-7. Receive next state, reward, done flag, and info.
-8. Store rollout rows.
-9. Train with GRPO on the collected rows.
+The goal is to create a denser learning signal without turning the task into a toy proxy objective.
+
+## Training workflow
+
+The training path is:
+
+1. start local Pokemon Showdown in Colab
+2. collect real rollout trajectories from live battles
+3. store prompt, chosen action, and environment reward
+4. train a LoRA adapter with GRPO on those real trajectories
+5. benchmark checkpoints against each other on the same env budget
+
+The repo includes both the training notebook and a smaller watch notebook for running one live battle with a chosen checkpoint.
+
+## OpenEnv package
+
+Yes, this repo includes a real OpenEnv environment package.
+
+The deployable server lives at:
+
+- `env/server/app.py`
+- `env/server/environment.py`
+- `env/models.py`
+
+The OpenEnv config points to:
+
+```yaml
+app: env.server.app:app
+```
+
+That package wraps the local `PokemonShowdownEnv` and exposes it through the OpenEnv server interface.
+
+## Local usage
+
+Install the local package:
+
+```bash
+python3 -m pip install -e .
+```
+
+Run a simple local episode:
+
+```bash
+python3 examples/run_single_episode.py
+```
+
+Run one watched model battle:
+
 ```bash
-python3
+python3 examples/watch_model_battle.py --revision grpo-qwen3-4b-run2
 ```

-##
-
-2. installs the training stack
-3. loads the model and LoRA adapter
-4. starts a local Pokemon Showdown server
-5. runs JSON warm-up SFT
-6. collects rollout data from real battles
-7. trains with GRPO
-8. optionally saves the adapter to Hugging Face Hub
-
-- valid action JSON is produced reliably after warm-up
-- GRPO can train on real rollout data in the non-quantized plain TRL path
-
-##
-
-This
-
-- Hugging Face Space using OpenEnv stable release `0.2.1`
-- benchmark/results file
-- 1-minute demo video
+## Colab usage
+
+- `trainer.ipynb`: collect rollouts and train with GRPO
+- `watch_battle.ipynb`: start Showdown, load a checkpoint, and run one live battle
+- `benckmarks/benchmark.ipynb`: compare checkpoints quickly
+
+These notebooks assume a GPU runtime for model inference/training.
+
+## Deployment
+
+The repo includes the files needed for an OpenEnv-style deployment:
+
+- `openenv.yaml`
+- `Dockerfile`
+- `env/` package
+
+The Docker image starts:
+
+- local Pokemon Showdown on port `8000`
+- the OpenEnv FastAPI server on port `8001`
+
+## Current artifacts
+
+- HF model repo: `Atharva2099/openenv-smogon-rl`
+- adapter revisions: `grpo-qwen3-4b-run1`, `grpo-qwen3-4b-run2`
+
+## Status
+
+This is a working end-to-end environment and training repo:
+
+- live battle rollouts work
+- GRPO training on real trajectories works
+- checkpoint benchmarking works
+- the OpenEnv server package exists in-repo
+
+The next useful polish steps are HF Spaces deployment validation, README/blog cleanup, and a short write-up/demo around the environment design and results.
````
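The README constrains the model to one JSON action per turn, validated against the turn's legal action set. As an illustration of that check, here is a minimal stdlib sketch; the legal sets and the `validate_action` helper are invented for the example and are not the repo's actual `action_space.py` API.

```python
import json

# Example legal action sets for one hypothetical turn (illustrative only).
LEGAL_MOVES = {"Thunderbolt", "Volt Switch"}
LEGAL_SWITCHES = {"Garchomp", "Rotom-Wash"}


def validate_action(raw: str) -> dict:
    """Parse one model output and check it against the turn's legal actions."""
    obj = json.loads(raw)
    if set(obj) != {"action", "choice"}:
        raise ValueError("action JSON must have exactly 'action' and 'choice' keys")
    if obj["action"] == "move" and obj["choice"] in LEGAL_MOVES:
        return obj
    if obj["action"] == "switch" and obj["choice"] in LEGAL_SWITCHES:
        return obj
    raise ValueError(f"illegal action: {raw}")


print(validate_action('{"action": "move", "choice": "Thunderbolt"}'))
```

In the real environment an illegal action instead triggers the fallback handling and penalty described above, rather than an exception.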
env/__init__.py
ADDED

@@ -0,0 +1,13 @@

```python
"""
OpenEnv-compatible environment package for the WolfeClick Pokemon RL env.

Exports the model types so users can do:

    from env import WolfeClickAction, WolfeClickObservation, WolfeClickState
"""

from .models import WolfeClickAction, WolfeClickObservation, WolfeClickState

__all__ = ["WolfeClickAction", "WolfeClickObservation", "WolfeClickState"]
```
env/models.py
ADDED

@@ -0,0 +1,40 @@

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Dict

from openenv.core.env_server import Action, Observation, State


@dataclass
class WolfeClickAction(Action):
    """Single step action for the environment.

    This wraps the constrained JSON interface already used by the local env:

        {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
    """

    action_json: str


@dataclass
class WolfeClickObservation(Observation):
    """Markdown battle state plus metadata."""

    state_markdown: str
    # Whether the episode is finished (battle over or truncated).
    done: bool = False
    # Shaped reward from the environment.
    reward: float = 0.0
    # Free-form metadata mirrored from the underlying env's info dict.
    metadata: Dict[str, Any] | None = None


class WolfeClickState(State):
    """Thin wrapper around the core State model."""

    # We rely on the base State fields (episode_id, step_count).
    # Any extra per-episode bookkeeping can live on the environment itself.
    pass
```
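The models above carry the constrained action as a JSON string (`action_json`) and mirror the env's `(state, reward, done, info)` tuple in the observation fields. A runnable sketch of that shape, using plain dataclasses as stand-ins for the `openenv` base classes (so no openenv install is needed; `DemoAction` and `DemoObservation` are illustrative names, not the repo's classes):

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DemoAction:
    # The constrained JSON action, serialized as a string.
    action_json: str


@dataclass
class DemoObservation:
    # Markdown battle state plus the step outcome, mirroring (state, reward, done, info).
    state_markdown: str
    done: bool = False
    reward: float = 0.0
    metadata: Optional[Dict[str, Any]] = None


# Build one constrained action and confirm it round-trips through the wrapper.
payload = {"action": "switch", "choice": "Rotom-Wash"}
act = DemoAction(action_json=json.dumps(payload))
assert json.loads(act.action_json) == payload

obs = DemoObservation(state_markdown="## Turn 1\n...", reward=0.25, metadata={"turn": 1})
print(obs.done, obs.reward)
```

Keeping the action as an opaque JSON string means the server models stay decoupled from the Pokemon-specific schema, which lives in the local env's validator.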
env/server/__init__.py
ADDED

@@ -0,0 +1,2 @@

```python
"""Server package for the WolfeClick OpenEnv environment."""
```
env/server/app.py
ADDED

@@ -0,0 +1,24 @@

```python
from __future__ import annotations

from openenv.core.env_server import create_app

from env.models import WolfeClickAction, WolfeClickObservation
from env.server.environment import WolfeClickEnvironment

app = create_app(
    WolfeClickEnvironment,
    WolfeClickAction,
    WolfeClickObservation,
    env_name="openenv-wolfeclick",
)


def main() -> None:
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8001)


if __name__ == "__main__":
    main()
```
env/server/environment.py
ADDED

@@ -0,0 +1,49 @@

```python
from __future__ import annotations

import uuid
from typing import Any, Dict

from openenv.core.env_server import Environment

from env.models import WolfeClickAction, WolfeClickObservation, WolfeClickState
from smogon_rl.config import EnvConfig
from smogon_rl.openenv_sync_env import PokemonShowdownEnv


class WolfeClickEnvironment(Environment[WolfeClickAction, WolfeClickObservation, WolfeClickState]):
    """OpenEnv server wrapper around the local PokemonShowdownEnv."""

    def __init__(self) -> None:
        super().__init__()
        self._env = PokemonShowdownEnv(config=EnvConfig())
        # Underlying State tracks episode_id and step_count; we keep a separate battle counter.
        self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
        self._battle_index: int = 0

    def reset(self, **kwargs: Any) -> WolfeClickObservation:
        """Start a new battle and return the initial observation."""
        self._battle_index += 1
        self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
        state_str = self._env.reset()
        return WolfeClickObservation(
            state_markdown=state_str,
            done=False,
            reward=0.0,
            metadata={"battle_index": self._battle_index},
        )

    def step(self, action: WolfeClickAction, **kwargs: Any) -> WolfeClickObservation:
        """Apply one JSON action and return the next observation."""
        self._state.step_count += 1  # type: ignore[attr-defined]
        obs_str, reward, done, info = self._env.step(action.action_json)
        return WolfeClickObservation(
            state_markdown=obs_str,
            done=bool(done),
            reward=float(reward),
            metadata=info or {},
        )

    @property
    def state(self) -> WolfeClickState:
        return self._state
```
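The wrapper above follows a plain reset/step contract: `reset()` starts a battle and returns an initial observation, and `step()` applies one JSON action and maps the local env's 4-tuple into an observation. A self-contained sketch of that loop, with an invented two-turn stub standing in for `PokemonShowdownEnv` (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class Obs:
    state_markdown: str
    done: bool = False
    reward: float = 0.0
    metadata: Optional[Dict[str, Any]] = None


class StubShowdownEnv:
    """Tiny stand-in for PokemonShowdownEnv: every battle ends after two steps."""

    def reset(self) -> str:
        self.turn = 0
        return "## Turn 0"

    def step(self, action_json: str) -> Tuple[str, float, bool, Dict[str, Any]]:
        self.turn += 1
        return f"## Turn {self.turn}", 0.1, self.turn >= 2, {"turn": self.turn}


class DemoEnvironment:
    """Mirrors the WolfeClickEnvironment reset/step flow around a local env."""

    def __init__(self) -> None:
        self._env = StubShowdownEnv()
        self._battle_index = 0

    def reset(self) -> Obs:
        self._battle_index += 1
        return Obs(
            state_markdown=self._env.reset(),
            metadata={"battle_index": self._battle_index},
        )

    def step(self, action_json: str) -> Obs:
        s, r, d, info = self._env.step(action_json)
        return Obs(state_markdown=s, done=bool(d), reward=float(r), metadata=info or {})


env = DemoEnvironment()
obs = env.reset()
while not obs.done:
    obs = env.step('{"action": "move", "choice": "Thunderbolt"}')
print(obs.metadata)  # prints {'turn': 2}
```

The same loop shape is what a training client drives against the served environment, with the model supplying the JSON action each turn.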