Atharva committed on
Commit 846472c · 1 Parent(s): ca70d21

Track OpenEnv package and refresh docs
.gitignore CHANGED
@@ -14,7 +14,6 @@ build/
  # Virtual environments
  .venv/
  venv/
- env/

  # IDE / OS
  .idea/
README.md CHANGED
@@ -2,207 +2,155 @@
  title: OpenEnv-WolfeClick Environment
  emoji: 🎮
  colorFrom: blue
- colorTo: purple
+ colorTo: slate
  sdk: docker
  app_port: 8001
  tags:
  - openenv
  - pokemon
  - rl
+ - multi-agent
  ---

  # OpenEnv-WolfeClick

- OpenEnv-WolfeClick is a reinforcement learning environment and training workflow for competitive Pokemon battles with large language models.
-
- The project was built for the OpenEnv hackathon to answer a specific question: can an LLM learn to act in a partially observable, adversarial, long-horizon environment where legal actions are constrained, rewards are delayed, and the opponent is another agent?
-
- This repo focuses on that environment and a minimal Colab training path.
-
- ## Why I Built This
-
- Pokemon battles are a strong multi-agent training environment for LLMs because they require:
-
- - hidden information and opponent modeling
- - long-horizon planning over many turns
- - legal action grounding under a constrained action space
- - adapting to a changing world state after every action
- - balancing local rewards against later consequences
-
- I built this environment to make those properties trainable with a simple `reset()` / `step()` loop and a small JSON action interface.
-
- ## What is in this repo
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl): environment, state formatting, action space, reward shaping, and client code
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb): main Colab notebook for warm-up SFT, rollout collection, and GRPO training
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples): small local examples
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml): package metadata
+ OpenEnv-WolfeClick is an OpenEnv-compatible environment for training LLMs in competitive Pokemon Showdown battles.
+
+ The core idea is simple: rock-paper-scissors already shows that cyclic matchups create nontrivial reasoning. Competitive Pokemon scales that into a much richer world with hidden information, constrained legal actions, long-term resource tradeoffs, and an active opponent. This repo turns that setting into a trainable environment with a clean `reset()` / `step()` loop, an OpenEnv server wrapper, and a Colab GRPO training workflow.
+
+ ## What is here
+
+ - `src/smogon_rl/`
+   - core environment logic, state formatting, action validation, reward shaping, and the poke-env client
+ - `env/`
+   - OpenEnv server package exposing the environment through `env.server.app:app`
+ - `trainer.ipynb`
+   - Colab notebook for rollout collection and GRPO training
+ - `watch_battle.ipynb`
+   - Colab notebook for running one live watched battle
+ - `benckmarks/benchmark.ipynb`
+   - quick checkpoint-vs-checkpoint benchmark notebook
+ - `openenv.yaml`
+   - OpenEnv entrypoint config
+ - `Dockerfile`
+   - HF Spaces / Docker deployment path for the OpenEnv server

  ## Environment design

- ### State design
-
- The state is not a raw simulator dump. It is a structured markdown representation designed to preserve strategic information while remaining readable to an LLM.
-
- Each prompt includes:
+ Each turn, the model receives a structured markdown state containing:

  - active self Pokemon
  - active opponent Pokemon
- - HP, status, ability, item, and current stat modifiers
- - full self team roster with currently known moves
- - opponent history and revealed information
- - exact legal actions available this turn
+ - HP, status, item, ability, and stat modifiers
+ - full self roster and currently known moves
+ - revealed opponent history
+ - the exact legal actions for the turn

- This is implemented through the environment wrapper and state formatter:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py)
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py)
-
- My design goal was to expose enough information for strategic decisions without giving the model shortcuts that bypass the game structure.
-
- ### Action design
-
- The action space is deliberately constrained.
-
- The model must emit exactly one JSON object:
+ The model must output exactly one JSON action:

  ```json
  {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
  ```

- At every step, legal actions are enumerated from the current battle state using:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py)
-
- This module does three important things:
-
- - enumerates legal moves and switches for the turn
- - builds the action instruction block shown to the model
- - validates model outputs against the legal action set
-
- This matters because I do not want the model to “sort of” describe an action. I want the environment to enforce a concrete legal interface.
+ This keeps the interface concrete and legally grounded. The environment validates the action, executes it in a real Showdown battle, and returns the next state, reward, and episode metadata.

- ### Reward design
+ ## Reward

- The environment reward is shaped but still tied to battle outcomes.
-
- Reward computation lives in:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py)
-
- The reward includes:
-
- - damage dealt to the opponent
- - damage taken by the agent
+ Rewards are shaped, but still tied to battle progress. The signal includes:
+
+ - damage dealt and damage taken
  - knockouts and faint penalties
  - healing value
  - setup value and opponent setup penalties
- - passive damage value
- - status penalties
+ - passive damage and status effects
-
- The environment wrapper in [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py) adds practical rollout constraints:
-
- - illegal action fallback handling
  - illegal action penalties
- - anti-stall living penalty
- - battle length caps
- - no-progress termination penalties
+ - small anti-stall / truncation penalties
-
- This separation is intentional:
-
- - `reward.py` captures battle-quality shaping
- - the env wrapper handles rollout hygiene and training throughput

- ## Training design
-
- ### 1. Warm-up SFT
-
- The notebook begins with a supervised warm-up stage so the model learns to emit valid action JSON for the battle-state prompt format.
-
- This does not claim strategic mastery. It only ensures the model is good enough to participate in the environment without collapsing into malformed outputs.
-
- ### 2. Real rollout collection
-
- The policy is then run in real Pokemon Showdown battles. For each turn, the notebook stores:
-
- - `prompt`
- - `collected_action`
- - `collected_reward`
-
- This makes the rollout data usable for GRPO training while preserving the exact environment reward signal.
-
- ### 3. GRPO training
-
- The GRPO reward used in the notebook is a wrapper around the stored rollout reward.
-
- It is designed to preserve ranking pressure inside a completion group:
-
- - malformed output is penalized strongly
- - valid but different actions are penalized lightly
- - the action matching the executed rollout action receives the collected environment reward plus a positive margin
-
- That matters because raw rollout rewards alone do not always create a clean learning signal for group-relative optimization.
-
- ## How it works end to end
-
- 1. Start Pokemon Showdown locally in Colab.
- 2. Create the OpenEnv-style synchronous environment.
- 3. Format battle state into markdown.
- 4. Enumerate legal actions.
- 5. Generate one JSON action from the model.
- 6. Execute the action in the environment.
- 7. Receive next state, reward, done flag, and info.
- 8. Store rollout rows.
- 9. Train with GRPO on the collected rows.
-
- ## How to use
-
- ### Local package install
-
- From the repo root:
-
+ The goal is to create a denser learning signal without turning the task into a toy proxy objective.
+
+ ## Training workflow
+
+ The training path is:
+
+ 1. start local Pokemon Showdown in Colab
+ 2. collect real rollout trajectories from live battles
+ 3. store prompt, chosen action, and environment reward
+ 4. train a LoRA adapter with GRPO on those real trajectories
+ 5. benchmark checkpoints against each other on the same env budget
+
+ The repo includes both the training notebook and a smaller watch notebook for running one live battle with a chosen checkpoint.
+
+ ## OpenEnv package
+
+ Yes, this repo includes a real OpenEnv environment package.
+
+ The deployable server lives at:
+
+ - `env/server/app.py`
+ - `env/server/environment.py`
+ - `env/models.py`
+
+ The OpenEnv config points to:
+
+ ```yaml
+ app: env.server.app:app
+ ```
+
+ That package wraps the local `PokemonShowdownEnv` and exposes it through the OpenEnv server interface.
+
+ ## Local usage
+
+ Install the local package:
+
+ ```bash
+ python3 -m pip install -e .
+ ```
+
+ Run a simple local episode:
+
+ ```bash
+ python3 examples/run_single_episode.py
+ ```
+
+ Run one watched model battle:
+
  ```bash
- python3 -m pip install -e .
+ python3 examples/watch_model_battle.py --revision grpo-qwen3-4b-run2
  ```

- ### Colab training
-
- Open [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb) in Colab and run it top to bottom.
-
- The notebook does the following:
-
- 1. clones or uses the repo
- 2. installs the training stack
- 3. loads the model and LoRA adapter
- 4. starts a local Pokemon Showdown server
- 5. runs JSON warm-up SFT
- 6. collects rollout data from real battles
- 7. trains with GRPO
- 8. optionally saves the adapter to Hugging Face Hub
-
- ### Requirements
-
- - GPU runtime in Colab
- - local Pokemon Showdown server started from the notebook
- - Hugging Face token only if you want to push adapters
-
- ## Current status
-
- This repo now has a working end-to-end path where:
-
- - real battle rollouts are collected from the environment
- - valid action JSON is produced reliably after warm-up
- - GRPO can train on real rollout data in the non-quantized plain TRL path
-
- This is the basis for my hackathon demo and benchmark runs.
-
- ## Submission notes
-
- This repo is intended to be my clean hackathon submission repo.
-
- Linked artifacts to add before submission:
-
- - Hugging Face model repo: [https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1](https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1)
- - Hugging Face Space using OpenEnv stable release `0.2.1`
- - benchmark/results file
- - 1-minute demo video
+ ## Colab usage
+
+ - `trainer.ipynb`: collect rollouts and train with GRPO
+ - `watch_battle.ipynb`: start Showdown, load a checkpoint, and run one live battle
+ - `benckmarks/benchmark.ipynb`: compare checkpoints quickly
+
+ These notebooks assume a GPU runtime for model inference/training.
+
+ ## Deployment
+
+ The repo includes the files needed for an OpenEnv-style deployment:
+
+ - `openenv.yaml`
+ - `Dockerfile`
+ - `env/` package
+
+ The Docker image starts:
+
+ - local Pokemon Showdown on port `8000`
+ - the OpenEnv FastAPI server on port `8001`
+
+ ## Current artifacts
+
+ - HF model repo: `Atharva2099/openenv-smogon-rl`
+ - adapter revisions: `grpo-qwen3-4b-run1`, `grpo-qwen3-4b-run2`
+
+ ## Status
+
+ This is a working end-to-end environment and training repo:
+
+ - live battle rollouts work
+ - GRPO training on real trajectories works
+ - checkpoint benchmarking works
+ - the OpenEnv server package exists in-repo
+
+ The next useful polish steps are HF Spaces deployment validation, README/blog cleanup, and a short write-up/demo around the environment design and results.
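The README above describes a constrained JSON action interface that is validated against the turn's legal actions. A standalone sketch of that validation step — the real logic lives in `src/smogon_rl/action_space.py` and enumerates the legal set from the live battle state; the set here is hypothetical:

```python
import json

# Hypothetical per-turn legal action set; in the real env this is
# enumerated from the current battle state each turn.
LEGAL_ACTIONS = {
    "move": {"Thunderbolt", "Volt Switch"},
    "switch": {"Garchomp"},
}

def validate(raw: str) -> tuple[bool, str]:
    """Parse a model completion and check it against the legal set."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    kind = obj.get("action")
    if kind not in LEGAL_ACTIONS:
        return False, "unknown action type"
    if obj.get("choice") not in LEGAL_ACTIONS[kind]:
        return False, "choice not legal this turn"
    return True, "ok"

print(validate('{"action": "move", "choice": "Thunderbolt"}'))  # (True, 'ok')
print(validate('{"action": "move", "choice": "Surf"}'))  # (False, 'choice not legal this turn')
```

Forcing every completion through a check like this is what lets the env enforce a concrete legal interface instead of accepting free-form action descriptions.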
 
 
 
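The training section removed by this commit describes a GRPO reward wrapper: malformed output penalized strongly, valid-but-different actions penalized lightly, and the executed rollout action credited with the collected environment reward plus a positive margin. A minimal sketch of that scheme, with illustrative constants rather than the notebook's actual values:

```python
import json

# Illustrative constants, not the notebook's actual values.
MALFORMED_PENALTY = -2.0
MISMATCH_PENALTY = -0.1
MATCH_MARGIN = 0.5

def grpo_reward(completion: str, rollout_action: str, collected_reward: float) -> float:
    """Score one completion in a GRPO group against the stored rollout row."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return MALFORMED_PENALTY  # malformed output: strong penalty
    if action.get("action") not in ("move", "switch") or "choice" not in action:
        return MALFORMED_PENALTY  # structurally invalid action
    if json.loads(rollout_action) == action:
        # The executed action keeps its environment reward plus a margin,
        # so it ranks above merely-valid alternatives in the group.
        return collected_reward + MATCH_MARGIN
    return MISMATCH_PENALTY  # valid but different action: light penalty
```

The margin preserves ranking pressure inside a completion group even when the raw collected reward is small or negative.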
env/__init__.py ADDED
@@ -0,0 +1,13 @@
+ """
+ OpenEnv-compatible environment package for the WolfeClick Pokemon RL env.
+
+ Exports the model types so users can do:
+
+     from env import WolfeClickAction, WolfeClickObservation, WolfeClickState
+ """
+
+ from .models import WolfeClickAction, WolfeClickObservation, WolfeClickState
+
+ __all__ = ["WolfeClickAction", "WolfeClickObservation", "WolfeClickState"]
env/models.py ADDED
@@ -0,0 +1,40 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Any, Dict
+
+ from openenv.core.env_server import Action, Observation, State
+
+
+ @dataclass
+ class WolfeClickAction(Action):
+     """Single step action for the environment.
+
+     This wraps the constrained JSON interface already used by the local env:
+
+         {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
+     """
+
+     action_json: str
+
+
+ @dataclass
+ class WolfeClickObservation(Observation):
+     """Markdown battle state plus metadata."""
+
+     state_markdown: str
+     # Whether the episode is finished (battle over or truncated).
+     done: bool = False
+     # Shaped reward from the environment.
+     reward: float = 0.0
+     # Free-form metadata mirrored from the underlying env's info dict.
+     metadata: Dict[str, Any] | None = None
+
+
+ class WolfeClickState(State):
+     """Thin wrapper around the core State model."""
+
+     # We rely on the base State fields (episode_id, step_count).
+     # Any extra per-episode bookkeeping can live on the environment itself.
+     pass
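A quick stdlib-only illustration of how these dataclasses carry data. The sketch re-declares minimal stand-ins instead of importing `openenv`, so the base-class behavior of `Action`/`Observation` is not reproduced:

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any, Dict

# Minimal stand-ins for the real models, which subclass openenv's
# Action / Observation base types (not imported here).
@dataclass
class ActionSketch:
    action_json: str

@dataclass
class ObservationSketch:
    state_markdown: str
    done: bool = False
    reward: float = 0.0
    metadata: Dict[str, Any] | None = None

# The action wraps the raw JSON string the model emitted.
action = ActionSketch(json.dumps({"action": "switch", "choice": "Garchomp"}))
# The observation mirrors what the env wrapper returns each step.
obs = ObservationSketch(state_markdown="## Turn 1", reward=0.25,
                        metadata={"battle_index": 1})

print(json.loads(action.action_json)["choice"])  # Garchomp
print(asdict(obs)["done"])  # False
```

Keeping the action as a raw JSON string (rather than parsed fields) lets the server pass the model output through to the env's existing validation unchanged.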
env/server/__init__.py ADDED
@@ -0,0 +1,2 @@
+ """Server package for the WolfeClick OpenEnv environment."""
env/server/app.py ADDED
@@ -0,0 +1,24 @@
+ from __future__ import annotations
+
+ from openenv.core.env_server import create_app
+
+ from env.models import WolfeClickAction, WolfeClickObservation
+ from env.server.environment import WolfeClickEnvironment
+
+ app = create_app(
+     WolfeClickEnvironment,
+     WolfeClickAction,
+     WolfeClickObservation,
+     env_name="openenv-wolfeclick",
+ )
+
+
+ def main() -> None:
+     import uvicorn
+
+     uvicorn.run(app, host="0.0.0.0", port=8001)
+
+
+ if __name__ == "__main__":
+     main()
env/server/environment.py ADDED
@@ -0,0 +1,49 @@
+ from __future__ import annotations
+
+ import uuid
+ from typing import Any, Dict
+
+ from openenv.core.env_server import Environment
+
+ from env.models import WolfeClickAction, WolfeClickObservation, WolfeClickState
+ from smogon_rl.config import EnvConfig
+ from smogon_rl.openenv_sync_env import PokemonShowdownEnv
+
+
+ class WolfeClickEnvironment(Environment[WolfeClickAction, WolfeClickObservation, WolfeClickState]):
+     """OpenEnv server wrapper around the local PokemonShowdownEnv."""
+
+     def __init__(self) -> None:
+         super().__init__()
+         self._env = PokemonShowdownEnv(config=EnvConfig())
+         # Underlying State tracks episode_id and step_count; we keep a separate battle counter.
+         self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
+         self._battle_index: int = 0
+
+     def reset(self, **kwargs: Any) -> WolfeClickObservation:
+         """Start a new battle and return the initial observation."""
+         self._battle_index += 1
+         self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
+         state_str = self._env.reset()
+         return WolfeClickObservation(
+             state_markdown=state_str,
+             done=False,
+             reward=0.0,
+             metadata={"battle_index": self._battle_index},
+         )
+
+     def step(self, action: WolfeClickAction, **kwargs: Any) -> WolfeClickObservation:
+         """Apply one JSON action and return the next observation."""
+         self._state.step_count += 1  # type: ignore[attr-defined]
+         obs_str, reward, done, info = self._env.step(action.action_json)
+         return WolfeClickObservation(
+             state_markdown=obs_str,
+             done=bool(done),
+             reward=float(reward),
+             metadata=info or {},
+         )
+
+     @property
+     def state(self) -> WolfeClickState:
+         return self._state
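The wrapper above ultimately drives a plain `reset()` / `step()` loop. A self-contained sketch of that loop, with a stub standing in for `PokemonShowdownEnv` (which needs a live Showdown server); the stub's rewards and termination rule are invented:

```python
import json

class StubEnv:
    """Stand-in for PokemonShowdownEnv: same step() return shape
    (obs, reward, done, info), but no Showdown server required."""

    def __init__(self) -> None:
        self.turn = 0

    def reset(self) -> str:
        self.turn = 0
        return "## Turn 0\nLegal: move Tackle"

    def step(self, action_json: str):
        # A real env would validate the JSON action against the legal set here.
        assert json.loads(action_json)["action"] in ("move", "switch")
        self.turn += 1
        done = self.turn >= 3  # invented termination rule
        return f"## Turn {self.turn}", 0.1, done, {"turn": self.turn}

env = StubEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    # In training, the policy generates this JSON from the markdown state.
    action = json.dumps({"action": "move", "choice": "Tackle"})
    obs, reward, done, info = env.step(action)
    total += reward

print(round(total, 1))  # 0.3
```

Each loop iteration corresponds to one rollout row (prompt, chosen action, reward) that the trainer later consumes.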