Atharva committed on
Commit 846472c · 1 Parent(s): ca70d21

Track OpenEnv package and refresh docs
.gitignore CHANGED
@@ -14,7 +14,6 @@ build/
  # Virtual environments
  .venv/
  venv/
- env/

  # IDE / OS
  .idea/
README.md CHANGED
@@ -2,207 +2,155 @@
  title: OpenEnv-WolfeClick Environment
  emoji: 🎮
  colorFrom: blue
- colorTo: purple
+ colorTo: slate
  sdk: docker
  app_port: 8001
  tags:
  - openenv
  - pokemon
  - rl
+ - multi-agent
  ---

  # OpenEnv-WolfeClick

- OpenEnv-WolfeClick is a reinforcement learning environment and training workflow for competitive Pokemon battles with large language models.
-
- The project was built for the OpenEnv hackathon to answer a specific question: can an LLM learn to act in a partially observable, adversarial, long-horizon environment where legal actions are constrained, rewards are delayed, and the opponent is another agent?
-
- This repo focuses on that environment and a minimal Colab training path.
-
- ## Why I Built This
-
- Pokemon battles are a strong multi-agent training environment for LLMs because they require:
-
- - hidden information and opponent modeling
- - long-horizon planning over many turns
- - legal action grounding under a constrained action space
- - adapting to a changing world state after every action
- - balancing local rewards against later consequences
-
- I built this environment to make those properties trainable with a simple `reset()` / `step()` loop and a small JSON action interface.
-
- ## What is in this repo
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl): environment, state formatting, action space, reward shaping, and client code
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb): main Colab notebook for warm-up SFT, rollout collection, and GRPO training
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/examples): small local examples
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/pyproject.toml): package metadata
+ OpenEnv-WolfeClick is an OpenEnv-compatible environment for training LLMs in competitive Pokemon Showdown battles.
+
+ The core idea is simple: rock-paper-scissors already shows that cyclic matchups create nontrivial reasoning. Competitive Pokemon scales that into a much richer world with hidden information, constrained legal actions, long-term resource tradeoffs, and an active opponent. This repo turns that setting into a trainable environment with a clean `reset()` / `step()` loop, an OpenEnv server wrapper, and a Colab GRPO training workflow.
+
+ ## What is here
+
+ - `src/smogon_rl/`
+   - core environment logic, state formatting, action validation, reward shaping, and the poke-env client
+ - `env/`
+   - OpenEnv server package exposing the environment through `env.server.app:app`
+ - `trainer.ipynb`
+   - Colab notebook for rollout collection and GRPO training
+ - `watch_battle.ipynb`
+   - Colab notebook for running one live watched battle
+ - `benckmarks/benchmark.ipynb`
+   - quick checkpoint-vs-checkpoint benchmark notebook
+ - `openenv.yaml`
+   - OpenEnv entrypoint config
+ - `Dockerfile`
+   - HF Spaces / Docker deployment path for the OpenEnv server

  ## Environment design

- ### State design
-
- The state is not a raw simulator dump. It is a structured markdown representation designed to preserve strategic information while remaining readable to an LLM.
-
- Each prompt includes:
+ Each turn, the model receives a structured markdown state containing:

  - active self Pokemon
  - active opponent Pokemon
- - HP, status, ability, item, and current stat modifiers
- - full self team roster with currently known moves
- - opponent history and revealed information
- - exact legal actions available this turn
+ - HP, status, item, ability, and stat modifiers
+ - full self roster and currently known moves
+ - revealed opponent history
+ - the exact legal actions for the turn

- This is implemented through the environment wrapper and state formatter:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py)
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/state_formatter.py)
-
- My design goal was to expose enough information for strategic decisions without giving the model shortcuts that bypass the game structure.
-
- ### Action design
-
- The action space is deliberately constrained.
-
- The model must emit exactly one JSON object:
+ The model must output exactly one JSON action:

  ```json
  {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
  ```

- At every step, legal actions are enumerated from the current battle state using:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/action_space.py)
-
- This module does three important things:
-
- - enumerates legal moves and switches for the turn
- - builds the action instruction block shown to the model
- - validates model outputs against the legal action set
-
- This matters because I do not want the model to “sort of” describe an action. I want the environment to enforce a concrete legal interface.
+ This keeps the interface concrete and legally grounded. The environment validates the action, executes it in a real Showdown battle, and returns the next state, reward, and episode metadata.

- ### Reward design
+ ## Reward

- The environment reward is shaped but still tied to battle outcomes.
-
- Reward computation lives in:
-
- - [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/reward.py)
-
- The reward includes:
-
- - damage dealt to the opponent
- - damage taken by the agent
+ Rewards are shaped, but still tied to battle progress. The signal includes:
+
+ - damage dealt and damage taken
  - knockouts and faint penalties
  - healing value
  - setup value and opponent setup penalties
- - passive damage value
- - status penalties
+ - passive damage and status effects
-
- The environment wrapper in [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/src/smogon_rl/openenv_sync_env.py) adds practical rollout constraints:
-
- - illegal action fallback handling
  - illegal action penalties
- - anti-stall living penalty
- - battle length caps
- - no-progress termination penalties
+ - small anti-stall / truncation penalties
-
- This separation is intentional:
-
- - `reward.py` captures battle-quality shaping
- - the env wrapper handles rollout hygiene and training throughput

- ## Training design
-
- ### 1. Warm-up SFT
-
- The notebook begins with a supervised warm-up stage so the model learns to emit valid action JSON for the battle-state prompt format.
-
- This does not claim strategic mastery. It only ensures the model is good enough to participate in the environment without collapsing into malformed outputs.
-
- ### 2. Real rollout collection
-
- The policy is then run in real Pokemon Showdown battles. For each turn, the notebook stores:
-
- - `prompt`
- - `collected_action`
- - `collected_reward`
-
- This makes the rollout data usable for GRPO training while preserving the exact environment reward signal.
-
- ### 3. GRPO training
-
- The GRPO reward used in the notebook is a wrapper around the stored rollout reward.
-
- It is designed to preserve ranking pressure inside a completion group:
-
- - malformed output is penalized strongly
- - valid but different actions are penalized lightly
- - the action matching the executed rollout action receives the collected environment reward plus a positive margin
-
- That matters because raw rollout rewards alone do not always create a clean learning signal for group-relative optimization.
-
- ## How it works end to end
-
- 1. Start Pokemon Showdown locally in Colab.
- 2. Create the OpenEnv-style synchronous environment.
- 3. Format battle state into markdown.
- 4. Enumerate legal actions.
- 5. Generate one JSON action from the model.
- 6. Execute the action in the environment.
- 7. Receive next state, reward, done flag, and info.
- 8. Store rollout rows.
- 9. Train with GRPO on the collected rows.
-
- ## How to use
-
- ### Local package install
-
- From the repo root:
-
+ The goal is to create a denser learning signal without turning the task into a toy proxy objective.
+
+ ## Training workflow
+
+ The training path is:
+
+ 1. start local Pokemon Showdown in Colab
+ 2. collect real rollout trajectories from live battles
+ 3. store prompt, chosen action, and environment reward
+ 4. train a LoRA adapter with GRPO on those real trajectories
+ 5. benchmark checkpoints against each other on the same env budget
+
+ The repo includes both the training notebook and a smaller watch notebook for running one live battle with a chosen checkpoint.
+
+ ## OpenEnv package
+
+ Yes, this repo includes a real OpenEnv environment package.
+
+ The deployable server lives at:
+
+ - `env/server/app.py`
+ - `env/server/environment.py`
+ - `env/models.py`
+
+ The OpenEnv config points to:
+
+ ```yaml
+ app: env.server.app:app
+ ```
+
+ That package wraps the local `PokemonShowdownEnv` and exposes it through the OpenEnv server interface.
+
+ ## Local usage
+
+ Install the local package:
+
+ ```bash
+ python3 -m pip install -e .
+ ```
+
+ Run a simple local episode:
+
+ ```bash
+ python3 examples/run_single_episode.py
+ ```
+
+ Run one watched model battle:
+
  ```bash
- python3 -m pip install -e .
+ python3 examples/watch_model_battle.py --revision grpo-qwen3-4b-run2
  ```

- ### Colab training
-
- Open [`/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb`](/Users/atharva/Desktop/Projects/OpenEnv-WolfeClick/trainer.ipynb) in Colab and run it top to bottom.
-
- The notebook does the following:
-
- 1. clones or uses the repo
- 2. installs the training stack
- 3. loads the model and LoRA adapter
- 4. starts a local Pokemon Showdown server
- 5. runs JSON warm-up SFT
- 6. collects rollout data from real battles
- 7. trains with GRPO
- 8. optionally saves the adapter to Hugging Face Hub
-
- ### Requirements
-
- - GPU runtime in Colab
- - local Pokemon Showdown server started from the notebook
- - Hugging Face token only if you want to push adapters
-
- ## Current status
-
- This repo now has a working end-to-end path where:
-
- - real battle rollouts are collected from the environment
- - valid action JSON is produced reliably after warm-up
- - GRPO can train on real rollout data in the non-quantized plain TRL path
-
- This is the basis for my hackathon demo and benchmark runs.
-
- ## Submission notes
-
- This repo is intended to be my clean hackathon submission repo.
-
- Linked artifacts to add before submission:
-
- - Hugging Face model repo: [https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1](https://huggingface.co/Atharva2099/openenv-smogon-rl/tree/grpo-qwen3-4b-run1)
- - Hugging Face Space using OpenEnv stable release `0.2.1`
- - benchmark/results file
- - 1-minute demo video
+ ## Colab usage
+
+ - `trainer.ipynb`: collect rollouts and train with GRPO
+ - `watch_battle.ipynb`: start Showdown, load a checkpoint, and run one live battle
+ - `benckmarks/benchmark.ipynb`: compare checkpoints quickly
+
+ These notebooks assume a GPU runtime for model inference/training.
+
+ ## Deployment
+
+ The repo includes the files needed for an OpenEnv-style deployment:
+
+ - `openenv.yaml`
+ - `Dockerfile`
+ - `env/` package
+
+ The Docker image starts:
+
+ - local Pokemon Showdown on port `8000`
+ - the OpenEnv FastAPI server on port `8001`
+
+ ## Current artifacts
+
+ - HF model repo: `Atharva2099/openenv-smogon-rl`
+ - adapter revisions: `grpo-qwen3-4b-run1`, `grpo-qwen3-4b-run2`
+
+ ## Status
+
+ This is a working end-to-end environment and training repo:
+
+ - live battle rollouts work
+ - GRPO training on real trajectories works
+ - checkpoint benchmarking works
+ - the OpenEnv server package exists in-repo
+
+ The next useful polish steps are HF Spaces deployment validation, README/blog cleanup, and a short write-up/demo around the environment design and results.
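The README above describes a constrained JSON action interface that is validated against the turn's legal actions. A standalone sketch of that validation step — the real logic lives in `src/smogon_rl/action_space.py` and enumerates the legal set from the live battle state; the set here is hypothetical:

```python
import json

# Hypothetical per-turn legal action set; in the real env this is
# enumerated from the current battle state each turn.
LEGAL_ACTIONS = {
    "move": {"Thunderbolt", "Volt Switch"},
    "switch": {"Garchomp"},
}

def validate(raw: str) -> tuple[bool, str]:
    """Parse a model completion and check it against the legal set."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    kind = obj.get("action")
    if kind not in LEGAL_ACTIONS:
        return False, "unknown action type"
    if obj.get("choice") not in LEGAL_ACTIONS[kind]:
        return False, "choice not legal this turn"
    return True, "ok"

print(validate('{"action": "move", "choice": "Thunderbolt"}'))  # (True, 'ok')
print(validate('{"action": "move", "choice": "Surf"}'))  # (False, 'choice not legal this turn')
```

Forcing every completion through a check like this is what lets the env enforce a concrete legal interface instead of accepting free-form action descriptions.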
 
 
 
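The training section removed by this commit describes a GRPO reward wrapper: malformed output penalized strongly, valid-but-different actions penalized lightly, and the executed rollout action credited with the collected environment reward plus a positive margin. A minimal sketch of that scheme, with illustrative constants rather than the notebook's actual values:

```python
import json

# Illustrative constants, not the notebook's actual values.
MALFORMED_PENALTY = -2.0
MISMATCH_PENALTY = -0.1
MATCH_MARGIN = 0.5

def grpo_reward(completion: str, rollout_action: str, collected_reward: float) -> float:
    """Score one completion in a GRPO group against the stored rollout row."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return MALFORMED_PENALTY  # malformed output: strong penalty
    if action.get("action") not in ("move", "switch") or "choice" not in action:
        return MALFORMED_PENALTY  # structurally invalid action
    if json.loads(rollout_action) == action:
        # The executed action keeps its environment reward plus a margin,
        # so it ranks above merely-valid alternatives in the group.
        return collected_reward + MATCH_MARGIN
    return MISMATCH_PENALTY  # valid but different action: light penalty
```

The margin preserves ranking pressure inside a completion group even when the raw collected reward is small or negative.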
env/__init__.py ADDED
@@ -0,0 +1,13 @@
+ """
+ OpenEnv-compatible environment package for the WolfeClick Pokemon RL env.
+
+ Exports the model types so users can do:
+
+     from env import WolfeClickAction, WolfeClickObservation, WolfeClickState
+ """
+
+ from .models import WolfeClickAction, WolfeClickObservation, WolfeClickState
+
+ __all__ = ["WolfeClickAction", "WolfeClickObservation", "WolfeClickState"]
env/models.py ADDED
@@ -0,0 +1,40 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from typing import Any, Dict
+
+ from openenv.core.env_server import Action, Observation, State
+
+
+ @dataclass
+ class WolfeClickAction(Action):
+     """Single step action for the environment.
+
+     This wraps the constrained JSON interface already used by the local env:
+
+         {"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
+     """
+
+     action_json: str
+
+
+ @dataclass
+ class WolfeClickObservation(Observation):
+     """Markdown battle state plus metadata."""
+
+     state_markdown: str
+     # Whether the episode is finished (battle over or truncated).
+     done: bool = False
+     # Shaped reward from the environment.
+     reward: float = 0.0
+     # Free-form metadata mirrored from the underlying env's info dict.
+     metadata: Dict[str, Any] | None = None
+
+
+ class WolfeClickState(State):
+     """Thin wrapper around the core State model."""
+
+     # We rely on the base State fields (episode_id, step_count).
+     # Any extra per-episode bookkeeping can live on the environment itself.
+     pass
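A quick stdlib-only illustration of how these dataclasses carry data. The sketch re-declares minimal stand-ins instead of importing `openenv`, so the base-class behavior of `Action`/`Observation` is not reproduced:

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any, Dict

# Minimal stand-ins for the real models, which subclass openenv's
# Action / Observation base types (not imported here).
@dataclass
class ActionSketch:
    action_json: str

@dataclass
class ObservationSketch:
    state_markdown: str
    done: bool = False
    reward: float = 0.0
    metadata: Dict[str, Any] | None = None

# The action wraps the raw JSON string the model emitted.
action = ActionSketch(json.dumps({"action": "switch", "choice": "Garchomp"}))
# The observation mirrors what the env wrapper returns each step.
obs = ObservationSketch(state_markdown="## Turn 1", reward=0.25,
                        metadata={"battle_index": 1})

print(json.loads(action.action_json)["choice"])  # Garchomp
print(asdict(obs)["done"])  # False
```

Keeping the action as a raw JSON string (rather than parsed fields) lets the server pass the model output through to the env's existing validation unchanged.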
env/server/__init__.py ADDED
@@ -0,0 +1,2 @@
+ """Server package for the WolfeClick OpenEnv environment."""
env/server/app.py ADDED
@@ -0,0 +1,24 @@
+ from __future__ import annotations
+
+ from openenv.core.env_server import create_app
+
+ from env.models import WolfeClickAction, WolfeClickObservation
+ from env.server.environment import WolfeClickEnvironment
+
+ app = create_app(
+     WolfeClickEnvironment,
+     WolfeClickAction,
+     WolfeClickObservation,
+     env_name="openenv-wolfeclick",
+ )
+
+
+ def main() -> None:
+     import uvicorn
+
+     uvicorn.run(app, host="0.0.0.0", port=8001)
+
+
+ if __name__ == "__main__":
+     main()
env/server/environment.py ADDED
@@ -0,0 +1,49 @@
+ from __future__ import annotations
+
+ import uuid
+ from typing import Any, Dict
+
+ from openenv.core.env_server import Environment
+
+ from env.models import WolfeClickAction, WolfeClickObservation, WolfeClickState
+ from smogon_rl.config import EnvConfig
+ from smogon_rl.openenv_sync_env import PokemonShowdownEnv
+
+
+ class WolfeClickEnvironment(Environment[WolfeClickAction, WolfeClickObservation, WolfeClickState]):
+     """OpenEnv server wrapper around the local PokemonShowdownEnv."""
+
+     def __init__(self) -> None:
+         super().__init__()
+         self._env = PokemonShowdownEnv(config=EnvConfig())
+         # Underlying State tracks episode_id and step_count; we keep a separate battle counter.
+         self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
+         self._battle_index: int = 0
+
+     def reset(self, **kwargs: Any) -> WolfeClickObservation:
+         """Start a new battle and return the initial observation."""
+         self._battle_index += 1
+         self._state = WolfeClickState(episode_id=str(uuid.uuid4()), step_count=0)
+         state_str = self._env.reset()
+         return WolfeClickObservation(
+             state_markdown=state_str,
+             done=False,
+             reward=0.0,
+             metadata={"battle_index": self._battle_index},
+         )
+
+     def step(self, action: WolfeClickAction, **kwargs: Any) -> WolfeClickObservation:
+         """Apply one JSON action and return the next observation."""
+         self._state.step_count += 1  # type: ignore[attr-defined]
+         obs_str, reward, done, info = self._env.step(action.action_json)
+         return WolfeClickObservation(
+             state_markdown=obs_str,
+             done=bool(done),
+             reward=float(reward),
+             metadata=info or {},
+         )
+
+     @property
+     def state(self) -> WolfeClickState:
+         return self._state
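The wrapper above ultimately drives a plain `reset()` / `step()` loop. A self-contained sketch of that loop, with a stub standing in for `PokemonShowdownEnv` (which needs a live Showdown server); the stub's rewards and termination rule are invented:

```python
import json

class StubEnv:
    """Stand-in for PokemonShowdownEnv: same step() return shape
    (obs, reward, done, info), but no Showdown server required."""

    def __init__(self) -> None:
        self.turn = 0

    def reset(self) -> str:
        self.turn = 0
        return "## Turn 0\nLegal: move Tackle"

    def step(self, action_json: str):
        # A real env would validate the JSON action against the legal set here.
        assert json.loads(action_json)["action"] in ("move", "switch")
        self.turn += 1
        done = self.turn >= 3  # invented termination rule
        return f"## Turn {self.turn}", 0.1, done, {"turn": self.turn}

env = StubEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    # In training, the policy generates this JSON from the markdown state.
    action = json.dumps({"action": "move", "choice": "Tackle"})
    obs, reward, done, info = env.step(action)
    total += reward

print(round(total, 1))  # 0.3
```

Each loop iteration corresponds to one rollout row (prompt, chosen action, reward) that the trainer later consumes.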