Atharva committed on
Commit a8df3de · 0 Parent(s)

Initial hackathon submission export
.gitignore ADDED
@@ -0,0 +1,21 @@
# Environment and secrets — never commit
.env
.env.local
*.env

# Python
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
dist/
build/

# Virtual environments
.venv/
venv/
env/

# IDE / OS
.idea/
.DS_Store
README.md ADDED
@@ -0,0 +1,195 @@
# OpenEnv-WolfeClick

OpenEnv-WolfeClick is a reinforcement learning environment and training workflow for competitive Pokemon battles with large language models.

The project was built for the OpenEnv hackathon to answer a specific question: can an LLM learn to act in a partially observable, adversarial, long-horizon environment where legal actions are constrained, rewards are delayed, and the opponent is another agent?

This repo focuses on that environment and a minimal Colab training path.

## Why I Built This

Pokemon battles are a strong multi-agent training environment for LLMs because they require:

- hidden information and opponent modeling
- long-horizon planning over many turns
- legal action grounding under a constrained action space
- adapting to a changing world state after every action
- balancing local rewards against later consequences

I built this environment to make those properties trainable with a simple `reset()` / `step()` loop and a small JSON action interface.

## What is in this repo

- [`src/smogon_rl`](src/smogon_rl): environment, state formatting, action space, reward shaping, and client code
- [`trainer.ipynb`](trainer.ipynb): main Colab notebook for warm-up SFT, rollout collection, and GRPO training
- [`examples`](examples): small local examples
- [`pyproject.toml`](pyproject.toml): package metadata

## Environment design

### State design

The state is not a raw simulator dump. It is a structured markdown representation designed to preserve strategic information while remaining readable to an LLM.

Each prompt includes:

- the active self Pokemon
- the active opponent Pokemon
- HP, status, ability, item, and current stat modifiers
- the full self team roster with currently known moves
- opponent history and revealed information
- the exact legal actions available this turn

This is implemented in the environment wrapper and state formatter:

- [`src/smogon_rl/openenv_sync_env.py`](src/smogon_rl/openenv_sync_env.py)
- [`src/smogon_rl/state_formatter.py`](src/smogon_rl/state_formatter.py)

My design goal was to expose enough information for strategic decisions without giving the model shortcuts that bypass the game structure.

### Action design

The action space is deliberately constrained.

The model must emit exactly one JSON object:

```json
{"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
```

At every step, legal actions are enumerated from the current battle state using:

- [`src/smogon_rl/action_space.py`](src/smogon_rl/action_space.py)

This module does three important things:

- enumerates legal moves and switches for the turn
- builds the action instruction block shown to the model
- validates model outputs against the legal action set

This matters because I do not want the model to “sort of” describe an action. I want the environment to enforce a concrete legal interface.

### Reward design

The environment reward is shaped but still tied to battle outcomes.

Reward computation lives in:

- [`src/smogon_rl/reward.py`](src/smogon_rl/reward.py)

The reward includes:

- damage dealt to the opponent
- damage taken by the agent
- knockouts and faint penalties
- healing value
- setup value and opponent setup penalties
- passive damage value
- status penalties

The environment wrapper in [`src/smogon_rl/openenv_sync_env.py`](src/smogon_rl/openenv_sync_env.py) adds practical rollout constraints:

- illegal action fallback handling
- illegal action penalties
- an anti-stall living penalty
- battle length caps
- no-progress termination penalties

This separation is intentional:

- `reward.py` captures battle-quality shaping
- the env wrapper handles rollout hygiene and training throughput

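As an illustration of the shaping idea only — the actual terms and coefficients live in `reward.py` and the env wrapper, and every number below is invented — a per-turn reward of this general shape:

```python
def shaped_reward(
    dmg_dealt: float,  # fraction of opponent HP removed this turn
    dmg_taken: float,  # fraction of own HP lost this turn
    kos: int,          # opponent Pokemon knocked out this turn
    faints: int,       # own Pokemon fainted this turn
    illegal: bool,     # model emitted an unparseable/illegal action
) -> float:
    # All coefficients are invented for illustration; the real values
    # live in reward.py and EnvConfig.
    if illegal:
        return -5.0
    reward = 2.0 * dmg_dealt - 1.0 * dmg_taken
    reward += 3.0 * kos - 3.0 * faints
    reward += -0.05  # anti-stall living penalty from the env wrapper
    return reward

print(shaped_reward(0.4, 0.1, 1, 0, False))  # → 3.65
```

The point is the structure: progress toward winning is rewarded every turn, stalling and illegal actions are taxed, and the terminal outcome still dominates through KO terms.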
## Training design

### 1. Warm-up SFT

The notebook begins with a supervised warm-up stage so the model learns to emit valid action JSON for the battle-state prompt format.

This does not claim strategic mastery. It only ensures the model is good enough to participate in the environment without collapsing into malformed outputs.

### 2. Real rollout collection

The policy is then run in real Pokemon Showdown battles. For each turn, the notebook stores:

- `prompt`
- `collected_action`
- `collected_reward`

This makes the rollout data usable for GRPO training while preserving the exact environment reward signal.

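Each stored row is just a dict with those three keys. A minimal sketch of the collection pattern, with stand-ins for the env and policy (the real loop lives in `trainer.ipynb`):

```python
# Stand-ins so the collection pattern runs standalone; in the notebook
# these are the real PokemonShowdownEnv and the LoRA policy being trained.
def stub_policy(prompt: str) -> str:
    return '{"action": "move", "choice": "surf"}'

def stub_env_step(action_json: str):
    return "## Battle State (next turn)", 0.5, True, {}

rollout_rows = []
prompt = "## Battle State\n..."
action = stub_policy(prompt)
_obs, reward, _done, _info = stub_env_step(action)
rollout_rows.append(
    {
        "prompt": prompt,
        "collected_action": action,
        "collected_reward": reward,
    }
)
print(rollout_rows[0]["collected_reward"])  # → 0.5
```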
121
+ ### 3. GRPO training
122
+
123
+ The GRPO reward used in the notebook is a wrapper around the stored rollout reward.
124
+
125
+ It is designed to preserve ranking pressure inside a completion group:
126
+
127
+ - malformed output is penalized strongly
128
+ - valid but different actions are penalized lightly
129
+ - the action matching the executed rollout action receives the collected environment reward plus a positive margin
130
+
131
+ That matters because raw rollout rewards alone do not always create a clean learning signal for group-relative optimization.
132
+
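A sketch of that wrapper logic (the penalty values and margin here are illustrative, not the notebook's exact numbers):

```python
def grpo_reward(
    completion: str,
    rollout_action: str,
    rollout_reward: float,
    legal_actions: set[str],
) -> float:
    """Score one completion in a GRPO group against a stored rollout row."""
    action = completion.strip()
    if action not in legal_actions:
        # Malformed or illegal output: strong penalty.
        return -10.0
    if action == rollout_action:
        # The action that was actually executed: env reward plus a margin.
        return rollout_reward + 1.0
    # Legal but different from the executed action: light penalty.
    return -0.5

legal_set = {
    '{"action": "move", "choice": "surf"}',
    '{"action": "switch", "choice": "Blissey"}',
}
print(grpo_reward('{"action": "move", "choice": "surf"}',
                  '{"action": "move", "choice": "surf"}', 2.0, legal_set))  # → 3.0
```

Because GRPO normalizes rewards within a completion group, the fixed gaps between these three tiers are what drive the gradient, not the absolute values.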
## How it works end to end

1. Start Pokemon Showdown locally in Colab.
2. Create the OpenEnv-style synchronous environment.
3. Format the battle state into markdown.
4. Enumerate legal actions.
5. Generate one JSON action from the model.
6. Execute the action in the environment.
7. Receive the next state, reward, done flag, and info.
8. Store rollout rows.
9. Train with GRPO on the collected rows.

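Steps 2–8 reduce to the usual interaction loop. Shown here against a stub env so it runs standalone; the real version of this loop is in `examples/run_single_episode.py`:

```python
import json

class StubEnv:
    # Stand-in for PokemonShowdownEnv so the loop shape runs standalone.
    def __init__(self) -> None:
        self.turns = 0

    def reset(self) -> str:
        self.turns = 0
        return "## Battle State (turn 1)"

    def step(self, action_json: str):
        self.turns += 1
        done = self.turns >= 3  # pretend the battle ends after 3 steps
        return f"## Battle State (turn {self.turns + 1})", 0.1, done, {}

env = StubEnv()
obs = env.reset()
total, done = 0.0, False
while not done:
    # A real policy would condition on `obs` and the instructions in `info`.
    action = json.dumps({"action": "move", "choice": "surf"})
    obs, reward, done, info = env.step(action)
    total += reward
print(round(total, 1))  # → 0.3
```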
## How to use

### Local package install

From the repo root:

```bash
python3 -m pip install -e .
```

### Colab training

Open [`trainer.ipynb`](trainer.ipynb) in Colab and run it top to bottom.

The notebook does the following:

1. clones or uses the repo
2. installs the training stack
3. loads the model and LoRA adapter
4. starts a local Pokemon Showdown server
5. runs JSON warm-up SFT
6. collects rollout data from real battles
7. trains with GRPO
8. optionally saves the adapter to the Hugging Face Hub

### Requirements

- a GPU runtime in Colab
- a local Pokemon Showdown server started from the notebook
- a Hugging Face token, only if you want to push adapters

## Current status

This repo now has a working end-to-end path where:

- real battle rollouts are collected from the environment
- valid action JSON is produced reliably after warm-up
- GRPO can train on real rollout data in the non-quantized plain TRL path

This is the basis for my hackathon demo and benchmark runs.

## Submission notes

This repo is intended to be my clean hackathon submission repo.

Linked artifacts to add before submission:

- Hugging Face model repo
- Hugging Face Space using OpenEnv stable release `0.2.1`
- benchmark/results file
- 1-minute demo video
examples/run_single_episode.py ADDED
@@ -0,0 +1,52 @@
from __future__ import annotations

import json

from smogon_rl.action_space import ActionOption, enumerate_actions
from smogon_rl.config import EnvConfig
from smogon_rl.openenv_sync_env import PokemonShowdownEnv


def main() -> None:
    config = EnvConfig()
    env = PokemonShowdownEnv(config=config)

    print("Starting a single gen4randombattle episode.")
    obs = env.reset()
    print("Initial state (truncated):")
    print("\n".join(obs.splitlines()[:40]))

    done = False
    total_reward = 0.0
    step_idx = 0

    while not done and step_idx < config.max_steps_per_battle:
        step_idx += 1
        print(f"\n=== Step {step_idx} ===")

        # Naive policy: query valid actions from the environment and always pick
        # the first one. A real agent would send `obs` and `info["instructions"]`
        # to an LLM and use its JSON response here.
        battle = env._ensure_battle()  # type: ignore[attr-defined]
        valid_actions = enumerate_actions(battle)
        if not valid_actions:
            print("No valid actions available; terminating.")
            break

        chosen: ActionOption = valid_actions[0]
        action_json = {"action": chosen.action_type, "choice": chosen.choice}
        obs, reward, done, info = env.step(json.dumps(action_json))

        total_reward += reward
        print(f"Chosen action: {action_json}")
        print(f"Reward: {reward:.3f}, Done: {done}")
        print("State (truncated):")
        print("\n".join(obs.splitlines()[:20]))

    print(f"\nTotal reward: {total_reward}")


if __name__ == "__main__":
    main()
pyproject.toml ADDED
@@ -0,0 +1,30 @@
[project]
name = "smogon-rl"
version = "0.1.0"
description = "Theory-of-Mind Pokémon RL environment using poke-env and OpenEnv."
readme = "README.md"
requires-python = ">=3.10"
authors = [
    { name = "Atharva" }
]
dependencies = [
    "poke-env>=0.8.0,<0.9.0",
    "numpy>=1.24.0",
    "pydantic>=2.0.0",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0.0",
    "ruff>=0.5.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv]
package = "smogon-rl"

[tool.uv.sources]
src/smogon_rl/__init__.py ADDED
@@ -0,0 +1,13 @@
"""
Smogon-RL core package.

This package provides:
- An async poke-env client for Pokémon Showdown battles.
- A synchronous, OpenEnv-style wrapper exposing reset/step.
- State formatting, action space handling, and reward shaping utilities.
"""

from .config import DEFAULT_BATTLE_FORMAT

__all__ = ["DEFAULT_BATTLE_FORMAT"]
src/smogon_rl/action_space.py ADDED
@@ -0,0 +1,138 @@
from __future__ import annotations

import json
import re
from dataclasses import dataclass
from typing import List, Literal, Optional

from pydantic import BaseModel, ValidationError
from poke_env.environment.battle import Battle
from poke_env.environment.move import Move
from poke_env.environment.pokemon import Pokemon

# Match a single JSON object with "action" and "choice" (handles <think>...</think> + JSON).
_ACTION_JSON_RE = re.compile(
    r'\{\s*"action"\s*:\s*"(?:move|switch)"\s*,\s*"choice"\s*:\s*"[^"]*"\s*\}',
    re.IGNORECASE,
)


ActionType = Literal["move", "switch"]


@dataclass
class ActionOption:
    """Concrete action option available in the current state."""

    action_type: ActionType
    choice: str
    move: Optional[Move] = None
    pokemon: Optional[Pokemon] = None


class ActionJSON(BaseModel):
    """Strict JSON schema the LLM must output."""

    action: ActionType
    choice: str


def enumerate_actions(battle: Battle) -> List[ActionOption]:
    """Enumerate up to 4 moves and up to 5 switches for the current state."""
    options: List[ActionOption] = []

    # Moves
    for move in battle.available_moves[:4]:
        if getattr(move, "current_pp", 1) <= 0:
            continue
        choice = move.id
        options.append(ActionOption(action_type="move", choice=choice, move=move))

    # Switches
    for pokemon in battle.available_switches[:5]:
        if pokemon.fainted:
            continue
        choice = pokemon.species or pokemon.nickname or "Unknown"
        options.append(
            ActionOption(action_type="switch", choice=choice, pokemon=pokemon)
        )

    return options


def _normalize_choice(s: str) -> str:
    """Normalize choice for comparison: lowercase, spaces to hyphens (matches poke-env move ids)."""
    return s.strip().lower().replace(" ", "-")


def extract_action_json_from_text(text: str) -> Optional[str]:
    """Extract a single action JSON object from model output that may contain thinking or prose.

    Strips think tags first, then looks for our schema in the remainder (or in the full string).
    Returns the first matching JSON substring, or None if none found.
    """
    if not text or not text.strip():
        return None
    # Strip think blocks first so we prefer content after thinking.
    stripped = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    for candidate in (stripped, text):
        match = _ACTION_JSON_RE.search(candidate)
        if match:
            return match.group(0)
    return None


def parse_llm_action(raw_output: str, valid_actions: List[ActionOption]) -> ActionJSON:
    """Parse and validate the LLM JSON output against the current action set.

    The model must output:

        {
            "action": "move" | "switch",
            "choice": "Exact Name of Move or Pokemon"
        }

    Choice matching is case-insensitive and normalizes spaces to hyphens so
    "Flamethrower" and "Thunder Wave" match env ids "flamethrower" and "thunder-wave".
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model output is not valid JSON: {exc}") from exc

    try:
        action = ActionJSON.model_validate(payload)
    except ValidationError as exc:
        raise ValueError(f"Model JSON does not match schema: {exc}") from exc

    want_norm = _normalize_choice(action.choice)
    matched = None
    for a in valid_actions:
        if a.action_type != action.action:
            continue
        if _normalize_choice(a.choice) == want_norm:
            matched = a
            break
    if matched is None:
        valid_desc = [
            {"action": a.action_type, "choice": a.choice} for a in valid_actions
        ]
        raise ValueError(
            f"Invalid action selection {action.model_dump()}. "
            f"Valid options are: {valid_desc}"
        )
    # Return with the env's exact choice string so downstream uses the right id.
    return ActionJSON(action=action.action, choice=matched.choice)


def build_action_instructions(valid_actions: List[ActionOption]) -> str:
    """Build a short instruction string describing the JSON schema and options."""
    lines = [
        "You must choose exactly one action and output pure JSON with this schema:",
        "",
        '{"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}',
        "",
        "Valid options for this state:",
    ]
    for opt in valid_actions:
        lines.append(f"- action: {opt.action_type!r}, choice: {opt.choice!r}")
    return "\n".join(lines)
src/smogon_rl/config.py ADDED
@@ -0,0 +1,28 @@
from __future__ import annotations

from dataclasses import dataclass


DEFAULT_BATTLE_FORMAT = "gen4randombattle"


@dataclass
class EnvConfig:
    """Configuration for the Pokémon RL environment."""

    battle_format: str = DEFAULT_BATTLE_FORMAT
    # Hard cap to prevent very long battles from dominating rollout wall-time.
    max_steps_per_battle: int = 30
    poll_interval_seconds: float = 0.2
    open_timeout: float = 25.0
    show_replays: bool = False
    verbose_logging: bool = False
    log_every_n_steps: int = 25
    poll_heartbeat_seconds: float = 5.0
    min_battle_reward: float = -100.0
    max_no_progress_steps: int = 2
    # Small per-step time penalty to bias toward faster, decisive games.
    step_living_penalty: float = -0.05
    # Additional truncation/timeout penalties.
    no_progress_termination_penalty: float = -1.0
    max_steps_termination_penalty: float = -2.0
src/smogon_rl/openenv_sync_env.py ADDED
@@ -0,0 +1,260 @@
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple

from poke_env.environment.battle import Battle
from poke_env.player.battle_order import BattleOrder

from .action_space import (
    ActionJSON,
    ActionOption,
    build_action_instructions,
    enumerate_actions,
    extract_action_json_from_text,
    parse_llm_action,
)
from .config import EnvConfig
from .pokeenv_client import PokeEnvClient
from .reward import (
    BattleStateSummary,
    ILLEGAL_ACTION_PENALTY,
    RewardTrackingState,
    calculate_reward,
    count_new_passive_hits_for_turn,
    summarize_battle_state,
)
from .state_formatter import OpponentHistoryTracker, format_battle_state


@dataclass
class PokemonShowdownEnv:
    """Synchronous, OpenEnv-style wrapper around a poke-env battle.

    The environment exposes a simple Gymnasium-like / OpenEnv-like API:

        obs = env.reset()
        obs, reward, done, info = env.step(action_json_str)

    where `action_json_str` is a JSON string describing a move or switch using
    the constrained 9-action space.
    """

    config: EnvConfig = field(default_factory=EnvConfig)
    _client: PokeEnvClient = field(init=False)
    _opponent_history: OpponentHistoryTracker = field(init=False)
    _reward_trackers: RewardTrackingState = field(init=False)
    _prev_state: Optional[BattleStateSummary] = field(init=False, default=None)
    _steps_this_battle: int = field(init=False, default=0)
    # Running total of passive hits — updated O(k) per step via the single-turn
    # scanner, never by re-scanning the full observation history.
    _cumulative_passive_hits: int = field(init=False, default=0)
    _battle_index: int = field(init=False, default=0)
    _battle_reward_total: float = field(init=False, default=0.0)
    _no_progress_steps: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        self._client = PokeEnvClient(config=self.config)
        self._opponent_history = OpponentHistoryTracker()
        self._reward_trackers = RewardTrackingState()

    def _log(self, message: str) -> None:
        if self.config.verbose_logging:
            print(f"[PokemonShowdownEnv] {message}", flush=True)

    # ------------------------------------------------------------------ API

    def reset(self) -> str:
        """Start a new battle and return the initial markdown state."""
        self._battle_index += 1
        self._client.start_new_battle()
        self._opponent_history = OpponentHistoryTracker()
        self._reward_trackers = RewardTrackingState()
        self._steps_this_battle = 0
        self._cumulative_passive_hits = 0
        self._battle_reward_total = 0.0
        self._no_progress_steps = 0

        battle = self._wait_for_battle_or_raise()
        self._log(
            f"Battle {self._battle_index} started at turn={battle.turn} "
            f"(format={self.config.battle_format})."
        )
        self._prev_state = summarize_battle_state(battle, self._cumulative_passive_hits)
        return format_battle_state(battle, self._opponent_history)

    def step(self, action_json: str | Dict[str, Any]) -> Tuple[str, float, bool, Dict[str, Any]]:
        """Apply one action and return (state_str, reward, done, info)."""
        battle = self._ensure_battle()
        if battle.finished:
            raise RuntimeError("Cannot call step() on a finished battle. Call reset().")

        self._steps_this_battle += 1
        if self._steps_this_battle > self.config.max_steps_per_battle:
            return self._terminal_from_truncation(battle)

        valid_actions = enumerate_actions(battle)
        if isinstance(action_json, dict):
            raw = json.dumps(action_json)
        else:
            raw = action_json

        used_fallback = False
        try:
            parsed = parse_llm_action(raw, valid_actions)
            order = self._to_battle_order(parsed, valid_actions, battle)
        except ValueError:
            extracted = extract_action_json_from_text(raw)
            if extracted is not None:
                try:
                    parsed = parse_llm_action(extracted, valid_actions)
                    order = self._to_battle_order(parsed, valid_actions, battle)
                except ValueError:
                    used_fallback = True
            else:
                used_fallback = True
        if used_fallback:
            opt = valid_actions[0]
            from poke_env.player import Player as PlayerCls
            if opt.action_type == "move" and opt.move is not None:
                order = PlayerCls.create_order(opt.move)
            else:
                order = PlayerCls.create_order(opt.pokemon)

        previous_turn = battle.turn
        self._client.send_action(order)
        new_battle = self._client.wait_for_battle_update(previous_turn) or battle

        # Increment the passive-hit counter by scanning only the turn that just
        # resolved — O(k) where k = events on that single turn, not O(total turns).
        self._cumulative_passive_hits += count_new_passive_hits_for_turn(
            new_battle, previous_turn
        )

        prev_state = self._prev_state or summarize_battle_state(battle, self._cumulative_passive_hits)
        curr_state = summarize_battle_state(new_battle, self._cumulative_passive_hits)

        active = new_battle.active_pokemon
        opponent_active = new_battle.opponent_active_pokemon

        if used_fallback:
            reward = ILLEGAL_ACTION_PENALTY
        else:
            reward = calculate_reward(
                prev_state=prev_state,
                curr_state=curr_state,
                action=ActionJSON(action=parsed.action, choice=parsed.choice),
                trackers=self._reward_trackers,
                active=active,
                opponent_active=opponent_active,
            )
        # Small time cost per turn to discourage excessively long battles.
        reward += self.config.step_living_penalty

        self._prev_state = curr_state
        if new_battle.turn == previous_turn and not new_battle.finished:
            self._no_progress_steps += 1
        else:
            self._no_progress_steps = 0

        done_reason: Optional[str] = None
        done = False
        if new_battle.finished:
            done = True
            done_reason = "battle_finished"
        elif self._steps_this_battle >= self.config.max_steps_per_battle:
            done = True
            done_reason = "max_steps"
            reward += self.config.max_steps_termination_penalty
        elif (self._battle_reward_total + reward) <= self.config.min_battle_reward:
            done = True
            done_reason = "min_battle_reward"
        elif self._no_progress_steps >= self.config.max_no_progress_steps:
            done = True
            done_reason = "no_progress_timeout"
            reward += self.config.no_progress_termination_penalty

        self._battle_reward_total += reward

        # If we terminate early (not a natural finished battle), forfeit cleanly
        # so the next reset starts from a free player/session state.
        if done and not new_battle.finished and done_reason in {
            "max_steps",
            "min_battle_reward",
            "no_progress_timeout",
        }:
            try:
                self._client.forfeit_current_battle()
            except Exception:
                pass

        obs = format_battle_state(new_battle, self._opponent_history)
        info: Dict[str, Any] = {
            "turn": new_battle.turn,
            "valid_actions": [
                {"action": a.action_type, "choice": a.choice} for a in valid_actions
            ],
            "instructions": build_action_instructions(valid_actions),
            "battle_finished": new_battle.finished,
            "reason": done_reason,
            "action_illegal": used_fallback,
            "battle_reward_total": self._battle_reward_total,
            "no_progress_steps": self._no_progress_steps,
        }
        if self.config.verbose_logging:
            should_log_step = (
                used_fallback
                or done
                or self._steps_this_battle == 1
                or self._steps_this_battle % max(1, self.config.log_every_n_steps) == 0
            )
            if should_log_step:
                self._log(
                    f"battle={self._battle_index} step={self._steps_this_battle} "
                    f"turn={new_battle.turn} reward={reward:.3f} "
                    f"running_reward={self._battle_reward_total:.3f} "
                    f"illegal_action={used_fallback} done={done}"
                )
        return obs, reward, done, info

    # ------------------------------------------------------------------ helpers

    def _wait_for_battle_or_raise(self) -> Battle:
        battle = self._client.battle
        if battle is None:
            battle = self._client.wait_for_battle_update(previous_turn=0)
        if battle is None:
            raise RuntimeError("Failed to obtain initial battle from poke-env.")
        return battle

    def _ensure_battle(self) -> Battle:
        battle = self._client.battle
        if battle is None:
            raise RuntimeError("No active battle. Call reset() first.")
        return battle

    def _terminal_from_truncation(self, battle: Battle) -> Tuple[str, float, bool, Dict[str, Any]]:
        obs = format_battle_state(battle, self._opponent_history)
        info: Dict[str, Any] = {
            "turn": battle.turn,
            "battle_finished": battle.finished,
            "reason": "max_steps",
        }
        return obs, self.config.max_steps_termination_penalty, True, info

    @staticmethod
    def _to_battle_order(
        parsed: ActionJSON,
        valid_actions: list[ActionOption],
        battle: Battle,
    ) -> BattleOrder:
        from poke_env.player import Player as PlayerCls

        for opt in valid_actions:
            if opt.action_type == parsed.action and opt.choice == parsed.choice:
                if opt.action_type == "move" and opt.move is not None:
                    return PlayerCls.create_order(opt.move)
                if opt.action_type == "switch" and opt.pokemon is not None:
                    return PlayerCls.create_order(opt.pokemon)
        raise ValueError(f"Could not map parsed action {parsed} to a BattleOrder")
src/smogon_rl/pokeenv_client.py ADDED
@@ -0,0 +1,304 @@
from __future__ import annotations

import asyncio
import threading
import time
from dataclasses import dataclass, field
from typing import Optional

from poke_env.environment.battle import Battle
from poke_env.player import Player, RandomPlayer
from poke_env.player.battle_order import BattleOrder
from poke_env.ps_client.server_configuration import LocalhostServerConfiguration

from .config import DEFAULT_BATTLE_FORMAT, EnvConfig


class RLPlayer(Player):
    """Player controlled externally via an asyncio queue of BattleOrders."""

    def __init__(self, action_queue: "asyncio.Queue[BattleOrder]", **kwargs) -> None:
        super().__init__(**kwargs)
        self._action_queue: "asyncio.Queue[BattleOrder]" = action_queue

    async def choose_move(self, battle: Battle) -> BattleOrder:
        return await self._action_queue.get()


@dataclass
class PokeEnvClient:
    """Asynchronous client that manages poke-env battles in a background loop.

    Players are created ONCE when the loop starts and reused across battles to
    avoid Showdown nametaken errors from zombie connections.
    """

    config: EnvConfig

    def __post_init__(self) -> None:
        self._loop: Optional[asyncio.AbstractEventLoop] = None
        self._thread: Optional[threading.Thread] = None
        self._action_queue: Optional["asyncio.Queue[BattleOrder]"] = None
        self._rl_player: Optional[RLPlayer] = None
        self._opponent: Optional[RandomPlayer] = None
        self._battle_task: Optional[asyncio.Future] = None
        # Snapshot of existing battle tags before we request a new battle.
        self._known_battle_tags: set[str] = set()
        self._awaiting_new_battle: bool = False
        # Stored reference to the battle we are in (set when .battle is read).
        # Used for forfeit so we always target the right battle.
        self._current_battle: Optional[Battle] = None

    def _log(self, message: str) -> None:
        if self.config.verbose_logging:
            print(f"[PokeEnvClient] {message}", flush=True)

    # -------------------------------------------------------------------------
    # Event loop management
    # -------------------------------------------------------------------------

    def start(self) -> None:
        """Start the background asyncio loop and create players (once)."""
        if self._loop is not None:
            return

        loop = asyncio.new_event_loop()

        def _run_loop() -> None:
            asyncio.set_event_loop(loop)
            loop.run_forever()

        thread = threading.Thread(target=_run_loop, daemon=True)
        thread.start()

        self._loop = loop
        self._thread = thread
        self._log("Background event loop started.")

        # Create players once; they stay connected for the lifetime of this env.
        self._action_queue = asyncio.Queue()
        fmt = self.config.battle_format or DEFAULT_BATTLE_FORMAT

        async def _create_players() -> None:
            self._rl_player = RLPlayer(
                action_queue=self._action_queue,
                battle_format=fmt,
                server_configuration=LocalhostServerConfiguration,
            )
            self._opponent = RandomPlayer(
                battle_format=fmt,
                server_configuration=LocalhostServerConfiguration,
            )

        future = asyncio.run_coroutine_threadsafe(_create_players(), loop)
        future.result(timeout=15.0)
        # Give the server a moment to register both connections.
        time.sleep(1.0)
        self._log("Players created and connected.")

    def stop(self) -> None:
        """Stop the background loop and clean up."""
        if self._loop is None:
            return
        self._loop.call_soon_threadsafe(self._loop.stop)
        if self._thread is not None:
            self._thread.join(timeout=5.0)
        self._loop = None
        self._thread = None
        self._battle_task = None
        self._rl_player = None
        self._opponent = None
        self._action_queue = None
        self._known_battle_tags = set()
        self._awaiting_new_battle = False
        self._current_battle = None
        self._log("Background event loop stopped.")

    def restart(self) -> None:
        """Hard-restart loop + players to recover from stuck/cancelled battles."""
        self._log("Restarting client event loop and players.")
        self.stop()
        self.start()

    # -------------------------------------------------------------------------
124
+ # Battle lifecycle
125
+ # -------------------------------------------------------------------------
126
+
127
+ def forfeit_current_battle(self) -> None:
128
+ """Forfeit the current Showdown battle if it is still in progress.
129
+
130
+ Must be called before start_new_battle() when the env ends a battle early
131
+ (e.g. due to min_battle_reward) so the player is freed for the next battle.
132
+ """
133
+ if self._loop is None or self._rl_player is None:
134
+ return
135
+ # Use stored battle so we forfeit the one we were in, not whatever .battle returns now.
136
+ battle = self._current_battle if self._current_battle is not None else self.battle
137
+ if battle is None or battle.finished:
138
+ return
139
+
140
+ room = battle.battle_tag
141
+
142
+ async def _do_forfeit() -> None:
143
+ try:
144
+ await self._rl_player.send_message("/forfeit", room)
145
+ except Exception:
146
+ pass
147
+
148
+ try:
149
+ fut = asyncio.run_coroutine_threadsafe(_do_forfeit(), self._loop)
150
+ fut.result(timeout=5.0)
151
+ except Exception:
152
+ pass
153
+ # Give the server time to end the battle and free both players.
154
+ time.sleep(1.5)
155
+ self._current_battle = None
156
+ self._log("Forfeited current battle.")
157
+
158
+ def start_new_battle(self) -> None:
159
+ """Launch a new battle using the already-connected players."""
160
+ if self._loop is None:
161
+ self.start()
162
+ assert self._loop is not None
163
+ assert self._rl_player is not None
164
+ assert self._opponent is not None
165
+
166
+ # Forfeit any ongoing Showdown battle before starting a new one so the
167
+ # player is not stuck mid-battle when battle_against is called again.
168
+ self.forfeit_current_battle()
169
+
170
+ # Let the previous battle task finish cleanly (server will end battle
171
+ # after forfeit). If it does not settle, hard-restart the client.
172
+ restart_required = False
173
+ if self._battle_task is not None and not self._battle_task.done():
174
+ try:
175
+ self._battle_task.result(timeout=25.0)
176
+ except Exception:
177
+ self._battle_task.cancel()
178
+ self._log("Previous battle task timed out or failed; requesting client restart.")
179
+ restart_required = True
180
+ else:
181
+ self._log("Previous battle task finished.")
182
+
183
+ if restart_required:
184
+ # Hard recovery path: refresh websocket connections and players.
185
+ self.restart()
186
+ assert self._loop is not None
187
+ assert self._rl_player is not None
188
+ assert self._opponent is not None
189
+
190
+ self._current_battle = None # Will be set when the new battle appears.
191
+
192
+ # Let the server fully free both players before we start the next battle.
193
+ time.sleep(2.0)
194
+
195
+ # Fresh action queue for this battle.
196
+ self._action_queue = asyncio.Queue()
197
+ self._rl_player._action_queue = self._action_queue
198
+
199
+ # Record current battle tags so .battle can wait for a genuinely new one.
200
+ self._known_battle_tags = set(self._rl_player.battles.keys())
201
+ self._awaiting_new_battle = True
202
+
203
+ async def _run_battle() -> None:
204
+ await self._rl_player.battle_against(self._opponent, n_battles=1)
205
+
206
+ self._battle_task = asyncio.run_coroutine_threadsafe(
207
+ _run_battle(), self._loop
208
+ )
209
+ self._log(
210
+ f"Launching new battle in format "
211
+ f"{self.config.battle_format or DEFAULT_BATTLE_FORMAT}."
212
+ )
213
+ time.sleep(self.config.poll_interval_seconds)
214
+
215
+ @property
216
+ def battle(self) -> Optional[Battle]:
217
+ """Return the current Battle for this run, or None if not started yet."""
218
+ if self._rl_player is None or not self._rl_player.battles:
219
+ return None
220
+
221
+ # During reset(), wait for a battle tag that did not exist before
222
+ # start_new_battle() was called.
223
+ if self._awaiting_new_battle:
224
+ unseen = [
225
+ b
226
+ for tag, b in self._rl_player.battles.items()
227
+ if tag not in self._known_battle_tags
228
+ ]
229
+ if not unseen:
230
+ return None
231
+ active_unseen = [b for b in unseen if not b.finished]
232
+ b = active_unseen[-1] if active_unseen else unseen[-1]
233
+ self._awaiting_new_battle = False
234
+ self._current_battle = b
235
+ return b
236
+
237
+ battles = list(self._rl_player.battles.values())
238
+ active = [b for b in battles if not b.finished]
239
+ if active:
240
+ b = active[-1]
241
+ self._current_battle = b
242
+ return b
243
+ # All finished — return the latest one (covers the case where the battle
244
+ # ended before we got a chance to poll it).
245
+ b = battles[-1]
246
+ self._current_battle = b
247
+ return b
248
+
249
+ def send_action(self, order: BattleOrder) -> None:
250
+ """Submit an action for the RL player to execute."""
251
+ if self._loop is None or self._action_queue is None:
252
+ raise RuntimeError("PokeEnvClient has not been started.")
253
+
254
+ async def _enqueue() -> None:
255
+ assert self._action_queue is not None
256
+ await self._action_queue.put(order)
257
+
258
+ asyncio.run_coroutine_threadsafe(_enqueue(), self._loop)
259
+ self._log("Submitted action to RLPlayer queue.")
260
+
261
+ def wait_for_battle_update(self, previous_turn: int) -> Optional[Battle]:
262
+ """Block until the battle advances to a new turn or ends."""
263
+ start_time = time.time()
264
+ heartbeat_every = max(self.config.poll_heartbeat_seconds, self.config.poll_interval_seconds)
265
+ next_heartbeat_at = start_time + heartbeat_every
266
+ while True:
267
+ battle = self.battle
268
+ if battle is None:
269
+ now = time.time()
270
+ if now > next_heartbeat_at:
271
+ elapsed = now - start_time
272
+ self._log(
273
+ f"Still waiting for battle object "
274
+ f"({elapsed:.1f}s elapsed, previous_turn={previous_turn})."
275
+ )
276
+ next_heartbeat_at = now + heartbeat_every
277
+ if now - start_time > self.config.open_timeout:
278
+ self._log("Timed out waiting for initial battle object.")
279
+ return None
280
+ time.sleep(self.config.poll_interval_seconds)
281
+ continue
282
+
283
+ if battle.finished or battle.turn > previous_turn:
284
+ self._log(
285
+ f"Battle update received: turn={battle.turn}, finished={battle.finished}."
286
+ )
287
+ return battle
288
+
289
+ now = time.time()
290
+ if now > next_heartbeat_at:
291
+ elapsed = now - start_time
292
+ self._log(
293
+ f"Waiting for turn advance: current_turn={battle.turn}, "
294
+ f"previous_turn={previous_turn}, elapsed={elapsed:.1f}s."
295
+ )
296
+ next_heartbeat_at = now + heartbeat_every
297
+
298
+ if now - start_time > self.config.open_timeout:
299
+ self._log(
300
+ f"Turn-advance wait timed out at turn={battle.turn}; returning last state."
301
+ )
302
+ return battle
303
+
304
+ time.sleep(self.config.poll_interval_seconds)
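
The background-loop pattern above — a daemon thread running an asyncio loop, with synchronous callers submitting coroutines via `asyncio.run_coroutine_threadsafe` — can be exercised without poke-env or a Showdown server. A minimal standalone sketch (the queue contents and action string are illustrative, not part of the real `BattleOrder` API):

```python
import asyncio
import threading

# Background event loop in a daemon thread, as in PokeEnvClient.start().
loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def _make_queue() -> "asyncio.Queue[str]":
    # Create the queue on the loop's own thread so older Python versions
    # bind it to the correct event loop.
    return asyncio.Queue()

queue = asyncio.run_coroutine_threadsafe(_make_queue(), loop).result(timeout=5.0)

async def consume() -> str:
    # Stands in for RLPlayer.choose_move awaiting the next BattleOrder.
    return await queue.get()

pending = asyncio.run_coroutine_threadsafe(consume(), loop)
# Stands in for PokeEnvClient.send_action enqueueing an order.
asyncio.run_coroutine_threadsafe(queue.put("move thunderbolt"), loop)
result = pending.result(timeout=5.0)
print(result)
loop.call_soon_threadsafe(loop.stop)
```

The same `run_coroutine_threadsafe` bridge is what lets the synchronous `reset()`/`step()` interface drive poke-env's fully asynchronous players.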
src/smogon_rl/reward.py ADDED
@@ -0,0 +1,320 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass, field
4
+ from typing import Dict, List, Optional
5
+
6
+ from poke_env.environment.battle import Battle
7
+ from poke_env.environment.pokemon import Pokemon
8
+
9
+ from .action_space import ActionJSON
10
+ from .state_formatter import hp_fraction_to_percent
11
+
12
+ # Hefty penalty when the model outputs an illegal action (e.g. a hallucinated Pokemon).
13
+ # Used during rollout collection; recorded as collected_reward so GRPO learns to avoid illegal outputs.
14
+ ILLEGAL_ACTION_PENALTY = -10.0
15
+
16
+
17
+ @dataclass
18
+ class BattleStateSummary:
19
+ self_team_hp_percent: float
20
+ opp_team_hp_percent: float
21
+ self_fainted: int
22
+ opp_fainted: int
23
+ self_statuses: Dict[str, Optional[str]]
24
+ opp_statuses: Dict[str, Optional[str]]
25
+ self_stat_stages: Dict[str, Dict[str, int]]
26
+ opp_stat_stages: Dict[str, Dict[str, int]]
27
+ opponent_passive_hits: int
28
+
29
+
30
+ @dataclass
31
+ class RewardTrackingState:
32
+ healing_reward_used: float = 0.0
33
+ per_pokemon_setup_reward_used: Dict[str, float] = field(default_factory=dict)
34
+ passive_hits_total: int = 0
35
+
36
+
37
+ def _team_hp_and_faints(team: Dict[str, Pokemon]) -> tuple[float, int]:
38
+ total_hp = 0.0
39
+ total_max_hp = 0.0
40
+ fainted = 0
41
+ for mon in team.values():
42
+ if mon.max_hp is None or mon.max_hp <= 0:
43
+ continue
44
+ total_hp += max(0, mon.current_hp)
45
+ total_max_hp += mon.max_hp
46
+ if mon.fainted:
47
+ fainted += 1
48
+ if total_max_hp <= 0:
49
+ return 0.0, fainted
50
+ return (total_hp / total_max_hp) * 100.0, fainted
51
+
52
+
53
+ def _collect_statuses(team: Dict[str, Pokemon]) -> Dict[str, Optional[str]]:
54
+ return {
55
+ mon.species or key: (str(mon.status) if mon.status is not None else None)
56
+ for key, mon in team.items()
57
+ }
58
+
59
+
60
+ def _collect_stat_stages(team: Dict[str, Pokemon]) -> Dict[str, Dict[str, int]]:
61
+ return {mon.species or key: dict(mon.boosts) for key, mon in team.items()}
62
+
63
+
64
+ def _passive_events_in_turn(events: list, opponent_role: str) -> int:
65
+ """Count passive-damage hits for the opponent in one turn's raw event list."""
66
+ count = 0
67
+ for event in events:
68
+ if not event or event[0] != "-damage":
69
+ continue
70
+ if len(event) < 2:
71
+ continue
72
+ if not event[1].startswith(opponent_role):
73
+ continue
74
+ # "[from]" in any trailing field marks an external/passive damage source:
75
+ # e.g. "[from] brn", "[from] Stealth Rock", "[from] Leech Seed", etc.
76
+ if any("[from]" in part for part in event[2:]):
77
+ count += 1
78
+ return count
79
+
80
+
81
+ def count_new_passive_hits_for_turn(battle: Battle, turn_number: int) -> int:
82
+ """Count passive damage hits the opponent took on a single, specific turn.
83
+
84
+ Designed for O(k) per step use: only the events from `turn_number` are
85
+ scanned. The caller accumulates the running total across turns.
86
+
87
+ Parameters
88
+ ----------
89
+ battle:
90
+ The current poke-env Battle object.
91
+ turn_number:
92
+ The turn whose Observation.events should be inspected (usually the
93
+ turn that just resolved, i.e., the value of `battle.turn` before
94
+ the action was submitted).
95
+ """
96
+ obs = battle.observations.get(turn_number)
97
+ if obs is None:
98
+ return 0
99
+ opponent_role = "p2" if battle.player_role == "p1" else "p1"
100
+ return _passive_events_in_turn(obs.events, opponent_role)
101
+
102
+
103
+ def _count_passive_hits_on_opponent(battle: Battle) -> int:
104
+ """Full-scan fallback: count cumulative passive hits across all observed turns.
105
+
106
+ This is O(total events) and should only be called once on reset() to
107
+ establish a baseline. Per-step increments should use
108
+ `count_new_passive_hits_for_turn` instead.
109
+ """
110
+ opponent_role = "p2" if battle.player_role == "p1" else "p1"
111
+ count = 0
112
+ for obs in battle.observations.values():
113
+ count += _passive_events_in_turn(obs.events, opponent_role)
114
+ return count
115
+
116
+
117
+ def summarize_battle_state(battle: Battle, cumulative_passive_hits: int = 0) -> BattleStateSummary:
118
+ """Snapshot the current battle state into a plain dataclass.
119
+
120
+ Parameters
121
+ ----------
122
+ battle:
123
+ The live poke-env Battle object.
124
+ cumulative_passive_hits:
125
+ Running total of passive damage hits the opponent has taken this
126
+ battle, maintained by the caller (e.g. PokemonShowdownEnv) using
127
+ `count_new_passive_hits_for_turn` to keep each step O(k).
128
+ Defaults to 0 for the initial state on reset().
129
+ """
130
+ self_hp, self_fainted = _team_hp_and_faints(battle.team)
131
+ opp_hp, opp_fainted = _team_hp_and_faints(battle.opponent_team)
132
+ self_statuses = _collect_statuses(battle.team)
133
+ opp_statuses = _collect_statuses(battle.opponent_team)
134
+ self_stats = _collect_stat_stages(battle.team)
135
+ opp_stats = _collect_stat_stages(battle.opponent_team)
136
+ return BattleStateSummary(
137
+ self_team_hp_percent=self_hp,
138
+ opp_team_hp_percent=opp_hp,
139
+ self_fainted=self_fainted,
140
+ opp_fainted=opp_fainted,
141
+ self_statuses=self_statuses,
142
+ opp_statuses=opp_statuses,
143
+ self_stat_stages=self_stats,
144
+ opp_stat_stages=opp_stats,
145
+ opponent_passive_hits=cumulative_passive_hits,
146
+ )
147
+
148
+
149
+ def _status_penalty(prev_statuses: Dict[str, Optional[str]], curr_statuses: Dict[str, Optional[str]]) -> float:
150
+ penalty = 0.0
151
+ for key, curr in curr_statuses.items():
152
+ prev = prev_statuses.get(key)
153
+ if prev == curr:
154
+ continue
155
+ if curr is None:
156
+ # Could be a status cure handled elsewhere.
157
+ continue
158
+ code = curr.lower()
159
+ if code in {"brn", "psn", "tox"}:
160
+ penalty -= 0.5
161
+ elif code in {"par", "frz", "slp", "conf"}:
162
+ penalty -= 1.0
163
+ return penalty
164
+
165
+
166
+ def _healing_reward(prev_hp: float, curr_hp: float, trackers: RewardTrackingState) -> float:
167
+ if curr_hp <= prev_hp:
168
+ return 0.0
169
+ healed = curr_hp - prev_hp
170
+ raw = (healed / 10.0) # +1.0 per 10% healed
171
+ remaining_cap = max(0.0, 3.0 - trackers.healing_reward_used)
172
+ reward = min(raw, remaining_cap)
173
+ trackers.healing_reward_used += reward
174
+ return reward
175
+
176
+
177
+ def _setup_reward(
178
+ prev_stats: Dict[str, Dict[str, int]],
179
+ curr_stats: Dict[str, Dict[str, int]],
180
+ active: Pokemon,
181
+ trackers: RewardTrackingState,
182
+ ) -> float:
183
+ active_key = active.species or "active"
184
+ prev = prev_stats.get(active_key, {})
185
+ curr = curr_stats.get(active_key, {})
186
+ delta_stages = 0
187
+ for stat, curr_stage in curr.items():
188
+ prev_stage = prev.get(stat, 0)
189
+ if curr_stage > prev_stage:
190
+ delta_stages += curr_stage - prev_stage
191
+ if delta_stages <= 0:
192
+ return 0.0
193
+ if hp_fraction_to_percent(active.current_hp_fraction) <= 50.0:
194
+ return 0.0
195
+
196
+ raw = 0.5 * delta_stages
197
+ used = trackers.per_pokemon_setup_reward_used.get(active_key, 0.0)
198
+ remaining_cap = max(0.0, 2.0 - used)
199
+ reward = min(raw, remaining_cap)
200
+ trackers.per_pokemon_setup_reward_used[active_key] = used + reward
201
+ return reward
202
+
203
+
204
+ def _opponent_setup_penalty(
205
+ prev_stats: Dict[str, Dict[str, int]],
206
+ curr_stats: Dict[str, Dict[str, int]],
207
+ ) -> float:
208
+ penalty = 0.0
209
+ for key, curr in curr_stats.items():
210
+ prev = prev_stats.get(key, {})
211
+ for stat, curr_stage in curr.items():
212
+ prev_stage = prev.get(stat, 0)
213
+ if curr_stage > prev_stage:
214
+ penalty -= 0.5 * (curr_stage - prev_stage)
215
+ return penalty
216
+
217
+
218
+ def _passive_damage_reward(
219
+ prev_hits: int,
220
+ curr_hits: int,
221
+ trackers: RewardTrackingState,
222
+ ) -> float:
223
+ if curr_hits <= prev_hits:
224
+ return 0.0
225
+ delta = curr_hits - prev_hits
226
+ trackers.passive_hits_total += delta
227
+ # Note: scales with the cumulative hit total rather than the per-turn
+ # delta, so each new batch of passive chip is worth slightly more
+ # as hazards and status accumulate over the battle.
+ return 0.01 * trackers.passive_hits_total
228
+
229
+
230
+ def _damage_rewards(prev: BattleStateSummary, curr: BattleStateSummary) -> float:
231
+ reward = 0.0
232
+ # Damage dealt: +1.0 per 10% opponent HP reduced
233
+ if curr.opp_team_hp_percent < prev.opp_team_hp_percent:
234
+ delta = prev.opp_team_hp_percent - curr.opp_team_hp_percent
235
+ reward += delta / 10.0
236
+ # Damage taken: -1.0 per 10% self HP lost
237
+ if curr.self_team_hp_percent < prev.self_team_hp_percent:
238
+ delta = prev.self_team_hp_percent - curr.self_team_hp_percent
239
+ reward -= delta / 10.0
240
+ return reward
241
+
242
+
243
+ def _knockout_rewards(prev: BattleStateSummary, curr: BattleStateSummary) -> float:
244
+ reward = 0.0
245
+ if curr.opp_fainted > prev.opp_fainted:
246
+ reward += 3.0 * (curr.opp_fainted - prev.opp_fainted)
247
+ if curr.self_fainted > prev.self_fainted:
248
+ reward -= 3.0 * (curr.self_fainted - prev.self_fainted)
249
+ return reward
250
+
251
+
252
+ def calculate_reward(
253
+ prev_state: BattleStateSummary,
254
+ curr_state: BattleStateSummary,
255
+ action: ActionJSON,
256
+ trackers: RewardTrackingState,
257
+ active: Optional[Pokemon] = None,
258
+ opponent_active: Optional[Pokemon] = None,
259
+ move_was_super_effective: bool = False,
260
+ move_hit: bool = True,
261
+ move_was_immune: bool = False,
262
+ team_status_cured: bool = False,
263
+ ) -> float:
264
+ """Compute shaped reward between two consecutive battle summaries.
265
+
266
+ The additional keyword arguments allow the caller to provide extra context from
267
+ the last action (type effectiveness, accuracy result, status cures) that are
268
+ not fully recoverable from the static battle snapshots alone.
269
+ """
270
+ reward = 0.0
271
+
272
+ # Core mechanics
273
+ reward += _damage_rewards(prev_state, curr_state)
274
+ reward += _knockout_rewards(prev_state, curr_state)
275
+
276
+ # Strategic nudges: type effectiveness and accuracy
277
+ if action.action == "move":
278
+ if move_was_super_effective:
279
+ reward += 0.5
280
+ if move_was_immune:
281
+ reward -= 1.0
282
+ if not move_hit:
283
+ reward -= 0.25
284
+
285
+ # Healing
286
+ reward += _healing_reward(
287
+ prev_state.self_team_hp_percent,
288
+ curr_state.self_team_hp_percent,
289
+ trackers,
290
+ )
291
+
292
+ # Status cures (e.g., Aromatherapy)
293
+ if team_status_cured:
294
+ reward += 1.0
295
+
296
+ # Setup sweeping (self) and opponent setup
297
+ if active is not None:
298
+ reward += _setup_reward(
299
+ prev_state.self_stat_stages,
300
+ curr_state.self_stat_stages,
301
+ active,
302
+ trackers,
303
+ )
304
+ reward += _opponent_setup_penalty(
305
+ prev_state.opp_stat_stages,
306
+ curr_state.opp_stat_stages,
307
+ )
308
+
309
+ # Passive damage / hazards
310
+ reward += _passive_damage_reward(
311
+ prev_state.opponent_passive_hits,
312
+ curr_state.opponent_passive_hits,
313
+ trackers,
314
+ )
315
+
316
+ # Status afflictions
317
+ reward += _status_penalty(prev_state.self_statuses, curr_state.self_statuses)
318
+
319
+ return reward
320
+
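
The damage and knockout terms above are simple linear shapes. As a sanity check, the arithmetic can be replicated standalone (this re-derives the constants used in `_damage_rewards` and `_knockout_rewards` rather than importing the module):

```python
def damage_reward(prev_self_hp, curr_self_hp, prev_opp_hp, curr_opp_hp):
    # +1.0 per 10% of opponent team HP removed, -1.0 per 10% of own HP lost.
    reward = 0.0
    if curr_opp_hp < prev_opp_hp:
        reward += (prev_opp_hp - curr_opp_hp) / 10.0
    if curr_self_hp < prev_self_hp:
        reward -= (prev_self_hp - curr_self_hp) / 10.0
    return reward

def knockout_reward(prev_self_ko, curr_self_ko, prev_opp_ko, curr_opp_ko):
    # +/-3.0 per faint, mirroring _knockout_rewards.
    return 3.0 * max(0, curr_opp_ko - prev_opp_ko) - 3.0 * max(0, curr_self_ko - prev_self_ko)

# One turn: dealt 25% team damage, took 10%, scored one KO.
total = damage_reward(100.0, 90.0, 100.0, 75.0) + knockout_reward(0, 0, 0, 1)
print(total)  # 2.5 - 1.0 + 3.0 = 4.5
```

Keeping these core terms an order of magnitude larger than the shaping nudges (0.25–1.0) biases the policy toward winning exchanges rather than farming side rewards.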
src/smogon_rl/state_formatter.py ADDED
@@ -0,0 +1,181 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass, field
4
+ from typing import Dict, List, Optional
5
+
6
+ from poke_env.environment.battle import Battle
7
+ from poke_env.environment.pokemon import Pokemon
8
+
9
+
10
+ @dataclass
11
+ class OpponentMonHistory:
12
+ name: str
13
+ last_known_hp_percent: float
14
+ status: Optional[str]
15
+ revealed_moves: List[str] = field(default_factory=list)
16
+ revealed_item: Optional[str] = None
17
+ revealed_ability: Optional[str] = None
18
+
19
+
20
+ @dataclass
21
+ class OpponentHistoryTracker:
22
+ revealed: Dict[str, OpponentMonHistory] = field(default_factory=dict)
23
+
24
+ def update_from_battle(self, battle: Battle) -> None:
25
+ for mon in battle.opponent_team.values():
26
+ if not mon.species:
27
+ continue
28
+ key = mon.species
29
+ entry = self.revealed.get(
30
+ key,
31
+ OpponentMonHistory(
32
+ name=mon.species,
33
+ last_known_hp_percent=hp_fraction_to_percent(mon.current_hp_fraction),
34
+ status=str(mon.status) if mon.status is not None else None,
35
+ ),
36
+ )
37
+ entry.last_known_hp_percent = hp_fraction_to_percent(mon.current_hp_fraction)
38
+ entry.status = str(mon.status) if mon.status is not None else None
39
+
40
+ for move in mon.moves.values():
41
+ move_name = move.id
42
+ if move_name not in entry.revealed_moves:
43
+ entry.revealed_moves.append(move_name)
44
+
45
+ if mon.item is not None:
46
+ entry.revealed_item = mon.item
47
+ if mon.ability is not None:
48
+ entry.revealed_ability = mon.ability
49
+
50
+ self.revealed[key] = entry
51
+
52
+
53
+ def hp_fraction_to_percent(fraction: float | None) -> float:
54
+ if fraction is None:
55
+ return 0.0
56
+ return max(0.0, min(1.0, float(fraction))) * 100.0
57
+
58
+
59
+ def _format_stat_modifiers(pokemon: Pokemon) -> str:
60
+ parts: List[str] = []
61
+ for stat, stage in pokemon.boosts.items():
62
+ if stage == 0:
63
+ continue
64
+ sign = "+" if stage > 0 else ""
65
+ parts.append(f"{stat.capitalize()} {sign}{stage}")
66
+ return ", ".join(parts) if parts else "None"
67
+
68
+
69
+ def _estimate_speed_range(pokemon: Pokemon) -> str:
70
+ base_speed = pokemon.base_stats.get("spe", 0)
71
+ if base_speed <= 0:
72
+ return "Unknown"
73
+
74
+ level = 100  # Assumes a level-100 format; use pokemon.level for variable-level formats.
75
+ # Floor: 0 IVs/EVs with a hindering nature; ceiling: 31 IVs, 252 EVs
+ # (floor(252 / 4) = 63 stat points) with a boosting nature.
+ min_speed = int((((2 * base_speed) * level) / 100 + 5) * 0.9)
76
+ max_speed = int((((2 * base_speed + 31 + (252 // 4)) * level) / 100 + 5) * 1.1)
77
+ return f"{min_speed}-{max_speed}"
78
+
79
+
80
+ def _format_pokemon_line(pokemon: Pokemon) -> str:
81
+ hp = hp_fraction_to_percent(pokemon.current_hp_fraction)
82
+ status = str(pokemon.status) if pokemon.status is not None else "OK"
83
+ item = pokemon.item or "?"
84
+ return f"- {pokemon.species or '?'} HP:{hp:.0f}% {status} Item:{item}"
85
+
86
+
87
+ def _format_moveset_section(pokemon: Pokemon) -> str:
88
+ if not pokemon.moves:
89
+ return " Moves: [unknown]"
90
+ parts = []
91
+ for move in pokemon.moves.values():
92
+ bp = move.base_power or 0
93
+ t = move.type.name[0] if move.type is not None else "?"
94
+ parts.append(f"{move.id}({t}{bp})")
95
+ return " Moves: " + " | ".join(parts)
96
+
97
+
98
+ def format_battle_state(battle: Battle, opponent_history: OpponentHistoryTracker) -> str:
99
+ """Format the full battle state into a markdown string for the LLM.
100
+
101
+ Structure:
102
+ - Part A: Active field (self and opponent).
103
+ - Part B: Full self roster and movesets.
104
+ - Part C: Opponent history (revealed bench, revealed info).
105
+ """
106
+ opponent_history.update_from_battle(battle)
107
+
108
+ lines: List[str] = []
109
+
110
+ # ------------------------------------------------------------------ Part A
111
+ lines.append("## Part A: Active Field")
112
+
113
+ # Self active
114
+ self_active = battle.active_pokemon
115
+ if self_active is not None:
116
+ self_hp = hp_fraction_to_percent(self_active.current_hp_fraction)
117
+ self_status = (
118
+ str(self_active.status) if self_active.status is not None else "Healthy"
119
+ )
120
+ self_ability = self_active.ability or "Unknown"
121
+ self_item = self_active.item or "None"
122
+ self_mods = _format_stat_modifiers(self_active)
123
+ lines.append("### Active Self")
124
+ lines.append(
125
+ f"- Name: {self_active.species or 'Unknown'}\n"
126
+ f"- HP: {self_hp:.1f}%\n"
127
+ f"- Status: {self_status}\n"
128
+ f"- Ability: {self_ability}\n"
129
+ f"- Item: {self_item}\n"
130
+ f"- Stat Modifiers: {self_mods}"
131
+ )
132
+ else:
133
+ lines.append("### Active Self\n- None")
134
+
135
+ # Opponent active
136
+ opp_active = battle.opponent_active_pokemon
137
+ if opp_active is not None:
138
+ opp_hp = hp_fraction_to_percent(opp_active.current_hp_fraction)
139
+ opp_status = (
140
+ str(opp_active.status) if opp_active.status is not None else "Healthy"
141
+ )
142
+ opp_speed_range = _estimate_speed_range(opp_active)
143
+ lines.append("### Active Opponent")
144
+ lines.append(
145
+ f"- Name: {opp_active.species or 'Unknown'}\n"
146
+ f"- HP: {opp_hp:.1f}%\n"
147
+ f"- Status: {opp_status}\n"
148
+ f"- Speed Range: {opp_speed_range}"
149
+ )
150
+ else:
151
+ lines.append("### Active Opponent\n- None")
152
+
153
+ # ------------------------------------------------------------------ Part B
154
+ lines.append("\n## Part B: Full Self Roster")
155
+ if not battle.team:
156
+ lines.append("- [Unknown team]")
157
+ else:
158
+ for mon in battle.team.values():
159
+ lines.append(_format_pokemon_line(mon))
160
+ lines.append(_format_moveset_section(mon))
161
+
162
+ # ------------------------------------------------------------------ Part C
163
+ lines.append("\n## Part C: Opponent History")
164
+ if not opponent_history.revealed:
165
+ lines.append("- No opponent Pokémon revealed yet.")
166
+ else:
167
+ for entry in opponent_history.revealed.values():
168
+ lines.append(
169
+ f"- {entry.name} | Last HP: {entry.last_known_hp_percent:.1f}% | "
170
+ f"Status: {entry.status or 'Healthy'}"
171
+ )
172
+ if entry.revealed_moves:
173
+ moves = ", ".join(entry.revealed_moves)
174
+ lines.append(f" - Revealed moves: {moves}")
175
+ if entry.revealed_item:
176
+ lines.append(f" - Revealed item: {entry.revealed_item}")
177
+ if entry.revealed_ability:
178
+ lines.append(f" - Revealed ability: {entry.revealed_ability}")
179
+
180
+ return "\n".join(lines)
181
+
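
The speed bracket produced by `_estimate_speed_range` can be checked standalone. This sketch repeats its arithmetic (0 IVs/EVs with a hindering nature for the floor, 31 IVs and 252 EVs with a boosting nature for the ceiling, assuming level 100) — a rough bracket for the prompt, not an exact stat calculation:

```python
def estimate_speed_range(base_speed: int, level: int = 100) -> tuple[int, int]:
    # Mirrors _estimate_speed_range above; EVs contribute floor(252 / 4) = 63
    # stat points, and the 0.9 / 1.1 factors model hindering/boosting natures.
    if base_speed <= 0:
        raise ValueError("unknown base speed")
    min_speed = int((((2 * base_speed) * level) / 100 + 5) * 0.9)
    max_speed = int((((2 * base_speed + 31 + 252 // 4) * level) / 100 + 5) * 1.1)
    return min_speed, max_speed

print(estimate_speed_range(100))  # a base-100 Speed Pokemon -> (184, 328)
```

Giving the LLM a range instead of a point estimate keeps the prompt honest about what is actually observable: the opponent's investment and nature are hidden information.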
trainer.ipynb ADDED
The diff for this file is too large to render. See raw diff