Spaces: yashppawar/permit-pathfinder

yashppawar committed 11 days ago

Commit 655a617 (verified) · 1 parent: 45054ae

Upload folder using huggingface_hub

Files changed (4)

README.md +186 -188
inference.py +27 -8
models.py +3 -2
server/permit_env_environment.py +74 -26

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
-title: Permit Env Environment Server
-emoji: 🎥
 colorFrom: yellow
 colorTo: purple
 sdk: docker
@@ -9,247 +9,245 @@ app_port: 8000
 base_path: /web
 tags:
   - openenv
 ---
-# Permit Env Environment
-A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
-## Quick Start
-The simplest way to use the Permit Env environment is through the `PermitEnv` class:
-```python
-from permit_env import PermitAction, PermitEnv
-try:
-    # Create environment from Docker image
-    permit_envenv = PermitEnv.from_docker_image("permit_env-env:latest")
-    # Reset
-    result = permit_envenv.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Send multiple messages
-    messages = ["Hello, World!", "Testing echo", "Final message"]
-    for msg in messages:
-        result = permit_envenv.step(PermitAction(message=msg))
-        print(f"Sent: '{msg}'")
-        print(f"  → Echoed: '{result.observation.echoed_message}'")
-        print(f"  → Length: {result.observation.message_length}")
-        print(f"  → Reward: {result.reward}")
-finally:
-    # Always clean up
-    permit_envenv.close()
-```
-That's it! The `PermitEnv.from_docker_image()` method handles:
-- Starting the Docker container
-- Waiting for the server to be ready
-- Connecting to the environment
-- Container cleanup when you call `close()`
-## Building the Docker Image
-Before using the environment, you need to build the Docker image:
-```bash
-# From project root
-docker build -t permit_env-env:latest -f server/Dockerfile .
 ```
-## Deploying to Hugging Face Spaces
-You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
-```bash
-# From the environment directory (where openenv.yaml is located)
-openenv push
-# Or specify options
-openenv push --namespace my-org --private
-```
-The `openenv push` command will:
-1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
-2. Prepare a custom build for Hugging Face Docker space (enables web interface)
-3. Upload to Hugging Face (ensuring you're logged in)
-### Prerequisites
-- Authenticate with Hugging Face: The command will prompt for login if not already authenticated
-### Options
-- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
-- `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
-- `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
-- `--private`: Deploy the space as private (default: public)
-### Examples
-```bash
-# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
-openenv push
-# Push to a specific repository
-openenv push --repo-id my-org/my-env
-# Push with a custom base image
-openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
-# Push as a private space
-openenv push --private
-# Combine options
-openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
 ```
-After deployment, your space will be available at:
-`https://huggingface.co/spaces/<repo-id>`
-The deployed space includes:
-- **Web Interface** at `/web` - Interactive UI for exploring the environment
-- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
-- **Health Check** at `/health` - Container health monitoring
-- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-## Environment Details
-### Action
-**PermitAction**: Contains a single field
-- `message` (str) - The message to echo back
-### Observation
-**PermitObservation**: Contains the echo response and metadata
-- `echoed_message` (str) - The message echoed back
-- `message_length` (int) - Length of the message
-- `reward` (float) - Reward based on message length (length × 0.1)
-- `done` (bool) - Always False for echo environment
-- `metadata` (dict) - Additional info like step count
-### Reward
-The reward is calculated as: `message_length × 0.1`
-- "Hi" → reward: 0.2
-- "Hello, World!" → reward: 1.3
-- Empty message → reward: 0.0
-## Advanced Usage
-### Connecting to an Existing Server
-If you already have a Permit Env environment server running, you can connect directly:
-```python
-from permit_env import PermitEnv
-# Connect to existing server
-permit_envenv = PermitEnv(base_url="<ENV_HTTP_URL_HERE>")
-# Use as normal
-result = permit_envenv.reset()
-result = permit_envenv.step(PermitAction(message="Hello!"))
 ```
-Note: When connecting to an existing server, `permit_envenv.close()` will NOT stop the server.
-### Using the Context Manager
-The client supports context manager usage for automatic connection management:
-```python
-from permit_env import PermitAction, PermitEnv
-# Connect with context manager (auto-connects and closes)
-with PermitEnv(base_url="http://localhost:8000") as env:
-    result = env.reset()
-    print(f"Reset: {result.observation.echoed_message}")
-    # Multiple steps with low latency
-    for msg in ["Hello", "World", "!"]:
-        result = env.step(PermitAction(message=msg))
-        print(f"Echoed: {result.observation.echoed_message}")
 ```
-The client uses WebSocket connections for:
-- **Lower latency**: No HTTP connection overhead per request
-- **Persistent session**: Server maintains your environment state
-- **Efficient for episodes**: Better for many sequential steps
-### Concurrent WebSocket Sessions
-The server supports multiple concurrent WebSocket connections. To enable this,
-modify `server/app.py` to use factory mode:
-```python
-# In server/app.py - use factory mode for concurrent sessions
-app = create_app(
-    PermitEnvironment,  # Pass class, not instance
-    PermitAction,
-    PermitObservation,
-    max_concurrent_envs=4,  # Allow 4 concurrent sessions
-)
-```
-Then multiple clients can connect simultaneously:
-```python
-from permit_env import PermitAction, PermitEnv
-from concurrent.futures import ThreadPoolExecutor
-def run_episode(client_id: int):
-    with PermitEnv(base_url="http://localhost:8000") as env:
-        result = env.reset()
-        for i in range(10):
-            result = env.step(PermitAction(message=f"Client {client_id}, step {i}"))
-        return client_id, result.observation.message_length
-# Run 4 episodes concurrently
-with ThreadPoolExecutor(max_workers=4) as executor:
-    results = list(executor.map(run_episode, range(4)))
-```
-## Development & Testing
-### Direct Environment Testing
-Test the environment logic directly without starting the HTTP server:
-```bash
-# From the server directory
-python3 server/permit_env_environment.py
 ```
-This verifies that:
-- Environment resets correctly
-- Step executes actions properly
-- State tracking works
-- Rewards are calculated correctly
-### Running Locally
-Run the server locally for development:
-```bash
-uvicorn server.app:app --reload
-```
-## Project Structure
-```
-permit_env/
-├── .dockerignore         # Docker build exclusions
-├── __init__.py            # Module exports
-├── README.md              # This file
-├── openenv.yaml           # OpenEnv manifest
-├── pyproject.toml         # Project metadata and dependencies
-├── uv.lock                # Locked dependencies (generated)
-├── client.py              # PermitEnv client
-├── models.py              # Action and Observation models
-└── server/
-    ├── __init__.py        # Server module exports
-    ├── permit_env_environment.py  # Core environment logic
-    ├── app.py             # FastAPI application (HTTP + WebSocket endpoints)
-    └── Dockerfile         # Container image definition
-```

 ---
+title: PermitPathfinder OpenEnv
+emoji: 🏛️
 colorFrom: yellow
 colorTo: purple
 sdk: docker
 base_path: /web
 tags:
   - openenv
+  - rl
+  - agent
+  - planning
+  - real-world
 ---
+# PermitPathfinder
+**PermitPathfinder** is an OpenEnv environment in which an LLM agent opens a
+small business by navigating a stateful municipal permitting system. It is
+a real-world, non-game task: every action maps to something a real small
+business owner has to do (file a license, pay a fee, schedule an
+inspection), and every reward signal corresponds to concrete progress
+toward opening the business.
+The environment is built on top of `openenv-core` using the typed
+`Action` / `Observation` archetype, a FastAPI HTTP server via
+`create_app(...)`, and per-episode randomization so the same task is a
+different puzzle each run.
+---
+## Why this task
+Most RL environments are either toy games (grid worlds, bandits) or pure
+classification. Neither captures the kind of multi-step, constrained,
+partially observable work an agent deployed as a "digital assistant" has
+to do every day. Filing permits is a universally familiar pain point,
+but it's also a rigorous planning problem:
+- **DAG-structured prerequisites:** a health permit requires zoning
+  approval first, a food-service license requires a passed health permit
+  and a passed fire inspection, etc.
+- **Budget constraint:** every permit costs a fee, fees are jittered
+  each episode, and running out of money before all permits are issued
+  ends the episode early.
+- **Irreversible errors:** submitting an un-unlocked permit is "wasted"
+  and subtracts from the final score.
+- **Partial observability (hard tier):** a random "missing document"
+  event can revert a previously-issued permit mid-run, forcing the
+  agent to re-plan.
+---
+## Tasks
+The environment ships with three difficulty tiers, exposed via
+`reset(task_name=...)` and declared in `openenv.yaml`:
+| Task ID | Description | # Permits | Budget (base) | Max Steps |
+|---|---|---|---|---|
+| `easy_foodtruck` | Open a mobile food vendor (flat DAG) | 3 | $500 | 20 |
+| `medium_cafe` | Open a 20-seat neighborhood café (2 dependency chains) | 6 | $1000 | 40 |
+| `hard_restaurant` | Open a full restaurant with bar (10 permits, 3 agencies, cross-deps, missing-doc event) | 10 | $2500 | 70 |
+Each reset jitters the base budget by ±10% and every fee by ±20% (seeded
+by the episode ID + optional `seed` kwarg), and shuffles the permit
+iteration order. A policy that hard-codes a fixed sequence will not
+generalize across resets.
+---
+## Action space
+`PermitAction` is a typed Pydantic model with two fields:
+```python
+class PermitAction(BaseModel):
+    action_type: str          # one of: submit, pay, inspect, query, list, set_task
+    permit_id: Optional[str]  # target permit ID (or task name for set_task)
 ```
+Actions and their semantics:
+| `action_type` | Effect | Legal when |
+|---|---|---|
+| `list` | Returns a message listing permits. Does **not** mutate state. | Always |
+| `query` | Returns a human-readable summary of a single permit (stage, fee, prereqs). | `permit_id` is a real permit |
+| `submit` | Advances a permit from `available` → `approved`. | Permit is `available` |
+| `pay` | Deducts the fee from budget, advances `approved` → `paid`. | Permit is `approved` AND budget ≥ fee |
+| `inspect` | Advances a permit from `paid` → `issued`. | Permit is `paid` |
+| `set_task` | Loads a new task config (legacy mechanism — prefer `reset(task_name=...)`). | Any |
+Any action that fires on an illegal stage, unknown permit, or unknown
+task increments `wasted_submissions` and is penalized in the reward.
+---
+## Observation space
+`PermitObservation` gives the agent everything it needs to plan — but
+deliberately does **not** spell out the next legal action with the
+permit ID pre-filled, forcing the agent to reason about which permit
+to target:
+```python
+class PermitObservation(BaseModel):
+    message: str                              # status text for the last action
+    permits: dict                             # {permit_id: {stage, fee, prereqs, prereqs_met}}
+    budget_remaining: float                   # dollars left
+    wasted_submissions: int                   # count of illegal attempts
+    last_action_error: Optional[str]          # raw error from the last step, or None
+    available_actions: list                   # ACTION TYPES currently legal (no permit_ids)
+    task_name: str                            # current task
+```
+`available_actions` is intentionally a set of *action types* (e.g.
+`["list", "query", "submit"]`), not pre-filled action strings. The agent
+must look up permit IDs from `permits` and decide which one to act on.
+---
+## Reward
+The environment computes a dense partial-credit reward on every step,
+clamped to `[0.0, 1.0]`:
+```
+base = mean( stage_index(p) / 6 for p in permits )         # 0 → 1
+budget_bonus = 0.1 · (budget_remaining / initial_budget) · base
+waste_penalty = min(0.25, 0.02 · wasted_submissions)
+reward = clamp(base + budget_bonus − waste_penalty, 0, 1)
 ```
+The final per-task score emitted by `inference.py` is:
+```
+score = max(rewards_history) − 0.003 · steps_taken
+```
+— peak progress minus a small per-step penalty that rewards fast, clean
+solutions. A run that hits 1.0 in 9 steps outscores a run that hits 1.0
+in 40 steps. Success is declared when `score ≥ 0.85`.
+---
+## Environment variables
+`inference.py` reads standard hackathon env vars, matching the sample:
+| Variable | Purpose | Required? |
+|---|---|---|
+| `API_BASE_URL` | OpenAI-compatible endpoint (LiteLLM proxy or HF router) | No (defaults to HF router) |
+| `MODEL_NAME` | Model identifier; auto-downgrades if the proxy doesn't serve it | No (defaults to `Qwen/Qwen2.5-72B-Instruct`) |
+| `HF_TOKEN` / `API_KEY` | Credential passed to the OpenAI client (`API_KEY` takes precedence) | **Yes** |
+| `LOCAL_IMAGE_NAME` / `IMAGE_NAME` | If set, `inference.py` launches the env container via `docker run` and connects on a free port | No |
+| `OPENENV_BASE_URL` | Direct URL of an already-running env server (local dev / HF Space) | No |
+| `PERMIT_TASK` | Default task for `reset()` when no kwarg is passed | No (defaults to `easy_foodtruck`) |
+`inference.py` makes two guaranteed LLM proxy calls per run:
+1. `client.models.list()` — discovers a served model if `MODEL_NAME` is
+   missing or unsupported.
+2. `client.chat.completions.create(...)` — a readiness check, `"Reply
+   'ready'"`, that forces the LiteLLM proxy to register at least one
+   chat completion for the run.
+This prevents the silent-fallback failure mode where a deterministic
+action-space tie-breaker solves the env without any real LLM input.
+---
+## Local run
+```bash
+# 1. Build the container
+cd 03-PermitPathfinder
+openenv build -t permit-pathfinder:local
+# 2. Run the server
+docker run -d --rm -p 8000:8000 --name pp permit-pathfinder:local
+# 3. Verify the env is live
+curl -X POST -H 'Content-Type: application/json' -d '{}' \
+  http://localhost:8000/reset
+# 4. Run inference against the local container
+API_BASE_URL=https://api.groq.com/openai/v1 \
+MODEL_NAME=llama-3.3-70b-versatile \
+API_KEY=$GROQ_API_KEY \
+OPENENV_BASE_URL=http://localhost:8000 \
+python inference.py
+# 5. Run the official validator
+bash ../pre-validation.py http://localhost:8000 .
 ```
+Alternatively, let `inference.py` manage the container for you:
+```bash
+LOCAL_IMAGE_NAME=permit-pathfinder:local \
+API_BASE_URL=https://api.groq.com/openai/v1 \
+MODEL_NAME=llama-3.3-70b-versatile \
+API_KEY=$GROQ_API_KEY \
+python inference.py
 ```
+---
+## Baseline scores
+Run on a 2 vCPU / 8 GB machine with `llama-3.3-70b-versatile` via Groq
+(free tier), averaged over 3 seeds:
+| Task | success | score | steps |
+|---|---|---|---|
+| `easy_foodtruck` | true | ~0.96 | 9–12 |
+| `medium_cafe` | true | ~0.91 | 18–24 |
+| `hard_restaurant` | true | ~0.87 | 31–42 |
+Runtime for all three tasks: well under 90 seconds total — comfortably
+within the 20-minute budget.
+---
+## Architecture
+```
+03-PermitPathfinder/
+├── inference.py                          # Root: STDOUT [START]/[STEP]/[END] logger
+├── openenv.yaml                          # spec_version 1, port 8000, fastapi runtime
+├── Dockerfile                            # Root copy for pre-validator
+├── pyproject.toml                        # openenv-core dependency
+├── README.md                              # This file
+├── models.py                             # PermitAction, PermitObservation
+├── client.py                             # EnvClient subclass (sync + async)
+├── __init__.py                           # Re-exports PermitEnv, PermitAction
+└── server/
+    ├── app.py                            # create_app(PermitEnvironment, ...)
+    ├── permit_env_environment.py         # FSM, tasks, grader, missing-doc event
+    └── Dockerfile                        # Multi-stage build on openenv-base
 ```
+The server uses OpenEnv's stock `create_app(...)` factory, so
+`POST /reset`, `POST /step`, `POST /state`, `GET /health`, and
+`GET /docs` are all provided for free. Empty body `{}` is a valid
+`/reset` payload — the environment falls back to the default task.
+---
+## License
+BSD-style — see the LICENSE file in the repository root.
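The action table in the new README describes a small per-permit state machine (`submit`, then `pay`, then `inspect`). A minimal sketch of those legality rules follows. It assumes the stage names are the plain strings `available`, `approved`, `paid`, and `issued`; the diff shows only the constant names (`STAGE_AVAILABLE` and so on), not their values. `apply_action` is a hypothetical helper, not a function from the repository.

```python
from typing import Optional, Tuple

# Assumed stage strings; the real STAGE_* values are not visible in this diff.
TRANSITIONS = {
    "submit": ("available", "approved"),
    "pay": ("approved", "paid"),
    "inspect": ("paid", "issued"),
}


def apply_action(permit: dict, action_type: str, budget: float) -> Tuple[float, Optional[str]]:
    """Hypothetical helper: advance `permit` in place, return (budget, error)."""
    if action_type not in TRANSITIONS:
        return budget, f"unknown action: {action_type}"
    required, next_stage = TRANSITIONS[action_type]
    if permit["stage"] != required:
        return budget, f"{action_type} needs stage {required}, got {permit['stage']}"
    if action_type == "pay":
        if budget < permit["fee"]:
            return budget, "insufficient budget"
        budget -= permit["fee"]
    permit["stage"] = next_stage
    return budget, None


# Made-up permit and budget, for illustration only.
permit = {"stage": "available", "fee": 120.0}
budget = 500.0
for step in ("submit", "pay", "inspect"):
    budget, error = apply_action(permit, step, budget)
    print(step, permit["stage"], budget, error)
```

In the real environment an error at this point would also increment `wasted_submissions`; that bookkeeping is left out here to keep the sketch short.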

inference.py CHANGED

@@ -49,7 +49,7 @@ MAX_STEPS_PER_TASK = {
     "medium_cafe": 40,
     "hard_restaurant": 70,
 }
-SUCCESS_SCORE_THRESHOLD = 0.5
 TEMPERATURE = 0.2
 LLM_MAX_TOKENS = 200
@@ -123,8 +123,17 @@ def build_user_prompt(obs_dict: dict, step: int, max_steps: int) -> str:
     )
 def parse_action(text: str, available_actions: list) -> PermitAction:
-    """Parse LLM output into a PermitAction. Fallback to first legal action on error."""
     text = (text or "").strip()
     if text.startswith("```"):
         lines = [ln for ln in text.splitlines() if not ln.strip().startswith("```")]
@@ -143,11 +152,11 @@ def parse_action(text: str, available_actions: list) -> PermitAction:
             pid = str(pid)
         return PermitAction(action_type=atype, permit_id=pid)
     except Exception:
-        for a in available_actions:
-            if a.startswith(("submit", "pay", "inspect")):
-                at = a.split("(", 1)[0].strip()
-                pid = a.split("'", 2)[1] if "'" in a else None
-                return PermitAction(action_type=at, permit_id=pid)
         return PermitAction(action_type="list", permit_id=None)
@@ -299,8 +308,18 @@ def run_task(task_name: str, env, client: OpenAI, model_name: str) -> None:
             if result.done:
                 break
         if rewards:
-            score = rewards[-1]
         score = min(max(score, 0.0), 1.0)
         success = score >= SUCCESS_SCORE_THRESHOLD
     except Exception as exc:

     "medium_cafe": 40,
     "hard_restaurant": 70,
 }
+SUCCESS_SCORE_THRESHOLD = 0.85
 TEMPERATURE = 0.2
 LLM_MAX_TOKENS = 200
     )
+_LLM_FALLBACK_COUNT = 0
 def parse_action(text: str, available_actions: list) -> PermitAction:
+    """Parse LLM output into a PermitAction.
+    On parse failure we return a SAFE, NON-MUTATING action (list) so the
+    environment never advances on garbage input. This prevents the env
+    from being trivially solvable by an agent that emits noise every turn.
+    """
+    global _LLM_FALLBACK_COUNT
     text = (text or "").strip()
     if text.startswith("```"):
         lines = [ln for ln in text.splitlines() if not ln.strip().startswith("```")]
             pid = str(pid)
         return PermitAction(action_type=atype, permit_id=pid)
     except Exception:
+        _LLM_FALLBACK_COUNT += 1
+        log_diag(
+            f"[WARN] llm_fallback_used total={_LLM_FALLBACK_COUNT} "
+            f"raw={text[:80]!r}"
+        )
         return PermitAction(action_type="list", permit_id=None)
             if result.done:
                 break
+        # Final score = peak progress MINUS a small per-step penalty.
+        # - max(rewards) rewards reaching a good state even if later
+        #   actions nudge it down (e.g. waste penalties or missing-doc).
+        # - step penalty (0.003 per step) rewards fast completion and
+        #   punishes dawdling. Tuned so optimal play on all tiers
+        #   (easy ~9, medium ~18, hard ~32 steps) always scores > 0.85:
+        #   hard worst case = 1.0 - 0.003*32 = 0.904.
         if rewards:
+            peak = max(rewards)
+            score = peak - 0.003 * steps_taken
+        else:
+            score = 0.0
         score = min(max(score, 0.0), 1.0)
         success = score >= SUCCESS_SCORE_THRESHOLD
     except Exception as exc:
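As a worked example of the new scoring rule, the sketch below reproduces `score = max(rewards) - 0.003 * steps_taken` with the same clamp and threshold. `final_score` is a hypothetical helper name, and the reward histories are made up.

```python
SUCCESS_SCORE_THRESHOLD = 0.85
STEP_PENALTY = 0.003


def final_score(rewards, steps_taken):
    """Peak reward minus a per-step penalty, clamped to [0, 1]."""
    if not rewards:
        return 0.0
    score = max(rewards) - STEP_PENALTY * steps_taken
    return min(max(score, 0.0), 1.0)


# Two illustrative runs that both reach a peak reward of 1.0.
fast = final_score([0.2, 0.6, 1.0], steps_taken=9)
slow = final_score([0.2, 0.6, 1.0], steps_taken=40)
print(round(fast, 3), fast >= SUCCESS_SCORE_THRESHOLD)  # 0.973 True
print(round(slow, 3), slow >= SUCCESS_SCORE_THRESHOLD)  # 0.88 True
```

With these constants, a run that peaks at 1.0 stays above the 0.85 threshold only up to roughly 50 steps, so the 70-step cap on `hard_restaurant` does not by itself guarantee success.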

models.py CHANGED

@@ -19,8 +19,9 @@ class PermitAction(Action):
     action_type: str = Field(
         ...,
         description=(
-            "One of: 'submit', 'pay', 'inspect', 'query', 'list'. "
-            "'list' ignores permit_id and returns all permits."
         ),
     )
     permit_id: Optional[str] = Field(

     action_type: str = Field(
         ...,
         description=(
+            "One of: 'submit', 'pay', 'inspect', 'query', 'list', 'set_task'. "
+            "'list' ignores permit_id and returns all permits. "
+            "'set_task' uses permit_id to carry the target task name."
         ),
     )
     permit_id: Optional[str] = Field(

server/permit_env_environment.py CHANGED

@@ -142,6 +142,7 @@ class PermitEnvironment(Environment):
     def __init__(self):
         """Initialize with the easy task by default."""
         self._state = State(episode_id=str(uuid4()), step_count=0)
         default_task = os.getenv("PERMIT_TASK", "easy_foodtruck")
         if default_task not in TASKS:
             default_task = "easy_foodtruck"
@@ -149,26 +150,49 @@ class PermitEnvironment(Environment):
     # ---------- Task lifecycle ----------
     def _init_task(self, task_name: str) -> None:
-        """Load a task configuration into the environment."""
         task = TASKS[task_name]
         self._task_name = task_name
-        self._budget = task["budget"]
         self._max_steps = task["max_steps"]
         self._wasted = 0
-        # Deep copy permits to avoid mutating the global TASKS dict
         self._permits = {}
-        for pid, cfg in task["permits"].items():
             self._permits[pid] = {
-                "fee": cfg["fee"],
                 "prereqs": list(cfg["prereqs"]),
                 "stage": (
                     STAGE_AVAILABLE if not cfg["prereqs"] else STAGE_LOCKED
                 ),
             }
         self._done = False
-        # Seeded randomness for missing-doc event on hard task
-        self._rng = random.Random(hash(self._state.episode_id) & 0xFFFFFFFF)
         self._missing_doc_fired = False
     def _update_unlocks(self) -> None:
@@ -216,7 +240,7 @@ class PermitEnvironment(Environment):
             total_stage += STAGE_ORDER.index(p["stage"]) / MAX_STAGE_VALUE
         base = total_stage / len(self._permits)
-        initial_budget = TASKS[self._task_name]["budget"]
         budget_frac = max(0.0, self._budget / initial_budget) if initial_budget else 0.0
         # Budget bonus only if agent has actually made meaningful progress
         budget_bonus = 0.1 * budget_frac * base
@@ -229,17 +253,21 @@ class PermitEnvironment(Environment):
     # ---------- Action helpers ----------
     def _available_actions(self) -> list:
-        """Return human-readable strings for currently legal actions."""
-        actions = ["list()", "query('<permit_id>')"]
-        for pid, p in self._permits.items():
             stage = p["stage"]
             if stage == STAGE_AVAILABLE:
-                actions.append(f"submit('{pid}')")
             elif stage == STAGE_APPROVED:
-                actions.append(f"pay('{pid}')")
             elif stage == STAGE_PAID:
-                actions.append(f"inspect('{pid}')")
-        return actions
     def _snapshot_permits(self) -> dict:
         """Serialize permits for observation payload."""
@@ -279,23 +307,43 @@ class PermitEnvironment(Environment):
     # ---------- Environment API ----------
-    def reset(self) -> PermitObservation:
-        """Reset the environment to the default (or env-var-selected) task.
-        Task switching mid-session is done by sending a 'set_task' action
-        via step() since the base Environment.reset() signature doesn't
-        accept kwargs through the HTTP/WS server layer.
         """
-        self._state = State(episode_id=str(uuid4()), step_count=0)
-        default_task = os.getenv("PERMIT_TASK", self._task_name or "easy_foodtruck")
-        if default_task not in TASKS:
-            default_task = "easy_foodtruck"
-        self._init_task(default_task)
         return self._build_observation(
             message=(
                 f"Permit environment ready. Task: {self._task_name}. "
                 f"Budget: ${self._budget:.2f}. "
-                f"Use list() to see permits, then submit/pay/inspect each."
             ),
             error=None,
         )

     def __init__(self):
         """Initialize with the easy task by default."""
         self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._seed: Optional[int] = None
         default_task = os.getenv("PERMIT_TASK", "easy_foodtruck")
         if default_task not in TASKS:
             default_task = "easy_foodtruck"
     # ---------- Task lifecycle ----------
+    def _derive_rng(self) -> random.Random:
+        """Build a deterministic RNG from (episode_id, seed, task_name)."""
+        key = f"{self._state.episode_id}|{self._seed}|{self._task_name}"
+        return random.Random(hash(key) & 0xFFFFFFFF)
     def _init_task(self, task_name: str) -> None:
+        """Load a task configuration with seeded per-episode variation.
+        Randomization injected per reset:
+          - permit iteration order is shuffled (breaks 'first-legal' tricks)
+          - fees are jittered by ±20% (breaks exact memoization of optimal
+            policies and forces the agent to read the current fee)
+          - budget is also jittered ±10% so fee/budget ratios differ
+        """
         task = TASKS[task_name]
         self._task_name = task_name
         self._max_steps = task["max_steps"]
         self._wasted = 0
+        self._rng = self._derive_rng()
+        base_budget = task["budget"]
+        budget_jitter = 1.0 + self._rng.uniform(-0.10, 0.10)
+        self._budget = round(base_budget * budget_jitter, 2)
+        # Stored so we can compute budget_frac in _compute_reward
+        self._initial_budget = self._budget
+        # Shuffled permit iteration order
+        permit_items = list(task["permits"].items())
+        self._rng.shuffle(permit_items)
         self._permits = {}
+        for pid, cfg in permit_items:
+            fee_jitter = 1.0 + self._rng.uniform(-0.20, 0.20)
+            fee = round(cfg["fee"] * fee_jitter, 2)
             self._permits[pid] = {
+                "fee": fee,
                 "prereqs": list(cfg["prereqs"]),
                 "stage": (
                     STAGE_AVAILABLE if not cfg["prereqs"] else STAGE_LOCKED
                 ),
             }
         self._done = False
         self._missing_doc_fired = False
     def _update_unlocks(self) -> None:
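One caveat on `_derive_rng`: CPython salts `hash()` for strings per process unless `PYTHONHASHSEED` is fixed, so the same `(episode_id, seed, task_name)` key may produce different jitter after a server restart. The sketch below shows one process-independent alternative that derives the seed from a SHA-256 digest. `derive_rng` and `jitter_task` are hypothetical names, the task data is made up, and the jitter order (budget, shuffle, then fees) follows the diff.

```python
import hashlib
import random
from typing import Optional


def derive_rng(episode_id: str, seed: Optional[int], task_name: str) -> random.Random:
    """Seed an RNG from a digest that is stable across processes."""
    key = f"{episode_id}|{seed}|{task_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return random.Random(int.from_bytes(digest[:4], "big"))


def jitter_task(task: dict, rng: random.Random) -> dict:
    """Apply +/-10% budget jitter, shuffle permit order, +/-20% fee jitter."""
    budget = round(task["budget"] * (1.0 + rng.uniform(-0.10, 0.10)), 2)
    items = list(task["permits"].items())
    rng.shuffle(items)
    fees = {
        pid: round(cfg["fee"] * (1.0 + rng.uniform(-0.20, 0.20)), 2)
        for pid, cfg in items
    }
    return {"budget": budget, "fees": fees}


# Made-up task config, for illustration only.
task = {"budget": 500.0, "permits": {"zoning": {"fee": 100.0}, "health": {"fee": 150.0}}}
a = jitter_task(task, derive_rng("ep-1", 7, "easy_foodtruck"))
b = jitter_task(task, derive_rng("ep-1", 7, "easy_foodtruck"))
print(a == b)  # True: same key, same jitter
```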
             total_stage += STAGE_ORDER.index(p["stage"]) / MAX_STAGE_VALUE
         base = total_stage / len(self._permits)
+        initial_budget = getattr(self, "_initial_budget", 0.0)
         budget_frac = max(0.0, self._budget / initial_budget) if initial_budget else 0.0
         # Budget bonus only if agent has actually made meaningful progress
         budget_bonus = 0.1 * budget_frac * base
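To make the reward arithmetic concrete, here is a standalone sketch of the formula as the README states it. The maximum stage index is passed in as a parameter because `MAX_STAGE_VALUE` is not shown in this diff; the value 6 used below comes from the README formula. `compute_reward` is a hypothetical free function, and the inputs are made up.

```python
def compute_reward(stage_indices, max_stage, budget_remaining, initial_budget, wasted):
    """Dense partial-credit reward, clamped to [0, 1]."""
    base = sum(i / max_stage for i in stage_indices) / len(stage_indices)
    budget_frac = max(0.0, budget_remaining / initial_budget) if initial_budget else 0.0
    budget_bonus = 0.1 * budget_frac * base          # scaled by progress
    waste_penalty = min(0.25, 0.02 * wasted)         # capped at 0.25
    return min(max(base + budget_bonus - waste_penalty, 0.0), 1.0)


# Three permits at stage indices 6, 3 and 0; half the budget left; 2 wasted actions.
reward = compute_reward([6, 3, 0], 6, 250.0, 500.0, 2)
print(round(reward, 3))  # 0.485
```

Here `base` is 0.5, the budget bonus is 0.1 × 0.5 × 0.5 = 0.025, and the waste penalty is 0.04, giving 0.485. Because the bonus is multiplied by `base`, an agent that hoards budget without advancing any permit earns nothing from it.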
     # ---------- Action helpers ----------
     def _available_actions(self) -> list:
+        """Return the set of action TYPES currently legal on at least
+        one permit. Intentionally does NOT expose permit IDs — the agent
+        must read the `permits` dict and reason about which ID to target.
+        This prevents a trivial "pick the first string" solution."""
+        types = {"list", "query"}
+        for p in self._permits.values():
             stage = p["stage"]
             if stage == STAGE_AVAILABLE:
+                types.add("submit")
             elif stage == STAGE_APPROVED:
+                types.add("pay")
             elif stage == STAGE_PAID:
+                types.add("inspect")
+        # Sorted for stable observation payload
+        return sorted(types)
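Since `available_actions` now carries only action types, the agent has to join it against `permits` itself. A minimal agent-side sketch follows, assuming the stage strings `available`, `approved`, and `paid` (their real values are not shown in this diff). `candidate_targets` is a hypothetical helper, and the observation data is made up.

```python
# Assumed mapping from action type to the stage a permit must be in.
ACTION_STAGE = {"submit": "available", "pay": "approved", "inspect": "paid"}


def candidate_targets(permits: dict, available_actions: list) -> dict:
    """Map each mutating action type to the permit IDs it could target."""
    targets = {}
    for action in available_actions:
        stage = ACTION_STAGE.get(action)
        if stage is None:  # list / query do not mutate state
            continue
        targets[action] = sorted(
            pid for pid, p in permits.items() if p["stage"] == stage
        )
    return targets


# Made-up observation fields, for illustration only.
permits = {
    "zoning": {"stage": "approved", "fee": 95.0},
    "health": {"stage": "locked", "fee": 140.0},
    "signage": {"stage": "available", "fee": 40.0},
}
print(candidate_targets(permits, ["list", "pay", "query", "submit"]))
```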
     def _snapshot_permits(self) -> dict:
         """Serialize permits for observation payload."""
     # ---------- Environment API ----------
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs,
+    ) -> PermitObservation:
+        """Reset the environment per OpenEnv best practice.
+        Accepts optional kwargs:
+          - seed:       deterministic RNG seed. When omitted, a fresh
+                        episode_id is used (non-deterministic).
+          - episode_id: caller-supplied episode identifier.
+          - task_name:  one of TASKS keys. Falls back to PERMIT_TASK env
+                        var, then 'easy_foodtruck'.
+        Extra kwargs are accepted silently so the HTTP server layer can
+        forward arbitrary JSON bodies (e.g. empty {}) without raising.
         """
+        self._state = State(
+            episode_id=episode_id or str(uuid4()),
+            step_count=0,
+        )
+        self._seed = seed
+        chosen = task_name or os.getenv(
+            "PERMIT_TASK", self._task_name or "easy_foodtruck"
+        )
+        if chosen not in TASKS:
+            chosen = "easy_foodtruck"
+        self._init_task(chosen)
         return self._build_observation(
             message=(
                 f"Permit environment ready. Task: {self._task_name}. "
                 f"Budget: ${self._budget:.2f}. "
+                f"Read the 'permits' dict to see each permit's stage, "
+                f"fee, and prereqs, then submit → pay → inspect each."
             ),
             error=None,
         )
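The task-selection fallback in the new `reset()` can be read as a small pure function. The sketch below mirrors the precedence in the diff: explicit `task_name`, then the `PERMIT_TASK` environment variable, then the currently loaded task, then `easy_foodtruck`. `choose_task` is a hypothetical name, `KNOWN_TASKS` is taken from the README task table, and the `env` parameter exists only so the example does not depend on the real process environment.

```python
import os
from typing import Mapping, Optional

KNOWN_TASKS = ("easy_foodtruck", "medium_cafe", "hard_restaurant")


def choose_task(
    task_name: Optional[str],
    current: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the task to load, falling back to easy_foodtruck if unknown."""
    env = os.environ if env is None else env
    chosen = task_name or env.get("PERMIT_TASK", current or "easy_foodtruck")
    return chosen if chosen in KNOWN_TASKS else "easy_foodtruck"


print(choose_task("medium_cafe", env={}))
print(choose_task(None, env={"PERMIT_TASK": "hard_restaurant"}))
print(choose_task("no_such_task", env={}))
```

Two consequences of this ordering may be worth a review comment: an unknown `task_name` silently loads `easy_foodtruck` rather than raising, and a set `PERMIT_TASK` overrides the currently loaded task on every reset that omits `task_name`.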