Space: parvpareek/cache-env

Parv Pareek committed on Apr 13

Commit e75c8ce
1 Parent(s): 351158b

done

Files changed (21)

README.md +123 -42
app.py +3 -37
cache_invalidation_env.egg-info/PKG-INFO +2 -0
cache_invalidation_env.egg-info/SOURCES.txt +5 -2
cache_invalidation_env.egg-info/requires.txt +3 -0
env/__init__.py +13 -1
env/cache_environment.py +156 -0
env/client.py +30 -0
env/core.py +0 -91
env/generator.py +25 -18
env/grader.py +11 -26
env/models.py +38 -11
env/task_graders.py +35 -0
env/tasks.py +26 -4
inference.py +62 -39
openenv.yaml +18 -17
pyproject.toml +3 -0
server/app.py +51 -4
tests/conftest.py +10 -0
tests/test_phase1.py +73 -0
uv.lock +43 -0

README.md CHANGED

@@ -15,54 +15,139 @@ pinned: false
 **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
-**Our approach:** We simulate several cache **items** per episode. Each item has hidden staleness dynamics (TTL, update rate). The API only exposes **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks an action **per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **grader** produces a **final score in [0, 1]** from correctness, wasted invalidations, and stability.
-**Tasks:** Three difficulties — **easy**, **medium**, **hard** — differ by number of items and how volatile hidden state is, so the same policy can be compared across noise levels.
 ---
-## API (OpenEnv-style HTTP)
-| Method | Path | Role |
-|--------|------|------|
-| POST | `/reset` | New episode; returns `state` and `task_id` |
-| POST | `/step` | JSON body `{"type":"keep\|refresh\|invalidate","key":"item_0"}`; returns `state`, `reward`, `done`, optional `final_score` when episode ends |
-| GET | `/state` | Current observation |
-**Deployed Space (example):** `https://parvpareek-cache-env.hf.space` — ping with:
 ```bash
 curl -s -o /dev/null -w '%{http_code}\n' -X POST \
   -H 'Content-Type: application/json' -d '{}' \
-  'https://parvpareek-cache-env.hf.space/reset'
 ```
 Expect `200`.
-**Local run:** `pip install -r requirements.txt` then `uvicorn app:app --host 0.0.0.0 --port 7860` (or use the Dockerfile).
 ---
 ## Baseline inference (`inference.py`)
-- Uses the **OpenAI Python client** with **`API_BASE_URL`**, **`MODEL_NAME`**, and **`HF_TOKEN`** (set as environment variables or in a local `.env` loaded by `inference.py`; never commit tokens).
-- Talks to the **Space URL** above (override with `ENV_URL` if needed).
-- Prints exactly **`[START]`**, one **`[STEP]`** per env step, and **`[END]`** with `score` and `rewards` as required by the challenge spec.
-Run:
 ```bash
-export API_BASE_URL='https://router.huggingface.co/v1'
-export MODEL_NAME='<model your account can call>'
-export HF_TOKEN='hf_...'
 python inference.py
 ```
 ---
-## Validation (pre-submission)
-From the repo root:
 ```bash
 openenv validate
@@ -72,35 +157,31 @@ docker build .
 ---
-## Repository layout (high level)
 | Path | Purpose |
 |------|---------|
-| `app.py` | FastAPI app: `/reset`, `/step`, `/state` |
-| `env/` | Environment logic, tasks, grading, generation |
-| `openenv.yaml` | OpenEnv metadata |
-| `inference.py` | Baseline agent + structured logs |
-| `Dockerfile` | Space / CI image |
-| `pyproject.toml`, `uv.lock`, `server/app.py` | `openenv validate` / multi-mode layout |
 ---
-## Scoring (short)
-- **Per-step reward:** Shaped table (e.g. invalidate when stale is good; invalidate when fresh is penalized). Values can be negative in the middle of an episode.
-- **Episode `final_score` (when `done`):** Normalized grader in **[0, 1]** combining decision quality, unnecessary invalidations, and oscillation.
 ---
-## Summary
-| Criterion | Status |
-|-----------|--------|
-| Real-world task (not a toy game) | Cache invalidation under uncertainty |
-| `reset` / `step` / `state` | Implemented |
-| `openenv.yaml` | Present |
-| 3 tasks + grader | `easy` / `medium` / `hard` |
-| Meaningful rewards | Dense step reward + episode score in [0, 1] |
-| Baseline | `inference.py` + OpenAI client + stdout format |
-If anything fails in automated checks, compare your **Space app URL** (`*.hf.space`) and **pushed commit** to what you submit.

 **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
+**Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
+**Tasks:** **easy → medium → hard** — more items and higher volatility; each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.
 ---
+## OpenEnv spec compliance
+- **Typed models:** `env/models.py` — `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
+- **Environment:** `env/cache_environment.py` — `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`.
+- **HTTP server:** `server/app.py` — `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
+- **Manifest:** `openenv.yaml` — `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
+- **Client (WebSocket):** `env/client.py` — `CacheInvalidationEnvClient` for typed `EnvClient` usage.
+- **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.
+Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).
+---
+## Action & observation
+**Action (POST `/step` body, OpenEnv wrapped form):**
+```json
+{
+  "action": {
+    "type": "invalidate",
+    "key": "item_0"
+  }
+}
+```
+`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
+**Reset (POST `/reset`):**
+```json
+{
+  "seed": 42,
+  "task_id": "easy"
+}
+```
+Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
+**Response shape (reset & step):**
+```json
+{
+  "observation": {
+    "items": [...],
+    "step": 0,
+    "task_id": "easy",
+    "final_score": null,
+    "done": false
+  },
+  "reward": 0.0,
+  "done": false
+}
+```
+When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
+---
+## Tasks and graders
+- **Registry:** `env/task_graders.py` — `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
+- **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
+- **Episode grader:** `env/grader.py` — `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
+---
+## Setup & run
+**Install (dev):**
+```bash
+uv sync --extra dev
+```
+**Local server:**
+```bash
+uv run server
+# or
+uvicorn app:app --host 0.0.0.0 --port 7860
+```
+**Health check:**
 ```bash
 curl -s -o /dev/null -w '%{http_code}\n' -X POST \
   -H 'Content-Type: application/json' -d '{}' \
+  'http://127.0.0.1:7860/reset'
 ```
 Expect `200`.
+**Docker:** `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port **7860**).
 ---
 ## Baseline inference (`inference.py`)
+- Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
+- **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
+- **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
+- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).
 ```bash
+export ENV_URL='http://127.0.0.1:7860'   # or your Space https://....hf.space
+export EPISODE_SEED=42
+export TASK_ID=easy
 python inference.py
+# Phase-1 style: one process, three tasks
+RUN_ALL_TASKS=1 python inference.py
 ```
 ---
+## Tests (Phase 1 checks)
+```bash
+uv run pytest tests/ -q
+```
+Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.
+---
+## Validation (pre-submission)
 ```bash
 openenv validate
 ---
+## Repository layout
 | Path | Purpose |
 |------|---------|
+| `env/models.py` | Typed Action / Observation / State |
+| `env/cache_environment.py` | `Environment` implementation |
+| `env/grader.py` | Step rewards + episode `evaluate_episode` |
+| `env/task_graders.py` | **Three named agent graders** (registry) |
+| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
+| `env/client.py` | Typed WebSocket `EnvClient` |
+| `server/app.py` | `create_fastapi_app` + `/tasks` |
+| `app.py` | Uvicorn entry shim |
+| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
+| `openenv.yaml` | Full OpenEnv manifest |
+| `tests/` | Phase 1 pytest |
 ---
+## Scoring
+- **Per-step `reward`:** Shaped (can be negative mid-episode).
+- **`final_score`:** In **[0.0, 1.0]** when `done`; combines correctness, unnecessary invalidations, and action stability.
 ---
+## Resource notes
+Inference and the env server are lightweight (short episodes, small JSON). Suitable for **2 vCPU / 8 GiB**; keep `RUN_ALL_TASKS` episodes bounded (fixed 10 steps per episode × 3 tasks).
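
Reviewer sketch of the wire format the new README describes: a `seed` and `task_id` go to `/reset`, a wrapped `action` goes to `/step`, and `observation.final_score` is read once `done` is true. The helper names (`http_post`, `run_episode`), the always-`keep` policy, and the injectable `post` transport are assumptions made for illustration; they are not part of this commit.

```python
import json
from typing import Any, Callable, Dict, Optional
from urllib import request

Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(url: str, body: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON response (stdlib only)."""
    req = request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(
    base_url: str,
    post: Post = http_post,
    seed: int = 42,
    task_id: str = "easy",
    max_steps: int = 10,
) -> Optional[float]:
    """Reset, then step until `done`; return observation.final_score."""
    result = post(f"{base_url}/reset", {"seed": seed, "task_id": task_id})
    steps = 0
    while not result.get("done") and steps < max_steps:
        items = result["observation"]["items"]
        # Placeholder policy (hypothetical): always keep the first item.
        action = {"type": "keep", "key": items[0]["key"]}
        # The action is wrapped under "action", as in the README example.
        result = post(f"{base_url}/step", {"action": action})
        steps += 1
    return result["observation"].get("final_score")
```

Against a local server this would be called as `run_episode("http://127.0.0.1:7860")`. Passing a fake `post` function makes the loop testable without a network.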

app.py CHANGED

@@ -1,39 +1,5 @@
-from fastapi import Body, FastAPI
-from pydantic import BaseModel, ConfigDict
-from env.core import CacheEnv
-from env.tasks import TASK_MANIFEST
-app = FastAPI()
-env = CacheEnv()
-class ResetBody(BaseModel):
-    model_config = ConfigDict(extra="ignore")
-    task_id: str | None = None
-    task_name: str | None = None
-@app.post("/reset")
-def reset(body: ResetBody = Body(default_factory=ResetBody)):
-    task_key = body.task_id or body.task_name
-    state = env.reset(task_id=task_key)
-    return {
-        "state": state,
-        "task_id": state.get("task_id"),
-    }
-@app.get("/tasks")
-def list_tasks():
-    """Hub validators use this to discover tasks that expose episode grading (final_score)."""
-    return {"tasks": TASK_MANIFEST}
-@app.post("/step")
-def step(action: dict):
-    return env.step(action)
-@app.get("/state")
-def state():
-    return env.get_state()

+"""Shim for `uvicorn app:app` (Docker / local one-liners)."""
+from server.app import app
+__all__ = ["app"]

cache_invalidation_env.egg-info/PKG-INFO CHANGED

@@ -10,3 +10,5 @@ Requires-Dist: pydantic>=2.0.0
 Requires-Dist: requests>=2.28.0
 Requires-Dist: openai>=1.0.0
 Requires-Dist: python-dotenv>=1.0.0

 Requires-Dist: requests>=2.28.0
 Requires-Dist: openai>=1.0.0
 Requires-Dist: python-dotenv>=1.0.0
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"

cache_invalidation_env.egg-info/SOURCES.txt CHANGED

@@ -7,10 +7,13 @@ cache_invalidation_env.egg-info/entry_points.txt
 cache_invalidation_env.egg-info/requires.txt
 cache_invalidation_env.egg-info/top_level.txt
 env/__init__.py
-env/core.py
 env/generator.py
 env/grader.py
 env/models.py
 env/tasks.py
 server/__init__.py
-server/app.py

 cache_invalidation_env.egg-info/requires.txt
 cache_invalidation_env.egg-info/top_level.txt
 env/__init__.py
+env/cache_environment.py
+env/client.py
 env/generator.py
 env/grader.py
 env/models.py
+env/task_graders.py
 env/tasks.py
 server/__init__.py
+server/app.py
+tests/test_phase1.py

cache_invalidation_env.egg-info/requires.txt CHANGED

@@ -5,3 +5,6 @@ pydantic>=2.0.0
 requests>=2.28.0
 openai>=1.0.0
 python-dotenv>=1.0.0

 requests>=2.28.0
 openai>=1.0.0
 python-dotenv>=1.0.0
+[dev]
+pytest>=8.0

env/__init__.py CHANGED

@@ -1 +1,13 @@
-# Cache invalidation environment package

+"""Cache invalidation OpenEnv package."""
+from env.cache_environment import CacheInvalidationEnvironment
+from env.client import CacheInvalidationEnvClient
+from env.models import CacheAction, CacheObservation, CacheState
+__all__ = [
+    "CacheAction",
+    "CacheObservation",
+    "CacheState",
+    "CacheInvalidationEnvironment",
+    "CacheInvalidationEnvClient",
+]

env/cache_environment.py ADDED

@@ -0,0 +1,156 @@

+"""OpenEnv Environment: cache invalidation under partial observability."""
+from __future__ import annotations
+import random
+from typing import Any, Optional
+from openenv.core.env_server import Environment
+from openenv.core.env_server.types import EnvironmentMetadata
+from env.generator import generate_env
+from env.grader import compute_step_reward, evaluate_episode
+from env.models import CacheAction, CacheItem, CacheObservation, CacheState
+from env.tasks import sample_task
+class CacheInvalidationEnvironment(Environment[CacheAction, CacheObservation, CacheState]):
+    """Stateful cache control: invalidate, refresh, or keep per step (one key)."""
+    SUPPORTS_CONCURRENT_SESSIONS = False
+    def __init__(self) -> None:
+        super().__init__()
+        self._rng: random.Random | type[random] = random
+        self.history: list[dict[str, Any]] = []
+        self.task_id: str = "easy"
+        self.hidden: list[dict[str, Any]] = []
+        self.current_time: int = 0
+        self._items: list[dict[str, Any]] = []
+        self._step: int = 0
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        task_id: Optional[str] = None,
+        task_name: Optional[str] = None,
+        **kwargs: Any,
+    ) -> CacheObservation:
+        tid = task_id or task_name or kwargs.get("task_id") or kwargs.get("task_name")
+        self._reset_rubric()
+        if seed is not None:
+            self._rng = random.Random(int(seed))
+        else:
+            self._rng = random
+        self.history = []
+        if tid in ("easy", "medium", "hard"):
+            self.task_id = tid
+        else:
+            self.task_id = sample_task(self._rng)
+        items, hidden, current_time = generate_env(self.task_id, rng=self._rng)
+        self._items = items
+        self.hidden = hidden
+        self.current_time = current_time
+        self._step = 0
+        return self._observation(
+            reward=None,
+            done=False,
+            final_score=None,
+        )
+    def step(
+        self,
+        action: CacheAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> CacheObservation:
+        key = action.key
+        action_type = action.type
+        item_index = next(
+            (i for i, x in enumerate(self._items) if x["key"] == key), None
+        )
+        if item_index is None:
+            return self._observation(reward=-1.0, done=True, final_score=None)
+        hidden = self.hidden[item_index]
+        item = self._items[item_index]
+        age = self.current_time - hidden["last_update"]
+        is_stale = age > hidden["base_ttl"] or self._rng.random() < hidden["update_freq"]
+        self.history.append({"action": action_type, "is_stale": is_stale})
+        reward = compute_step_reward(action_type, is_stale)
+        if action_type == "invalidate":
+            hidden["last_update"] = self.current_time
+            item["age"] = 0
+        elif action_type == "refresh":
+            hidden["last_update"] = self.current_time - 1
+            item["age"] = 1
+        elif action_type == "keep":
+            item["age"] += 1
+        item["last_result"] = (
+            "stale"
+            if is_stale and self._rng.random() < 0.7
+            else "hit"
+            if not is_stale or self._rng.random() < 0.9
+            else "stale"
+        )
+        self.current_time += 1
+        self._step += 1
+        done = self._step >= 10
+        final_score = evaluate_episode(self.history) if done else None
+        return self._observation(
+            reward=reward,
+            done=done,
+            final_score=final_score,
+        )
+    @property
+    def state(self) -> CacheState:
+        return CacheState(
+            episode_id=None,
+            step_count=self._step,
+            task_id=self.task_id,
+            items=[CacheItem.model_validate(x) for x in self._items],
+        )
+    def get_metadata(self) -> EnvironmentMetadata:
+        return EnvironmentMetadata(
+            name="cache_invalidation_env",
+            description=(
+                "Cache invalidation under uncertainty: choose invalidate, refresh, or keep "
+                "per step from noisy hit/stale observations."
+            ),
+            version="1.0.0",
+        )
+    def _observation(
+        self,
+        *,
+        reward: float | None,
+        done: bool,
+        final_score: float | None,
+    ) -> CacheObservation:
+        return CacheObservation(
+            done=done,
+            reward=reward,
+            items=[CacheItem.model_validate(x) for x in self._items],
+            step=self._step,
+            task_id=self.task_id,
+            final_score=final_score,
+        )
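
Reviewer sketch of the observation noise in `step`: the nested conditional that sets `last_result` can be isolated and sampled to see what the agent actually observes. The function names `observe` and `stale_report_rate` are hypothetical; the expression itself is copied from the diff above. Reading the branches, a stale item is reported as `"stale"` with probability 0.7 + 0.3 × 0.1 = 0.73, while a fresh item is always reported as `"hit"`, so the noise is one-sided.

```python
import random


def observe(is_stale: bool, rng: random.Random) -> str:
    """Same conditional as in CacheInvalidationEnvironment.step."""
    return (
        "stale"
        if is_stale and rng.random() < 0.7
        else "hit"
        if not is_stale or rng.random() < 0.9
        else "stale"
    )


def stale_report_rate(is_stale: bool, trials: int = 20000, seed: int = 0) -> float:
    """Empirical fraction of observations that read "stale"."""
    rng = random.Random(seed)
    reported = sum(observe(is_stale, rng) == "stale" for _ in range(trials))
    return reported / trials


if __name__ == "__main__":
    print("stale item reported stale:", stale_report_rate(True))   # close to 0.73
    print("fresh item reported stale:", stale_report_rate(False))  # exactly 0.0
```

One consequence for a policy: under this model a `"stale"` observation is never a false alarm, whereas a `"hit"` on an old item may still hide staleness.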

env/client.py ADDED

@@ -0,0 +1,30 @@

+"""Typed WebSocket client for CacheInvalidationEnvironment."""
+from __future__ import annotations
+from typing import Any, Dict
+from openenv.core.client_types import StepResult
+from openenv.core.env_client import EnvClient
+from env.models import CacheAction, CacheObservation, CacheState
+class CacheInvalidationEnvClient(EnvClient[CacheAction, CacheObservation, CacheState]):
+    def _step_payload(self, action: CacheAction | Dict[str, Any]) -> Dict[str, Any]:
+        if isinstance(action, CacheAction):
+            return action.model_dump()
+        return CacheAction.model_validate(action).model_dump()
+    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[CacheObservation]:
+        obs_inner = payload.get("observation", {})
+        return StepResult(
+            observation=CacheObservation.model_validate(
+                {**obs_inner, "reward": payload.get("reward"), "done": payload.get("done", False)}
+            ),
+            reward=payload.get("reward"),
+            done=payload.get("done", False),
+        )
+    def _parse_state(self, payload: Dict[str, Any]) -> CacheState:
+        return CacheState.model_validate(payload)

env/core.py DELETED

@@ -1,91 +0,0 @@
-import random
-from env.generator import generate_env
-from env.grader import compute_step_reward
-from env.tasks import sample_task
-class CacheEnv:
-    def __init__(self):
-        self.reset()
-    def reset(self, task_id=None):
-        self.history = []
-        if task_id in ("easy", "medium", "hard"):
-            self.task_id = task_id
-        else:
-            self.task_id = sample_task()
-        items, hidden, current_time = generate_env(self.task_id)
-        self.state = {
-            "items": items,
-            "step": 0,
-            "task_id": self.task_id
-        }
-        self.hidden = hidden
-        self.current_time = current_time
-        self.total_reward = 0
-        return self.state
-    def step(self, action):
-        key = action.get("key")
-        action_type = action.get("type")
-        item_index = next((i for i, x in enumerate(self.state["items"]) if x["key"] == key), None)
-        if item_index is None:
-            return {"state": self.state, "reward": -1.0, "done": True}
-        hidden = self.hidden[item_index]
-        item = self.state["items"][item_index]
-        # hidden staleness
-        age = self.current_time - hidden["last_update"]
-        is_stale = age > hidden["base_ttl"] or random.random() < hidden["update_freq"]
-        self.history.append({
-            "action": action_type,
-            "is_stale": is_stale
-        })
-        reward = compute_step_reward(action_type, is_stale)
-        self.total_reward += reward
-        # apply action
-        if action_type == "invalidate":
-            hidden["last_update"] = self.current_time
-            item["age"] = 0
-        elif action_type == "refresh":
-            hidden["last_update"] = self.current_time - 1
-            item["age"] = 1
-        elif action_type == "keep":
-            item["age"] += 1
-        # noisy observation
-        item["last_result"] = (
-            "stale" if is_stale and random.random() < 0.7
-            else "hit" if not is_stale or random.random() < 0.9
-            else "stale"
-        )
-        self.current_time += 1
-        self.state["step"] += 1
-        done = self.state["step"] >= 10
-        from env.grader import evaluate_episode
-        if done:
-            final_score = evaluate_episode(self.history)
-        else:
-            final_score = None
-        return {
-            "state": self.state,
-            "reward": reward,
-            "done": done,
-            "task_id": self.task_id,
-            "final_score": final_score
-        }
-    def get_state(self):
-        return self.state

env/generator.py CHANGED

@@ -1,7 +1,10 @@
 import random
 from env.tasks import get_task
-def generate_env(task_id):
     config = get_task(task_id)
     state_items = []
@@ -10,27 +13,31 @@ def generate_env(task_id):
     current_time = 0
     for i in range(config["num_items"]):
-        base_ttl = random.randint(3, 8)
-        update_freq = random.uniform(0.1, config["volatility"])
-        last_update = random.randint(0, 3)
         age = current_time - last_update
-        is_stale = age > base_ttl or random.random() < update_freq
-        last_result = "stale" if is_stale and random.random() < 0.7 else "hit"
-        state_items.append({
-            "key": f"item_{i}",
-            "age": max(age, 0),
-            "access_count": random.randint(1, 20),
-            "last_result": last_result
-        })
-        hidden_items.append({
-            "base_ttl": base_ttl,
-            "update_freq": update_freq,
-            "last_update": last_update
-        })
-    return state_items, hidden_items, current_time

 import random
 from env.tasks import get_task
+def generate_env(task_id, rng=None):
+    """Build initial items and hidden dynamics. Use *rng* for reproducible episodes."""
+    r = rng if rng is not None else random
     config = get_task(task_id)
     state_items = []
     current_time = 0
     for i in range(config["num_items"]):
+        base_ttl = r.randint(3, 8)
+        update_freq = r.uniform(0.1, config["volatility"])
+        last_update = r.randint(0, 3)
         age = current_time - last_update
+        is_stale = age > base_ttl or r.random() < update_freq
+        last_result = "stale" if is_stale and r.random() < 0.7 else "hit"
+        state_items.append(
+            {
+                "key": f"item_{i}",
+                "age": max(age, 0),
+                "access_count": r.randint(1, 20),
+                "last_result": last_result,
+            }
+        )
+        hidden_items.append(
+            {
+                "base_ttl": base_ttl,
+                "update_freq": update_freq,
+                "last_update": last_update,
+            }
+        )
+    return state_items, hidden_items, current_time
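
Reviewer sketch of why threading `rng` through `generate_env` gives reproducible episodes: two `random.Random` instances built from the same seed produce the same sequence of draws. The helper `draw_hidden` is hypothetical and only mirrors the three hidden-parameter draws from the diff; the `num_items=3`, `volatility=0.3` arguments are the fallback values visible in `env/tasks.py`.

```python
import random
from typing import Any, Dict, List, Optional


def draw_hidden(
    num_items: int, volatility: float, rng: Optional[random.Random] = None
) -> List[Dict[str, Any]]:
    """Draw hidden dynamics per item, in the same order as generate_env."""
    r = rng if rng is not None else random  # fall back to the global module RNG
    hidden = []
    for _ in range(num_items):
        base_ttl = r.randint(3, 8)
        update_freq = r.uniform(0.1, volatility)
        last_update = r.randint(0, 3)
        hidden.append(
            {"base_ttl": base_ttl, "update_freq": update_freq, "last_update": last_update}
        )
    return hidden


if __name__ == "__main__":
    a = draw_hidden(3, 0.3, random.Random(42))
    b = draw_hidden(3, 0.3, random.Random(42))
    print(a == b)  # same seed, same hidden parameters
```

Because every draw consumes state from the same generator, the order of calls matters: adding or reordering a draw changes all later values for a given seed.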

env/grader.py CHANGED

@@ -1,9 +1,6 @@
-# Submission validators require final scores strictly in (0, 1), not at the endpoints.
-_SCORE_EPS = 1e-4
-def clamp_strict_unit_interval(x: float) -> float:
-    return float(min(1.0 - _SCORE_EPS, max(_SCORE_EPS, x)))
 def compute_step_reward(action_type, is_stale):
@@ -20,11 +17,10 @@ def compute_step_reward(action_type, is_stale):
     return reward
 def normalize_episode_score(total_reward, max_steps=10):
-    # expected max ≈ 1.0 per step
     score = total_reward / max_steps
-    return clamp_strict_unit_interval(max(0.0, min(1.0, score)))
 def evaluate_episode(history):
@@ -35,11 +31,10 @@ def evaluate_episode(history):
         "is_stale": bool
     }
     """
     total_steps = len(history)
     if total_steps == 0:
-        return clamp_strict_unit_interval(0.0)
     correct_decisions = 0
     unnecessary_invalidations = 0
@@ -51,33 +46,23 @@ def evaluate_episode(history):
         action = step["action"]
         is_stale = step["is_stale"]
-        # ✅ correctness (freshness proxy)
-        if (is_stale and action in ["invalidate", "refresh"]) or \
-           (not is_stale and action == "keep"):
             correct_decisions += 1
-        # ❌ unnecessary invalidation
         if action == "invalidate" and not is_stale:
             unnecessary_invalidations += 1
-        # ❌ oscillation (flip behavior)
         if last_action and last_action != action:
             oscillations += 1
         last_action = action
-    # ---- normalize metrics ----
     freshness = correct_decisions / total_steps
     efficiency = 1 - (unnecessary_invalidations / total_steps)
     stability = 1 - (oscillations / total_steps)
-    # ---- weighted score ----
-    score = (
-        0.5 * freshness +
-        0.3 * efficiency +
-        0.2 * stability
-    )
-    return clamp_strict_unit_interval(max(0.0, min(1.0, score)))

+def clamp_unit_interval(x: float) -> float:
+    """Clamp to [0.0, 1.0] (Phase 1 / rubric)."""
+    return max(0.0, min(1.0, float(x)))
 def compute_step_reward(action_type, is_stale):
     return reward
 def normalize_episode_score(total_reward, max_steps=10):
     score = total_reward / max_steps
+    return clamp_unit_interval(score)
 def evaluate_episode(history):
         "is_stale": bool
     }
     """
     total_steps = len(history)
     if total_steps == 0:
+        return clamp_unit_interval(0.0)
     correct_decisions = 0
     unnecessary_invalidations = 0
         action = step["action"]
         is_stale = step["is_stale"]
+        if (is_stale and action in ["invalidate", "refresh"]) or (
+            not is_stale and action == "keep"
+        ):
             correct_decisions += 1
         if action == "invalidate" and not is_stale:
             unnecessary_invalidations += 1
         if last_action and last_action != action:
             oscillations += 1
         last_action = action
     freshness = correct_decisions / total_steps
     efficiency = 1 - (unnecessary_invalidations / total_steps)
     stability = 1 - (oscillations / total_steps)
+    score = 0.5 * freshness + 0.3 * efficiency + 0.2 * stability
+    return clamp_unit_interval(score)
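
Reviewer sketch of the episode rubric with a worked example. The function `score_history` is a hypothetical stand-in that follows the visible lines of `evaluate_episode`; the initial values `oscillations = 0` and `last_action = None` are assumptions, since those lines fall in collapsed diff context.

```python
from typing import Any, Dict, List


def clamp_unit_interval(x: float) -> float:
    """Clamp to [0.0, 1.0]."""
    return max(0.0, min(1.0, float(x)))


def score_history(history: List[Dict[str, Any]]) -> float:
    """0.5 * freshness + 0.3 * efficiency + 0.2 * stability, clamped."""
    total = len(history)
    if total == 0:
        return 0.0
    correct = unnecessary = oscillations = 0  # assumed initial values
    last_action = None                        # assumed initial value
    for step in history:
        action, is_stale = step["action"], step["is_stale"]
        # Correct: act on stale items, leave fresh items alone.
        if (is_stale and action in ("invalidate", "refresh")) or (
            not is_stale and action == "keep"
        ):
            correct += 1
        if action == "invalidate" and not is_stale:
            unnecessary += 1
        # Any change of action type between consecutive steps counts.
        if last_action and last_action != action:
            oscillations += 1
        last_action = action
    freshness = correct / total
    efficiency = 1 - unnecessary / total
    stability = 1 - oscillations / total
    return clamp_unit_interval(0.5 * freshness + 0.3 * efficiency + 0.2 * stability)


if __name__ == "__main__":
    history = [
        {"action": "invalidate", "is_stale": True},
        {"action": "keep", "is_stale": False},
        {"action": "invalidate", "is_stale": False},
        {"action": "keep", "is_stale": False},
    ]
    # freshness 3/4, efficiency 3/4, stability 1/4 -> about 0.65
    print(score_history(history))
```

The stability term penalizes every switch between action types, including justified ones, so a policy that alternates correctly between `invalidate` and `keep` still gives up part of the 0.2 weight.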

env/models.py CHANGED

@@ -1,16 +1,43 @@
-from pydantic import BaseModel
-from typing import List
 class CacheItem(BaseModel):
     key: str
-    age: int
-    access_count: int
-    last_result: str  # "hit" or "stale"
-class State(BaseModel):
-    items: List[CacheItem]
-    step: int
-class Action(BaseModel):
-    type: str  # invalidate | keep | refresh
-    key: str

+"""Typed OpenEnv contracts: Action, Observation, State."""
+from __future__ import annotations
+from typing import Literal
+from openenv.core.env_server import Action, Observation, State
+from pydantic import BaseModel, ConfigDict, Field
 class CacheItem(BaseModel):
+    model_config = ConfigDict(extra="allow")
     key: str
+    age: int = Field(ge=0)
+    access_count: int = Field(ge=0)
+    last_result: str
+class CacheAction(Action):
+    """Per-step decision for one cache key."""
+    type: Literal["invalidate", "refresh", "keep"]
+    key: str
+class CacheObservation(Observation):
+    """What the agent sees (no hidden TTL / true staleness)."""
+    items: list[CacheItem] = Field(default_factory=list)
+    step: int = Field(default=0, ge=0)
+    task_id: str = ""
+    final_score: float | None = Field(
+        default=None,
+        description="Episode grader output in [0,1] when done=True; else None.",
+    )
+class CacheState(State):
+    """Server-visible state (no hidden dynamics)."""
+    task_id: str = ""
+    items: list[CacheItem] = Field(default_factory=list)

env/task_graders.py ADDED

@@ -0,0 +1,35 @@

+"""
+Registered agent graders — one enabled grader per task (easy / medium / hard).
+Automated checks count tasks that declare a grader and can run episode scoring.
+All three share the same history-based rubric; difficulty is enforced by the
+environment dynamics (items + volatility), not by different formulas.
+"""
+from __future__ import annotations
+from typing import Any, Callable, Dict, List
+from env.grader import evaluate_episode
+History = List[Dict[str, Any]]
+def easy_agent_grader(history: History) -> float:
+    return evaluate_episode(history)
+def medium_agent_grader(history: History) -> float:
+    return evaluate_episode(history)
+def hard_agent_grader(history: History) -> float:
+    return evaluate_episode(history)
+# Explicit registry (imported by server /tasks and static analysis)
+TASK_AGENT_GRADERS: Dict[str, Callable[[History], float]] = {
+    "easy": easy_agent_grader,
+    "medium": medium_agent_grader,
+    "hard": hard_agent_grader,
+}

env/tasks.py CHANGED

@@ -1,6 +1,8 @@
 import random
-# Declared for GET /tasks and openenv.yaml (submission validators count tasks with graders).
 TASK_MANIFEST = [
     {
         "name": "easy",
@@ -10,6 +12,8 @@ TASK_MANIFEST = [
         "difficulty": "easy",
         "max_steps": 10,
         "grader": True,
         "score_range": [0.0, 1.0],
     },
     {
@@ -20,6 +24,8 @@ TASK_MANIFEST = [
         "difficulty": "medium",
         "max_steps": 10,
         "grader": True,
         "score_range": [0.0, 1.0],
     },
     {
@@ -30,6 +36,8 @@ TASK_MANIFEST = [
         "difficulty": "hard",
         "max_steps": 10,
         "grader": True,
         "score_range": [0.0, 1.0],
     },
 ]
@@ -47,8 +55,22 @@ def get_task(task_id):
     else:
         return {
             "num_items": 3,
-            "volatility": 0.3
         }
-def sample_task():
-    return random.choice(["easy", "medium", "hard"])

 import random
+from env.task_graders import TASK_AGENT_GRADERS
+# Declared for GET /tasks + openenv.yaml (Phase 1 task/grader discovery).
 TASK_MANIFEST = [
     {
         "name": "easy",
         "difficulty": "easy",
         "max_steps": 10,
         "grader": True,
+        "grader_kind": "programmatic",
+        "grader_callable": "env.task_graders:easy_agent_grader",
         "score_range": [0.0, 1.0],
     },
     {
         "difficulty": "medium",
         "max_steps": 10,
         "grader": True,
+        "grader_kind": "programmatic",
+        "grader_callable": "env.task_graders:medium_agent_grader",
         "score_range": [0.0, 1.0],
     },
     {
         "difficulty": "hard",
         "max_steps": 10,
         "grader": True,
+        "grader_kind": "programmatic",
+        "grader_callable": "env.task_graders:hard_agent_grader",
         "score_range": [0.0, 1.0],
     },
 ]
     else:
         return {
             "num_items": 3,
+            "volatility": 0.3,
         }
+def sample_task(rng=None):
+    r = rng if rng is not None else random
+    return r.choice(["easy", "medium", "hard"])
+def list_graders():
+    """Return task ids that have an enabled agent grader."""
+    return [
+        {
+            "task": name,
+            "grader_enabled": fn is not None,
+            "callable": getattr(fn, "__name__", str(fn)),
+        }
+        for name, fn in TASK_AGENT_GRADERS.items()
+    ]
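
Reviewer sketch of how a `grader_callable` string such as `env.task_graders:easy_agent_grader` could be turned into a function. How the actual validators resolve these strings is not shown in this commit; `resolve_callable` is a hypothetical helper that assumes the common `module:attribute` convention and uses only `importlib`.

```python
import importlib
from typing import Any, Callable


def resolve_callable(spec: str) -> Callable[..., Any]:
    """Resolve a "package.module:attribute" string to a callable."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not name a callable")
    return obj


if __name__ == "__main__":
    # A stdlib target keeps the demo runnable outside this repository.
    sqrt = resolve_callable("math:sqrt")
    print(sqrt(9.0))  # 3.0
```

Inside the repository, `resolve_callable("env.task_graders:easy_agent_grader")` would be expected to return the same function object that `TASK_AGENT_GRADERS["easy"]` holds, which gives a quick consistency check between the manifest and the registry.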

inference.py CHANGED

@@ -3,14 +3,13 @@ import os
 import sys
 import textwrap
 from pathlib import Path
-from typing import List, Optional
 import requests
 from openai import OpenAI
-from env.grader import clamp_strict_unit_interval
-# Load .env from repo root so HF_TOKEN / API_BASE_URL work when you run: python inference.py
 try:
     from dotenv import load_dotenv
@@ -18,33 +17,36 @@ try:
 except ImportError:
     pass
-# ---- Mandatory env (see hackathon spec) ----
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
 API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
-# HF deprecated api-inference.huggingface.co (410); router is the supported OpenAI-compatible host.
 MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
-ENV_URL = os.getenv("ENV_URL", "https://parvpareek-cache-env.hf.space")
 BENCHMARK = "cache_invalidation_env"
 if not API_KEY:
     print(
-        "WARNING: HF_TOKEN is not set. LLM calls will fail; the script will fall back to the "
-        "heuristic policy. Set HF_TOKEN in the environment or in a .env file next to inference.py.",
         file=sys.stderr,
     )
 client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf-invalid")
-MEMORY = {}
-LAST_USED = None
 SYSTEM_PROMPT = textwrap.dedent(
     """
-    You are a cache invalidation agent. Given the environment state (JSON), reply with exactly one JSON object
     on a single line, no markdown, with keys "type" and "key". type must be one of: invalidate, refresh, keep.
-    key must match one of the item keys in state["items"].
     """
 ).strip()
@@ -72,11 +74,11 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
     )
-def select_item(state, step):
     global LAST_USED
-    items = state["items"]
-    def score(item):
         s = 0
         if item["last_result"] == "stale":
             s += 3
@@ -98,7 +100,7 @@ def select_item(state, step):
     return best
-def decide(item, step):
     key = item["key"]
     last_result = item["last_result"]
     age = item["age"]
@@ -123,8 +125,7 @@ def decide(item, step):
     return {"type": "keep", "key": key}
-def llm_action(state) -> Optional[dict]:
-    """Call HF OpenAI-compatible API; return None on any failure so caller can fall back."""
     try:
         completion = client.chat.completions.create(
             model=MODEL_NAME,
@@ -133,7 +134,7 @@ def llm_action(state) -> Optional[dict]:
                 {
                     "role": "user",
                     "content": (
-                        f"State:\n{json.dumps(state)}\n\n"
                         'Return JSON only: {"type": "...", "key": "..."}'
                     ),
                 },
@@ -156,7 +157,8 @@ def llm_action(state) -> Optional[dict]:
     return None
-def run() -> None:
     global LAST_USED
     LAST_USED = None
     MEMORY.clear()
@@ -165,26 +167,28 @@ def run() -> None:
     steps_taken = 0
     episode_score = 0.0
     success = False
     try:
-        score_from_env = False
         res = requests.post(
-            f"{ENV_URL}/reset",
-            json={},
             headers={"Content-Type": "application/json"},
             timeout=60,
         )
         res.raise_for_status()
         body = res.json()
-        state = body.get("state", body)
-        task_id = str(body.get("task_id", "unknown"))
-        log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
         for step in range(1, 11):
-            item = select_item(state, step)
-            action = llm_action(state)
             if action is None:
                 action = decide(item, step)
@@ -194,21 +198,22 @@ def run() -> None:
             }
             step_res = requests.post(
-                f"{ENV_URL}/step",
-                json=action,
                 headers={"Content-Type": "application/json"},
                 timeout=60,
             )
             step_res.raise_for_status()
             data = step_res.json()
-            reward = float(data["reward"])
             done = bool(data["done"])
             rewards.append(reward)
             steps_taken = step
-            if data.get("final_score") is not None:
-                episode_score = float(data["final_score"])
                 score_from_env = True
             log_step(
@@ -219,8 +224,7 @@ def run() -> None:
                 error=None,
             )
-            state = data["state"]
             if done:
                 break
@@ -229,13 +233,13 @@ def run() -> None:
             success = avg_r > 0.3
         if not score_from_env and rewards:
             avg_r = sum(rewards) / len(rewards)
-            episode_score = max(0.0, min(1.0, (avg_r + 1.0) / 2.0))
     except Exception as exc:
         success = False
         print(f"[RUN] fatal: {exc}", file=sys.stderr)
     finally:
-        episode_score = clamp_strict_unit_interval(episode_score)
         log_end(
             success=success,
             steps=steps_taken,
@@ -244,5 +248,24 @@ def run() -> None:
         )
 if __name__ == "__main__":
     run()

 import sys
 import textwrap
 from pathlib import Path
+from typing import Any, Dict, List, Optional
 import requests
 from openai import OpenAI
+from env.grader import clamp_unit_interval
 try:
     from dotenv import load_dotenv
 except ImportError:
     pass
 API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
 API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
 MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ENV_URL = os.getenv(
+    "ENV_URL",
+    "http://127.0.0.1:7860",
+).rstrip("/")
 BENCHMARK = "cache_invalidation_env"
+# Reproducibility (Phase 1 / baseline): fixed seed + task → deterministic heuristic run.
+EPISODE_SEED = int(os.getenv("EPISODE_SEED", "42"))
+TASK_ID = os.getenv("TASK_ID", "easy")
 if not API_KEY:
     print(
+        "WARNING: neither HF_TOKEN nor API_KEY is set. LLM calls are skipped; the script "
+        "will use the heuristic policy only.",
         file=sys.stderr,
     )
 client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf-invalid")
+MEMORY: Dict[str, Any] = {}
+LAST_USED: Optional[str] = None
 SYSTEM_PROMPT = textwrap.dedent(
     """
+    You are a cache invalidation agent. Given the environment observation (JSON), reply with exactly one JSON object
     on a single line, no markdown, with keys "type" and "key". type must be one of: invalidate, refresh, keep.
+    key must match one of the item keys in observation["items"].
     """
 ).strip()
     )
+def select_item(obs: Dict[str, Any], step: int) -> Dict[str, Any]:
     global LAST_USED
+    items = obs["items"]
+    def score(item: Dict[str, Any]) -> int:
         s = 0
         if item["last_result"] == "stale":
             s += 3
     return best
+def decide(item: Dict[str, Any], step: int) -> Dict[str, str]:
     key = item["key"]
     last_result = item["last_result"]
     age = item["age"]
     return {"type": "keep", "key": key}
+def llm_action(obs: Dict[str, Any]) -> Optional[dict]:
     try:
         completion = client.chat.completions.create(
             model=MODEL_NAME,
                 {
                     "role": "user",
                     "content": (
+                        f"Observation:\n{json.dumps(obs)}\n\n"
                         'Return JSON only: {"type": "...", "key": "..."}'
                     ),
                 },
     return None
+def run_episode(*, env_url: str, task_id: str, seed: int, use_llm: bool) -> None:
+    """Run one episode over the OpenEnv HTTP API, sending wrapped actions and reading wrapped observations."""
     global LAST_USED
     LAST_USED = None
     MEMORY.clear()
     steps_taken = 0
     episode_score = 0.0
     success = False
+    score_from_env = False
     try:
         res = requests.post(
+            f"{env_url}/reset",
+            json={"seed": seed, "task_id": task_id},
             headers={"Content-Type": "application/json"},
             timeout=60,
         )
         res.raise_for_status()
         body = res.json()
+        obs = body.get("observation", body)
+        tid = str(obs.get("task_id", task_id))
+        log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)
         for step in range(1, 11):
+            item = select_item(obs, step)
+            action: Optional[dict] = None
+            if use_llm:
+                action = llm_action(obs)
             if action is None:
                 action = decide(item, step)
             }
             step_res = requests.post(
+                f"{env_url}/step",
+                json={"action": action},
                 headers={"Content-Type": "application/json"},
                 timeout=60,
             )
             step_res.raise_for_status()
             data = step_res.json()
+            reward = float(data["reward"] if data["reward"] is not None else 0.0)
             done = bool(data["done"])
             rewards.append(reward)
             steps_taken = step
+            inner = data.get("observation", {})
+            if inner.get("final_score") is not None:
+                episode_score = float(inner["final_score"])
                 score_from_env = True
             log_step(
                 error=None,
             )
+            obs = inner
             if done:
                 break
             success = avg_r > 0.3
         if not score_from_env and rewards:
             avg_r = sum(rewards) / len(rewards)
+            episode_score = clamp_unit_interval((avg_r + 1.0) / 2.0)
     except Exception as exc:
         success = False
         print(f"[RUN] fatal: {exc}", file=sys.stderr)
     finally:
+        episode_score = clamp_unit_interval(episode_score)
         log_end(
             success=success,
             steps=steps_taken,
         )
+def run() -> None:
+    use_llm = bool(API_KEY and API_KEY != "hf-invalid")
+    if os.getenv("RUN_ALL_TASKS", "").lower() in ("1", "true", "yes"):
+        for tid in ("easy", "medium", "hard"):
+            run_episode(
+                env_url=ENV_URL,
+                task_id=tid,
+                seed=EPISODE_SEED,
+                use_llm=use_llm,
+            )
+        return
+    run_episode(
+        env_url=ENV_URL,
+        task_id=TASK_ID,
+        seed=EPISODE_SEED,
+        use_llm=use_llm,
+    )
 if __name__ == "__main__":
     run()
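
When the environment never reports a `final_score`, `run_episode` falls back to the mean step reward mapped through `(avg_r + 1.0) / 2.0` and clamped. The sketch below isolates that arithmetic. It assumes step rewards lie roughly in [-1, 1], which the mapping implies but the diff does not state, and `clamp01` is a hypothetical stand-in for `env.grader.clamp_unit_interval`, whose body is not shown here.

```python
from typing import Sequence


def clamp01(x: float) -> float:
    """Hypothetical stand-in for env.grader.clamp_unit_interval."""
    return max(0.0, min(1.0, float(x)))


def fallback_episode_score(rewards: Sequence[float]) -> float:
    """Map the mean step reward (assumed to lie in [-1, 1]) onto [0, 1]."""
    if not rewards:
        # No steps were taken, so there is nothing to average.
        return 0.0
    avg_r = sum(rewards) / len(rewards)
    return clamp01((avg_r + 1.0) / 2.0)


print(fallback_episode_score([0.5, -0.5, 1.0]))
```

Under this mapping a mean reward of 0 lands at 0.5, so the fallback score is only comparable to the grader's `final_score` if both are read as "higher is better" on the same unit interval.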

openenv.yaml CHANGED Viewed

@@ -1,8 +1,14 @@
 name: cache_invalidation_env
 version: "1.0.0"
 description: >
-  Decision-making environment for cache invalidation under uncertainty.
-  Three difficulty levels; each task has an episode grader (final_score on done).
 tasks:
   - name: easy
@@ -10,6 +16,8 @@ tasks:
     difficulty: easy
     max_steps: 10
     grader: true
     score_range: [0.0, 1.0]
   - name: medium
@@ -17,31 +25,24 @@ tasks:
     difficulty: medium
     max_steps: 10
     grader: true
     score_range: [0.0, 1.0]
   - name: hard
-    description: "Most items and high volatility; staleness signal is noisy and costly mistakes are easier."
     difficulty: hard
     max_steps: 10
     grader: true
     score_range: [0.0, 1.0]
-actions:
-  type: object
-  properties:
-    type:
-      type: string
-    key:
-      type: string
-observations:
-  type: object
-reward:
-  type: float
 endpoints:
   reset: POST /reset
   step: POST /step
   state: GET /state
   tasks: GET /tasks

+spec_version: 1
 name: cache_invalidation_env
 version: "1.0.0"
+type: space
+runtime: fastapi
+app: server.app:app
+port: 7860
 description: >
+  Cache invalidation under uncertainty: agents choose invalidate, refresh, or keep per step
+  from noisy hit/stale observations. Three difficulty tasks (easy → hard), each with a
+  programmatic episode grader (final_score in [0,1]).
 tasks:
   - name: easy
     difficulty: easy
     max_steps: 10
     grader: true
+    grader_kind: programmatic
+    grader_callable: env.task_graders:easy_agent_grader
     score_range: [0.0, 1.0]
   - name: medium
     difficulty: medium
     max_steps: 10
     grader: true
+    grader_kind: programmatic
+    grader_callable: env.task_graders:medium_agent_grader
     score_range: [0.0, 1.0]
   - name: hard
+    description: "Most items and high volatility; noisy staleness signal and harder tradeoffs."
     difficulty: hard
     max_steps: 10
     grader: true
+    grader_kind: programmatic
+    grader_callable: env.task_graders:hard_agent_grader
     score_range: [0.0, 1.0]
 endpoints:
   reset: POST /reset
   step: POST /step
   state: GET /state
+  schema: GET /schema
+  metadata: GET /metadata
+  health: GET /health
   tasks: GET /tasks
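
Because the same task fields appear both in `openenv.yaml` and in `TASK_MANIFEST`, a small consistency check can catch drift between them. The following is a sketch only: `manifest_problems` and `REQUIRED_KEYS` are hypothetical names, the required-key set is an assumption read off the YAML entries, and the input is a plain list of dicts so no YAML parser is needed.

```python
from typing import Any, Dict, List

# Assumed minimum set of keys, taken from the task entries in openenv.yaml.
REQUIRED_KEYS = {"name", "difficulty", "max_steps", "grader", "score_range"}


def manifest_problems(tasks: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems found in task entries (empty if none)."""
    problems: List[str] = []
    for index, task in enumerate(tasks):
        label = str(task.get("name", f"task #{index}"))
        missing = REQUIRED_KEYS - set(task)
        if missing:
            problems.append(f"{label}: missing keys {sorted(missing)}")
            continue
        low, high = task["score_range"]
        if not low < high:
            problems.append(f"{label}: score_range must be increasing")
        if task["max_steps"] <= 0:
            problems.append(f"{label}: max_steps must be positive")
        # An enabled grader should name its callable as "module:function".
        if task["grader"] and ":" not in str(task.get("grader_callable", "")):
            problems.append(f"{label}: grader enabled without a grader_callable")
    return problems


easy = {
    "name": "easy",
    "difficulty": "easy",
    "max_steps": 10,
    "grader": True,
    "grader_kind": "programmatic",
    "grader_callable": "env.task_graders:easy_agent_grader",
    "score_range": [0.0, 1.0],
}
print(manifest_problems([easy]))
```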

pyproject.toml CHANGED Viewed

@@ -17,6 +17,9 @@ dependencies = [
     "python-dotenv>=1.0.0",
 ]
 [project.scripts]
 server = "server.app:main"

     "python-dotenv>=1.0.0",
 ]
+[project.optional-dependencies]
+dev = ["pytest>=8.0"]
 [project.scripts]
 server = "server.app:main"

server/app.py CHANGED Viewed

@@ -1,12 +1,59 @@
-"""OpenEnv entry: validator requires server/app.py with def main(...) and if __name__ + main()."""
 import uvicorn
-def main(host: str = "0.0.0.0", port: int = 7860):
-    from app import app as fastapi_app
-    uvicorn.run(fastapi_app, host=host, port=port)
 if __name__ == "__main__":

+"""OpenEnv FastAPI server: full HTTPEnvServer + task/grader discovery routes."""
+from __future__ import annotations
+import os
+from typing import Optional
 import uvicorn
+from openenv.core.env_server import create_fastapi_app
+from env.cache_environment import CacheInvalidationEnvironment
+from env.models import CacheAction, CacheObservation
+from env.task_graders import TASK_AGENT_GRADERS
+from env.tasks import TASK_MANIFEST, list_graders
+_singleton: CacheInvalidationEnvironment | None = None
+def _env_factory() -> CacheInvalidationEnvironment:
+    global _singleton
+    if _singleton is None:
+        _singleton = CacheInvalidationEnvironment()
+    return _singleton
+app = create_fastapi_app(
+    _env_factory,
+    CacheAction,
+    CacheObservation,
+    max_concurrent_envs=1,
+)
+@app.get(
+    "/tasks",
+    tags=["Environment Info"],
+    summary="List tasks and grader registration",
+)
+def http_list_tasks():
+    return {
+        "tasks": TASK_MANIFEST,
+        "graders": list_graders(),
+        "grader_registry": {
+            name: {
+                "enabled": True,
+                "qualified_name": f"{fn.__module__}:{fn.__name__}",
+            }
+            for name, fn in TASK_AGENT_GRADERS.items()
+        },
+    }
+def main(host: Optional[str] = None, port: Optional[int] = None) -> None:
+    host = host or os.environ.get("HOST", "0.0.0.0")
+    port = int(port or os.environ.get("PORT", "7860"))
+    uvicorn.run(app, host=host, port=port)
 if __name__ == "__main__":
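
The new `main` resolves its bind address from explicit arguments first, then from the `HOST` and `PORT` environment variables, then from the defaults `0.0.0.0` and `7860`. A sketch of that precedence as a pure function follows. `resolve_bind` is a hypothetical helper, and unlike the `or`-based code in the diff it tests `is not None`, so an explicit `port=0` is kept instead of being treated as unset.

```python
from typing import Mapping, Optional, Tuple


def resolve_bind(
    host: Optional[str],
    port: Optional[int],
    environ: Mapping[str, str],
) -> Tuple[str, int]:
    """Pick host/port: explicit argument, then environment, then default."""
    resolved_host = host if host is not None else environ.get("HOST", "0.0.0.0")
    resolved_port = port if port is not None else int(environ.get("PORT", "7860"))
    return resolved_host, int(resolved_port)


# With no arguments and an empty environment the defaults apply.
print(resolve_bind(None, None, {}))
```

Passing the environment in as a mapping keeps the function easy to test; a caller would hand it `os.environ` in real use.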

tests/conftest.py ADDED Viewed

	@@ -0,0 +1,10 @@

+import pytest
+@pytest.fixture(autouse=True)
+def reset_env_singleton():
+    import server.app as sa
+    sa._singleton = None
+    yield
+    sa._singleton = None

tests/test_phase1.py ADDED Viewed

	@@ -0,0 +1,73 @@

+"""Phase 1 gates: OpenEnv HTTP, three tasks, graders in [0,1], reproducible seed."""
+import pytest
+from fastapi.testclient import TestClient
+from env.grader import clamp_unit_interval, evaluate_episode
+from env.task_graders import TASK_AGENT_GRADERS
+from server.app import app
+@pytest.fixture
+def client():
+    return TestClient(app)
+def test_tasks_endpoint_three_graders(client):
+    r = client.get("/tasks")
+    assert r.status_code == 200
+    data = r.json()
+    assert len(data["tasks"]) >= 3
+    enabled = [t for t in data["tasks"] if t.get("grader")]
+    assert len(enabled) >= 3
+    assert len(data["grader_registry"]) >= 3
+def test_each_task_grader_returns_unit_interval():
+    history = [
+        {"action": "keep", "is_stale": False},
+        {"action": "invalidate", "is_stale": True},
+    ]
+    for name, fn in TASK_AGENT_GRADERS.items():
+        s = fn(history)
+        assert 0.0 <= s <= 1.0, (name, s)
+def test_reset_step_openenv_shape(client):
+    r = client.post("/reset", json={"seed": 123, "task_id": "medium"})
+    assert r.status_code == 200
+    body = r.json()
+    assert set(body.keys()) >= {"observation", "reward", "done"}
+    obs = body["observation"]
+    assert obs["task_id"] == "medium"
+    key = obs["items"][0]["key"]
+    s = client.post("/step", json={"action": {"type": "keep", "key": key}})
+    assert s.status_code == 200
+    assert "observation" in s.json()
+def test_reproducible_reset_seed(client):
+    a = client.post("/reset", json={"seed": 999, "task_id": "easy"}).json()["observation"]
+    b = client.post("/reset", json={"seed": 999, "task_id": "easy"}).json()["observation"]
+    assert a["items"] == b["items"]
+def test_final_score_in_range(client):
+    r = client.post("/reset", json={"seed": 0, "task_id": "easy"})
+    obs = r.json()["observation"]
+    final = None
+    for _ in range(12):
+        k = obs["items"][0]["key"]
+        d = client.post("/step", json={"action": {"type": "keep", "key": k}}).json()
+        obs = d["observation"]
+        if obs.get("final_score") is not None:
+            final = obs["final_score"]
+            break
+    assert final is not None
+    assert 0.0 <= final <= 1.0
+def test_clamp_unit_interval():
+    assert clamp_unit_interval(-1) == 0.0
+    assert clamp_unit_interval(2) == 1.0
+    assert evaluate_episode([]) == 0.0
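
The tests above post two request shapes: `/reset` takes `seed` and `task_id` at the top level, while `/step` wraps the action under an `"action"` key. A minimal sketch of builders for those bodies is given here. The function names are hypothetical, and the allowed action types are taken from the system prompt in `inference.py`; the server's own validation may differ.

```python
from typing import Any, Dict

# Action types named in the inference system prompt.
ACTION_TYPES = ("invalidate", "refresh", "keep")


def reset_payload(seed: int, task_id: str) -> Dict[str, Any]:
    """Body for POST /reset: seed and task id at the top level."""
    return {"seed": seed, "task_id": task_id}


def step_payload(action_type: str, key: str) -> Dict[str, Any]:
    """Body for POST /step: the action is wrapped under an "action" key."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    return {"action": {"type": action_type, "key": key}}


print(reset_payload(123, "medium"))
print(step_payload("keep", "item_0"))
```

Rejecting unknown action types on the client side gives a clearer error than a failed HTTP request, which matters when the action comes from an LLM reply.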

uv.lock CHANGED Viewed

@@ -234,16 +234,23 @@ dependencies = [
     { name = "uvicorn", extra = ["standard"] },
 ]
 [package.metadata]
 requires-dist = [
     { name = "fastapi", specifier = ">=0.100.0" },
     { name = "openai", specifier = ">=1.0.0" },
     { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
     { name = "pydantic", specifier = ">=2.0.0" },
     { name = "python-dotenv", specifier = ">=1.0.0" },
     { name = "requests", specifier = ">=2.28.0" },
     { name = "uvicorn", extras = ["standard"], specifier = ">=0.22.0" },
 ]
 [[package]]
 name = "cachetools"
@@ -956,6 +963,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/fa/5e/f8e9a1d23b9c20a551a8a02ea3637b4642e22c2626e3a13a9a29cdea99eb/importlib_metadata-8.7.1-py3-none-any.whl", hash = "sha256:5a1f80bf1daa489495071efbb095d75a634cf28a8bc299581244063b53176151", size = 27865, upload-time = "2025-12-21T10:00:18.329Z" },
 ]
 [[package]]
 name = "jaraco-classes"
 version = "3.4.0"
@@ -1893,6 +1909,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/63/d7/97f7e3a6abb67d8080dd406fd4df842c2be0efaf712d1c899c32a075027c/platformdirs-4.9.4-py3-none-any.whl", hash = "sha256:68a9a4619a666ea6439f2ff250c12a853cd1cbd5158d258bd824a7df6be2f868", size = 21216, upload-time = "2026-03-05T18:34:12.172Z" },
 ]
 [[package]]
 name = "py-key-value-aio"
 version = "0.4.4"
@@ -2123,6 +2148,24 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/df/80/fc9d01d5ed37ba4c42ca2b55b4339ae6e200b456be3a1aaddf4a9fa99b8c/pyperclip-1.11.0-py3-none-any.whl", hash = "sha256:299403e9ff44581cb9ba2ffeed69c7aa96a008622ad0c46cb575ca75b5b84273", size = 11063, upload-time = "2025-09-26T14:40:36.069Z" },
 ]
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"

     { name = "uvicorn", extra = ["standard"] },
 ]
+[package.optional-dependencies]
+dev = [
+    { name = "pytest" },
+]
 [package.metadata]
 requires-dist = [
     { name = "fastapi", specifier = ">=0.100.0" },
     { name = "openai", specifier = ">=1.0.0" },
     { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
     { name = "pydantic", specifier = ">=2.0.0" },
+    { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0" },
     { name = "python-dotenv", specifier = ">=1.0.0" },
     { name = "requests", specifier = ">=2.28.0" },
     { name = "uvicorn", extras = ["standard"], specifier = ">=0.22.0" },
 ]
+provides-extras = ["dev"]
 [[package]]
 name = "cachetools"
     { url = "https://files.pythonhosted.org/packages/fa/5e/f8e9a1d23b9c20a551a8a02ea3637b4642e22c2626e3a13a9a29cdea99eb/importlib_metadata-8.7.1-py3-none-any.whl", hash = "sha256:5a1f80bf1daa489495071efbb095d75a634cf28a8bc299581244063b53176151", size = 27865, upload-time = "2025-12-21T10:00:18.329Z" },
 ]
+[[package]]
+name = "iniconfig"
+version = "2.3.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
+]
 [[package]]
 name = "jaraco-classes"
 version = "3.4.0"
     { url = "https://files.pythonhosted.org/packages/63/d7/97f7e3a6abb67d8080dd406fd4df842c2be0efaf712d1c899c32a075027c/platformdirs-4.9.4-py3-none-any.whl", hash = "sha256:68a9a4619a666ea6439f2ff250c12a853cd1cbd5158d258bd824a7df6be2f868", size = 21216, upload-time = "2026-03-05T18:34:12.172Z" },
 ]
+[[package]]
+name = "pluggy"
+version = "1.6.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
+]
 [[package]]
 name = "py-key-value-aio"
 version = "0.4.4"
     { url = "https://files.pythonhosted.org/packages/df/80/fc9d01d5ed37ba4c42ca2b55b4339ae6e200b456be3a1aaddf4a9fa99b8c/pyperclip-1.11.0-py3-none-any.whl", hash = "sha256:299403e9ff44581cb9ba2ffeed69c7aa96a008622ad0c46cb575ca75b5b84273", size = 11063, upload-time = "2025-09-26T14:40:36.069Z" },
 ]
+[[package]]
+name = "pytest"
+version = "9.0.3"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "colorama", marker = "sys_platform == 'win32'" },
+    { name = "exceptiongroup", marker = "python_full_version < '3.11'" },
+    { name = "iniconfig" },
+    { name = "packaging" },
+    { name = "pluggy" },
+    { name = "pygments" },
+    { name = "tomli", marker = "python_full_version < '3.11'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/7d/0d/549bd94f1a0a402dc8cf64563a117c0f3765662e2e668477624baeec44d5/pytest-9.0.3.tar.gz", hash = "sha256:b86ada508af81d19edeb213c681b1d48246c1a91d304c6c81a427674c17eb91c", size = 1572165, upload-time = "2026-04-07T17:16:18.027Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d4/24/a372aaf5c9b7208e7112038812994107bc65a84cd00e0354a88c2c77a617/pytest-9.0.3-py3-none-any.whl", hash = "sha256:2c5efc453d45394fdd706ade797c0a81091eccd1d6e4bccfcd476e2b8e0ab5d9", size = 375249, upload-time = "2026-04-07T17:16:16.13Z" },
+]
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"