Parv Pareek committed on
Commit
e75c8ce
·
1 Parent(s): 351158b
README.md CHANGED
@@ -15,54 +15,139 @@ pinned: false
15
 
16
  **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
17
 
18
- **Our approach:** We simulate several cache **items** per episode. Each item has hidden staleness dynamics (TTL, update rate). The API only exposes **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks an action **per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **grader** produces a **final score in [0, 1]** from correctness, wasted invalidations, and stability.
19
 
20
- **Tasks:** Three difficulties — **easy**, **medium**, **hard** — differ by number of items and how volatile hidden state is, so the same policy can be compared across noise levels.
21
 
22
  ---
23
 
24
- ## API (OpenEnv-style HTTP)
25
 
26
- | Method | Path | Role |
27
- |--------|------|------|
28
- | POST | `/reset` | New episode; returns `state` and `task_id` |
29
- | POST | `/step` | JSON body `{"type":"keep\|refresh\|invalidate","key":"item_0"}`; returns `state`, `reward`, `done`, optional `final_score` when episode ends |
30
- | GET | `/state` | Current observation |
 
31
 
32
- **Deployed Space (example):** `https://parvpareek-cache-env.hf.space` ping with:
33
 
34
  ```bash
35
  curl -s -o /dev/null -w '%{http_code}\n' -X POST \
36
  -H 'Content-Type: application/json' -d '{}' \
37
- 'https://parvpareek-cache-env.hf.space/reset'
38
  ```
39
 
40
  Expect `200`.
41
 
42
- **Local run:** `pip install -r requirements.txt` then `uvicorn app:app --host 0.0.0.0 --port 7860` (or use the Dockerfile).
43
 
44
  ---
45
 
46
  ## Baseline inference (`inference.py`)
47
 
48
- - Uses the **OpenAI Python client** with **`API_BASE_URL`**, **`MODEL_NAME`**, and **`HF_TOKEN`** (set as environment variables or in a local `.env` loaded by `inference.py`; never commit tokens).
49
- - Talks to the **Space URL** above (override with `ENV_URL` if needed).
50
- - Prints exactly **`[START]`**, one **`[STEP]`** per env step, and **`[END]`** with `score` and `rewards` as required by the challenge spec.
51
-
52
- Run:
53
 
54
  ```bash
55
- export API_BASE_URL='https://router.huggingface.co/v1'
56
- export MODEL_NAME='<model your account can call>'
57
- export HF_TOKEN='hf_...'
58
  python inference.py
 
 
 
59
  ```
60
 
61
  ---
62
 
63
- ## Validation (pre-submission)
 
 
 
 
64
 
65
- From the repo root:
 
 
 
 
66
 
67
  ```bash
68
  openenv validate
@@ -72,35 +157,31 @@ docker build .
72
 
73
  ---
74
 
75
- ## Repository layout (high level)
76
 
77
  | Path | Purpose |
78
  |------|---------|
79
- | `app.py` | FastAPI app: `/reset`, `/step`, `/state` |
80
- | `env/` | Environment logic, tasks, grading, generation |
81
- | `openenv.yaml` | OpenEnv metadata |
82
- | `inference.py` | Baseline agent + structured logs |
83
- | `Dockerfile` | Space / CI image |
84
- | `pyproject.toml`, `uv.lock`, `server/app.py` | `openenv validate` / multi-mode layout |
 
 
 
 
 
85
 
86
  ---
87
 
88
- ## Scoring (short)
89
 
90
- - **Per-step reward:** Shaped table (e.g. invalidate when stale is good; invalidate when fresh is penalized). Values can be negative in the middle of an episode.
91
- - **Episode `final_score` (when `done`):** Normalized grader in **[0, 1]** combining decision quality, unnecessary invalidations, and oscillation.
92
 
93
  ---
94
 
95
- ## Summary
96
-
97
- | Criterion | Status |
98
- |-----------|--------|
99
- | Real-world task (not a toy game) | Cache invalidation under uncertainty |
100
- | `reset` / `step` / `state` | Implemented |
101
- | `openenv.yaml` | Present |
102
- | 3 tasks + grader | `easy` / `medium` / `hard` |
103
- | Meaningful rewards | Dense step reward + episode score in [0, 1] |
104
- | Baseline | `inference.py` + OpenAI client + stdout format |
105
 
106
- If anything fails in automated checks, compare your **Space app URL** (`*.hf.space`) and **pushed commit** to what you submit.
 
15
 
16
  **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
17
 
18
+ **Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
19
 
20
+ **Tasks:** **easy**, **medium**, **hard**, in order of more items and higher volatility; each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.
21
 
22
  ---
23
 
24
+ ## OpenEnv spec compliance
25
 
26
+ - **Typed models:** `env/models.py`: `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
27
+ - **Environment:** `env/cache_environment.py` — `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`.
28
+ - **HTTP server:** `server/app.py`: `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
29
+ - **Manifest:** `openenv.yaml`: `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
30
+ - **Client (WebSocket):** `env/client.py`: `CacheInvalidationEnvClient` for typed `EnvClient` usage.
31
+ - **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.
32
 
33
+ Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).
34
+
35
+ ---
36
+
37
+ ## Action & observation
38
+
39
+ **Action (POST `/step` body, OpenEnv wrapped form):**
40
+
41
+ ```json
42
+ {
43
+ "action": {
44
+ "type": "invalidate",
45
+ "key": "item_0"
46
+ }
47
+ }
48
+ ```
49
+
50
+ `type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
51
+
52
+ **Reset (POST `/reset`):**
53
+
54
+ ```json
55
+ {
56
+ "seed": 42,
57
+ "task_id": "easy"
58
+ }
59
+ ```
60
+
61
+ Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
62
+
63
+ **Response shape (reset & step):**
64
+
65
+ ```json
66
+ {
67
+ "observation": {
68
+ "items": [...],
69
+ "step": 0,
70
+ "task_id": "easy",
71
+ "final_score": null,
72
+ "done": false
73
+ },
74
+ "reward": 0.0,
75
+ "done": false
76
+ }
77
+ ```
78
+
79
+ When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
80
+
81
+ ---
82
+
83
+ ## Tasks and graders
84
+
85
+ - **Registry:** `env/task_graders.py` — `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
86
+ - **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
87
+ - **Episode grader:** `env/grader.py` — `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
88
+
89
+ ---
90
+
91
+ ## Setup & run
92
+
93
+ **Install (dev):**
94
+
95
+ ```bash
96
+ uv sync --extra dev
97
+ ```
98
+
99
+ **Local server:**
100
+
101
+ ```bash
102
+ uv run server
103
+ # or
104
+ uvicorn app:app --host 0.0.0.0 --port 7860
105
+ ```
106
+
107
+ **Health check:**
108
 
109
  ```bash
110
  curl -s -o /dev/null -w '%{http_code}\n' -X POST \
111
  -H 'Content-Type: application/json' -d '{}' \
112
+ 'http://127.0.0.1:7860/reset'
113
  ```
114
 
115
  Expect `200`.
116
 
117
+ **Docker:** `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port **7860**).
118
 
119
  ---
120
 
121
  ## Baseline inference (`inference.py`)
122
 
123
+ - Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
124
+ - **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
125
+ - **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
126
+ - Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).
 
127
 
128
  ```bash
129
+ export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space
130
+ export EPISODE_SEED=42
131
+ export TASK_ID=easy
132
  python inference.py
133
+
134
+ # Phase-1 style: one process, three tasks
135
+ RUN_ALL_TASKS=1 python inference.py
136
  ```
137
 
138
  ---
139
 
140
+ ## Tests (Phase 1 checks)
141
+
142
+ ```bash
143
+ uv run pytest tests/ -q
144
+ ```
145
 
146
+ Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.
147
+
148
+ ---
149
+
150
+ ## Validation (pre-submission)
151
 
152
  ```bash
153
  openenv validate
 
157
 
158
  ---
159
 
160
+ ## Repository layout
161
 
162
  | Path | Purpose |
163
  |------|---------|
164
+ | `env/models.py` | Typed Action / Observation / State |
165
+ | `env/cache_environment.py` | `Environment` implementation |
166
+ | `env/grader.py` | Step rewards + episode `evaluate_episode` |
167
+ | `env/task_graders.py` | **Three named agent graders** (registry) |
168
+ | `env/tasks.py` | Task configs + `TASK_MANIFEST` |
169
+ | `env/client.py` | Typed WebSocket `EnvClient` |
170
+ | `server/app.py` | `create_fastapi_app` + `/tasks` |
171
+ | `app.py` | Uvicorn entry shim |
172
+ | `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
173
+ | `openenv.yaml` | Full OpenEnv manifest |
174
+ | `tests/` | Phase 1 pytest |
175
 
176
  ---
177
 
178
+ ## Scoring
179
 
180
+ - **Per-step `reward`:** Shaped (can be negative mid-episode).
181
+ - **`final_score`:** In **[0.0, 1.0]** when `done`; combines correctness, unnecessary invalidations, and action stability.
182
 
183
  ---
184
 
185
+ ## Resource notes
 
 
 
 
 
 
 
 
 
186
 
187
+ Inference and the env server are lightweight (short episodes, small JSON payloads) and fit in **2 vCPU / 8 GiB**. A `RUN_ALL_TASKS` run is bounded: a fixed 10 steps per episode × 3 tasks.
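
The wire format described in the new README (wrapped `action` on `/step`, `observation` / `reward` / `done` in responses) can be exercised with a short client loop. The following is a minimal sketch, not the repository's `inference.py`: `run_episode`, `always_keep`, and the injected `post` callable are hypothetical names, and the transport is passed in so the loop can be tried against a stub. With `requests`, one could pass something like `lambda path, body: requests.post(ENV_URL + path, json=body).json()`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Json = Dict[str, Any]
Post = Callable[[str, Json], Json]  # (path, body) -> decoded JSON response


def always_keep(observation: Json) -> Json:
    """Hypothetical trivial policy: keep the first item in the observation."""
    return {"type": "keep", "key": observation["items"][0]["key"]}


def run_episode(
    post: Post,
    policy: Callable[[Json], Json] = always_keep,
    task_id: str = "easy",
    seed: int = 42,
    max_steps: int = 10,
) -> Tuple[List[float], Optional[float]]:
    """Reset, then step until `done`; return step rewards and final_score."""
    result = post("/reset", {"seed": seed, "task_id": task_id})
    rewards: List[float] = []
    for _ in range(max_steps):
        if result.get("done"):
            break
        action = policy(result["observation"])
        # OpenEnv wrapped form: the action sits under the "action" key.
        result = post("/step", {"action": action})
        # Reward may be null (e.g. on reset); treat that as 0.0 here.
        rewards.append(float(result.get("reward") or 0.0))
    return rewards, result["observation"].get("final_score")
```

Injecting `post` keeps the loop independent of any HTTP library, which makes it easy to check the request shapes without a running server.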
app.py CHANGED
@@ -1,39 +1,5 @@
1
- from fastapi import Body, FastAPI
2
- from pydantic import BaseModel, ConfigDict
3
- from env.core import CacheEnv
4
- from env.tasks import TASK_MANIFEST
5
 
6
- app = FastAPI()
7
- env = CacheEnv()
8
 
9
-
10
- class ResetBody(BaseModel):
11
- model_config = ConfigDict(extra="ignore")
12
- task_id: str | None = None
13
- task_name: str | None = None
14
-
15
-
16
- @app.post("/reset")
17
- def reset(body: ResetBody = Body(default_factory=ResetBody)):
18
- task_key = body.task_id or body.task_name
19
- state = env.reset(task_id=task_key)
20
- return {
21
- "state": state,
22
- "task_id": state.get("task_id"),
23
- }
24
-
25
-
26
- @app.get("/tasks")
27
- def list_tasks():
28
- """Hub validators use this to discover tasks that expose episode grading (final_score)."""
29
- return {"tasks": TASK_MANIFEST}
30
-
31
-
32
- @app.post("/step")
33
- def step(action: dict):
34
- return env.step(action)
35
-
36
-
37
- @app.get("/state")
38
- def state():
39
- return env.get_state()
 
1
+ """Shim for `uvicorn app:app` (Docker / local one-liners)."""
 
 
 
2
 
3
+ from server.app import app
 
4
 
5
+ __all__ = ["app"]
cache_invalidation_env.egg-info/PKG-INFO CHANGED
@@ -10,3 +10,5 @@ Requires-Dist: pydantic>=2.0.0
10
  Requires-Dist: requests>=2.28.0
11
  Requires-Dist: openai>=1.0.0
12
  Requires-Dist: python-dotenv>=1.0.0
 
 
 
10
  Requires-Dist: requests>=2.28.0
11
  Requires-Dist: openai>=1.0.0
12
  Requires-Dist: python-dotenv>=1.0.0
13
+ Provides-Extra: dev
14
+ Requires-Dist: pytest>=8.0; extra == "dev"
cache_invalidation_env.egg-info/SOURCES.txt CHANGED
@@ -7,10 +7,13 @@ cache_invalidation_env.egg-info/entry_points.txt
7
  cache_invalidation_env.egg-info/requires.txt
8
  cache_invalidation_env.egg-info/top_level.txt
9
  env/__init__.py
10
- env/core.py
 
11
  env/generator.py
12
  env/grader.py
13
  env/models.py
 
14
  env/tasks.py
15
  server/__init__.py
16
- server/app.py
 
 
7
  cache_invalidation_env.egg-info/requires.txt
8
  cache_invalidation_env.egg-info/top_level.txt
9
  env/__init__.py
10
+ env/cache_environment.py
11
+ env/client.py
12
  env/generator.py
13
  env/grader.py
14
  env/models.py
15
+ env/task_graders.py
16
  env/tasks.py
17
  server/__init__.py
18
+ server/app.py
19
+ tests/test_phase1.py
cache_invalidation_env.egg-info/requires.txt CHANGED
@@ -5,3 +5,6 @@ pydantic>=2.0.0
5
  requests>=2.28.0
6
  openai>=1.0.0
7
  python-dotenv>=1.0.0
 
 
 
 
5
  requests>=2.28.0
6
  openai>=1.0.0
7
  python-dotenv>=1.0.0
8
+
9
+ [dev]
10
+ pytest>=8.0
env/__init__.py CHANGED
@@ -1 +1,13 @@
1
- # Cache invalidation environment package
1
+ """Cache invalidation OpenEnv package."""
2
+
3
+ from env.cache_environment import CacheInvalidationEnvironment
4
+ from env.client import CacheInvalidationEnvClient
5
+ from env.models import CacheAction, CacheObservation, CacheState
6
+
7
+ __all__ = [
8
+ "CacheAction",
9
+ "CacheObservation",
10
+ "CacheState",
11
+ "CacheInvalidationEnvironment",
12
+ "CacheInvalidationEnvClient",
13
+ ]
env/cache_environment.py ADDED
@@ -0,0 +1,156 @@
1
+ """OpenEnv Environment: cache invalidation under partial observability."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import random
6
+ from typing import Any, Optional
7
+
8
+ from openenv.core.env_server import Environment
9
+ from openenv.core.env_server.types import EnvironmentMetadata
10
+
11
+ from env.generator import generate_env
12
+ from env.grader import compute_step_reward, evaluate_episode
13
+ from env.models import CacheAction, CacheItem, CacheObservation, CacheState
14
+ from env.tasks import sample_task
15
+
16
+
17
+ class CacheInvalidationEnvironment(Environment[CacheAction, CacheObservation, CacheState]):
18
+ """Stateful cache control: invalidate, refresh, or keep per step (one key)."""
19
+
20
+ SUPPORTS_CONCURRENT_SESSIONS = False
21
+
22
+ def __init__(self) -> None:
23
+ super().__init__()
24
+ self._rng: random.Random | type[random] = random
25
+ self.history: list[dict[str, Any]] = []
26
+ self.task_id: str = "easy"
27
+ self.hidden: list[dict[str, Any]] = []
28
+ self.current_time: int = 0
29
+ self._items: list[dict[str, Any]] = []
30
+ self._step: int = 0
31
+
32
+ def reset(
33
+ self,
34
+ seed: Optional[int] = None,
35
+ episode_id: Optional[str] = None,
36
+ task_id: Optional[str] = None,
37
+ task_name: Optional[str] = None,
38
+ **kwargs: Any,
39
+ ) -> CacheObservation:
40
+ tid = task_id or task_name or kwargs.get("task_id") or kwargs.get("task_name")
41
+ self._reset_rubric()
42
+
43
+ if seed is not None:
44
+ self._rng = random.Random(int(seed))
45
+ else:
46
+ self._rng = random
47
+
48
+ self.history = []
49
+ if tid in ("easy", "medium", "hard"):
50
+ self.task_id = tid
51
+ else:
52
+ self.task_id = sample_task(self._rng)
53
+
54
+ items, hidden, current_time = generate_env(self.task_id, rng=self._rng)
55
+ self._items = items
56
+ self.hidden = hidden
57
+ self.current_time = current_time
58
+ self._step = 0
59
+
60
+ return self._observation(
61
+ reward=None,
62
+ done=False,
63
+ final_score=None,
64
+ )
65
+
66
+ def step(
67
+ self,
68
+ action: CacheAction,
69
+ timeout_s: Optional[float] = None,
70
+ **kwargs: Any,
71
+ ) -> CacheObservation:
72
+ key = action.key
73
+ action_type = action.type
74
+
75
+ item_index = next(
76
+ (i for i, x in enumerate(self._items) if x["key"] == key), None
77
+ )
78
+
79
+ if item_index is None:
80
+ return self._observation(reward=-1.0, done=True, final_score=None)
81
+
82
+ hidden = self.hidden[item_index]
83
+ item = self._items[item_index]
84
+
85
+ age = self.current_time - hidden["last_update"]
86
+ is_stale = age > hidden["base_ttl"] or self._rng.random() < hidden["update_freq"]
87
+
88
+ self.history.append({"action": action_type, "is_stale": is_stale})
89
+
90
+ reward = compute_step_reward(action_type, is_stale)
91
+
92
+ if action_type == "invalidate":
93
+ hidden["last_update"] = self.current_time
94
+ item["age"] = 0
95
+
96
+ elif action_type == "refresh":
97
+ hidden["last_update"] = self.current_time - 1
98
+ item["age"] = 1
99
+
100
+ elif action_type == "keep":
101
+ item["age"] += 1
102
+
103
+ item["last_result"] = (
104
+ "stale"
105
+ if is_stale and self._rng.random() < 0.7
106
+ else "hit"
107
+ if not is_stale or self._rng.random() < 0.9
108
+ else "stale"
109
+ )
110
+
111
+ self.current_time += 1
112
+ self._step += 1
113
+
114
+ done = self._step >= 10
115
+ final_score = evaluate_episode(self.history) if done else None
116
+
117
+ return self._observation(
118
+ reward=reward,
119
+ done=done,
120
+ final_score=final_score,
121
+ )
122
+
123
+ @property
124
+ def state(self) -> CacheState:
125
+ return CacheState(
126
+ episode_id=None,
127
+ step_count=self._step,
128
+ task_id=self.task_id,
129
+ items=[CacheItem.model_validate(x) for x in self._items],
130
+ )
131
+
132
+ def get_metadata(self) -> EnvironmentMetadata:
133
+ return EnvironmentMetadata(
134
+ name="cache_invalidation_env",
135
+ description=(
136
+ "Cache invalidation under uncertainty: choose invalidate, refresh, or keep "
137
+ "per step from noisy hit/stale observations."
138
+ ),
139
+ version="1.0.0",
140
+ )
141
+
142
+ def _observation(
143
+ self,
144
+ *,
145
+ reward: float | None,
146
+ done: bool,
147
+ final_score: float | None,
148
+ ) -> CacheObservation:
149
+ return CacheObservation(
150
+ done=done,
151
+ reward=reward,
152
+ items=[CacheItem.model_validate(x) for x in self._items],
153
+ step=self._step,
154
+ task_id=self.task_id,
155
+ final_score=final_score,
156
+ )
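
The chained conditional that sets `last_result` in `step` is easy to misread, so here is a small sketch that isolates it and measures what it does. `observe` copies the expression from the diff above; `stale_rate` is a hypothetical helper added only for illustration.

```python
import random


def observe(is_stale: bool, rng: random.Random) -> str:
    """Same expression as the `last_result` assignment in `step`."""
    return (
        "stale"
        if is_stale and rng.random() < 0.7
        else "hit"
        if not is_stale or rng.random() < 0.9
        else "stale"
    )


def stale_rate(is_stale: bool, trials: int = 20_000, seed: int = 0) -> float:
    """Empirical fraction of "stale" observations for a fixed true state."""
    rng = random.Random(seed)
    stale = sum(observe(is_stale, rng) == "stale" for _ in range(trials))
    return stale / trials
```

Read branch by branch, a truly stale item is reported as "stale" with probability 0.7 + 0.3 × 0.1 = 0.73, while a fresh item is always reported as "hit" because both conditions short-circuit. The observation noise in this expression is therefore one-sided: a "stale" reading implies the item was stale at that step.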
env/client.py ADDED
@@ -0,0 +1,30 @@
1
+ """Typed WebSocket client for CacheInvalidationEnvironment."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any, Dict
6
+
7
+ from openenv.core.client_types import StepResult
8
+ from openenv.core.env_client import EnvClient
9
+
10
+ from env.models import CacheAction, CacheObservation, CacheState
11
+
12
+
13
+ class CacheInvalidationEnvClient(EnvClient[CacheAction, CacheObservation, CacheState]):
14
+ def _step_payload(self, action: CacheAction | Dict[str, Any]) -> Dict[str, Any]:
15
+ if isinstance(action, CacheAction):
16
+ return action.model_dump()
17
+ return CacheAction.model_validate(action).model_dump()
18
+
19
+ def _parse_result(self, payload: Dict[str, Any]) -> StepResult[CacheObservation]:
20
+ obs_inner = payload.get("observation", {})
21
+ return StepResult(
22
+ observation=CacheObservation.model_validate(
23
+ {**obs_inner, "reward": payload.get("reward"), "done": payload.get("done", False)}
24
+ ),
25
+ reward=payload.get("reward"),
26
+ done=payload.get("done", False),
27
+ )
28
+
29
+ def _parse_state(self, payload: Dict[str, Any]) -> CacheState:
30
+ return CacheState.model_validate(payload)
env/core.py DELETED
@@ -1,91 +0,0 @@
1
- import random
2
- from env.generator import generate_env
3
- from env.grader import compute_step_reward
4
- from env.tasks import sample_task
5
- class CacheEnv:
6
-
7
- def __init__(self):
8
- self.reset()
9
-
10
- def reset(self, task_id=None):
11
- self.history = []
12
- if task_id in ("easy", "medium", "hard"):
13
- self.task_id = task_id
14
- else:
15
- self.task_id = sample_task()
16
- items, hidden, current_time = generate_env(self.task_id)
17
-
18
- self.state = {
19
- "items": items,
20
- "step": 0,
21
- "task_id": self.task_id
22
- }
23
-
24
- self.hidden = hidden
25
- self.current_time = current_time
26
- self.total_reward = 0
27
-
28
- return self.state
29
-
30
- def step(self, action):
31
- key = action.get("key")
32
- action_type = action.get("type")
33
-
34
- item_index = next((i for i, x in enumerate(self.state["items"]) if x["key"] == key), None)
35
-
36
- if item_index is None:
37
- return {"state": self.state, "reward": -1.0, "done": True}
38
-
39
- hidden = self.hidden[item_index]
40
- item = self.state["items"][item_index]
41
-
42
- # hidden staleness
43
- age = self.current_time - hidden["last_update"]
44
- is_stale = age > hidden["base_ttl"] or random.random() < hidden["update_freq"]
45
-
46
- self.history.append({
47
- "action": action_type,
48
- "is_stale": is_stale
49
- })
50
-
51
- reward = compute_step_reward(action_type, is_stale)
52
- self.total_reward += reward
53
-
54
- # apply action
55
- if action_type == "invalidate":
56
- hidden["last_update"] = self.current_time
57
- item["age"] = 0
58
-
59
- elif action_type == "refresh":
60
- hidden["last_update"] = self.current_time - 1
61
- item["age"] = 1
62
-
63
- elif action_type == "keep":
64
- item["age"] += 1
65
-
66
- # noisy observation
67
- item["last_result"] = (
68
- "stale" if is_stale and random.random() < 0.7
69
- else "hit" if not is_stale or random.random() < 0.9
70
- else "stale"
71
- )
72
-
73
- self.current_time += 1
74
- self.state["step"] += 1
75
-
76
- done = self.state["step"] >= 10
77
- from env.grader import evaluate_episode
78
-
79
- if done:
80
- final_score = evaluate_episode(self.history)
81
- else:
82
- final_score = None
83
- return {
84
- "state": self.state,
85
- "reward": reward,
86
- "done": done,
87
- "task_id": self.task_id,
88
- "final_score": final_score
89
- }
90
- def get_state(self):
91
- return self.state
env/generator.py CHANGED
@@ -1,7 +1,10 @@
1
  import random
2
  from env.tasks import get_task
3
 
4
- def generate_env(task_id):
 
 
 
5
  config = get_task(task_id)
6
 
7
  state_items = []
@@ -10,27 +13,31 @@ def generate_env(task_id):
10
  current_time = 0
11
 
12
  for i in range(config["num_items"]):
13
- base_ttl = random.randint(3, 8)
14
- update_freq = random.uniform(0.1, config["volatility"])
15
- last_update = random.randint(0, 3)
16
 
17
  age = current_time - last_update
18
 
19
- is_stale = age > base_ttl or random.random() < update_freq
20
 
21
- last_result = "stale" if is_stale and random.random() < 0.7 else "hit"
22
 
23
- state_items.append({
24
- "key": f"item_{i}",
25
- "age": max(age, 0),
26
- "access_count": random.randint(1, 20),
27
- "last_result": last_result
28
- })
 
 
29
 
30
- hidden_items.append({
31
- "base_ttl": base_ttl,
32
- "update_freq": update_freq,
33
- "last_update": last_update
34
- })
 
 
35
 
36
- return state_items, hidden_items, current_time
 
1
  import random
2
  from env.tasks import get_task
3
 
4
+
5
+ def generate_env(task_id, rng=None):
6
+ """Build initial items and hidden dynamics. Use *rng* for reproducible episodes."""
7
+ r = rng if rng is not None else random
8
  config = get_task(task_id)
9
 
10
  state_items = []
 
13
  current_time = 0
14
 
15
  for i in range(config["num_items"]):
16
+ base_ttl = r.randint(3, 8)
17
+ update_freq = r.uniform(0.1, config["volatility"])
18
+ last_update = r.randint(0, 3)
19
 
20
  age = current_time - last_update
21
 
22
+ is_stale = age > base_ttl or r.random() < update_freq
23
 
24
+ last_result = "stale" if is_stale and r.random() < 0.7 else "hit"
25
 
26
+ state_items.append(
27
+ {
28
+ "key": f"item_{i}",
29
+ "age": max(age, 0),
30
+ "access_count": r.randint(1, 20),
31
+ "last_result": last_result,
32
+ }
33
+ )
34
 
35
+ hidden_items.append(
36
+ {
37
+ "base_ttl": base_ttl,
38
+ "update_freq": update_freq,
39
+ "last_update": last_update,
40
+ }
41
+ )
42
 
43
+ return state_items, hidden_items, current_time
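
Threading `rng` through `generate_env` is what makes `seed` on `/reset` reproducible. The sketch below illustrates that property in isolation. It mirrors the sampling order shown in the diff but is a stand-alone approximation: `generate_items` is a hypothetical name, and the task config is passed in directly (3 items, volatility 0.3, as in the fallback branch of `get_task`) instead of being looked up.

```python
import random
from typing import Any, Dict, List, Tuple

Items = List[Dict[str, Any]]


def generate_items(
    num_items: int, volatility: float, rng: random.Random
) -> Tuple[Items, Items]:
    """Sample visible items and hidden dynamics in the order used by generate_env."""
    visible: Items = []
    hidden: Items = []
    current_time = 0
    for i in range(num_items):
        base_ttl = rng.randint(3, 8)
        update_freq = rng.uniform(0.1, volatility)
        last_update = rng.randint(0, 3)
        age = current_time - last_update
        is_stale = age > base_ttl or rng.random() < update_freq
        last_result = "stale" if is_stale and rng.random() < 0.7 else "hit"
        visible.append(
            {
                "key": f"item_{i}",
                "age": max(age, 0),
                "access_count": rng.randint(1, 20),
                "last_result": last_result,
            }
        )
        hidden.append(
            {"base_ttl": base_ttl, "update_freq": update_freq, "last_update": last_update}
        )
    return visible, hidden


# Two generators seeded identically yield identical episodes.
first = generate_items(3, 0.3, random.Random(42))
second = generate_items(3, 0.3, random.Random(42))
```

Because every draw goes through the same `random.Random` instance, the full sequence of draws (and hence the episode) is fixed by the seed; a stray call to the module-level `random` functions would break that guarantee.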
env/grader.py CHANGED
@@ -1,9 +1,6 @@
1
- # Submission validators require final scores strictly in (0, 1), not at the endpoints.
2
- _SCORE_EPS = 1e-4
3
-
4
-
5
- def clamp_strict_unit_interval(x: float) -> float:
6
- return float(min(1.0 - _SCORE_EPS, max(_SCORE_EPS, x)))
7
 
8
 
9
  def compute_step_reward(action_type, is_stale):
@@ -20,11 +17,10 @@ def compute_step_reward(action_type, is_stale):
20
 
21
  return reward
22
 
 
23
  def normalize_episode_score(total_reward, max_steps=10):
24
- # expected max ≈ 1.0 per step
25
  score = total_reward / max_steps
26
- return clamp_strict_unit_interval(max(0.0, min(1.0, score)))
27
-
28
 
29
 
30
  def evaluate_episode(history):
@@ -35,11 +31,10 @@ def evaluate_episode(history):
35
  "is_stale": bool
36
  }
37
  """
38
-
39
  total_steps = len(history)
40
 
41
  if total_steps == 0:
42
- return clamp_strict_unit_interval(0.0)
43
 
44
  correct_decisions = 0
45
  unnecessary_invalidations = 0
@@ -51,33 +46,23 @@ def evaluate_episode(history):
51
  action = step["action"]
52
  is_stale = step["is_stale"]
53
 
54
- # correctness (freshness proxy)
55
- if (is_stale and action in ["invalidate", "refresh"]) or \
56
- (not is_stale and action == "keep"):
57
  correct_decisions += 1
58
 
59
- # ❌ unnecessary invalidation
60
  if action == "invalidate" and not is_stale:
61
  unnecessary_invalidations += 1
62
 
63
- # ❌ oscillation (flip behavior)
64
  if last_action and last_action != action:
65
  oscillations += 1
66
 
67
  last_action = action
68
 
69
- # ---- normalize metrics ----
70
  freshness = correct_decisions / total_steps
71
-
72
  efficiency = 1 - (unnecessary_invalidations / total_steps)
73
-
74
  stability = 1 - (oscillations / total_steps)
75
 
76
- # ---- weighted score ----
77
- score = (
78
- 0.5 * freshness +
79
- 0.3 * efficiency +
80
- 0.2 * stability
81
- )
82
 
83
- return clamp_strict_unit_interval(max(0.0, min(1.0, score)))
 
1
+ def clamp_unit_interval(x: float) -> float:
2
+ """Clamp to [0.0, 1.0] (Phase 1 / rubric)."""
3
+ return max(0.0, min(1.0, float(x)))
 
 
 
4
 
5
 
6
  def compute_step_reward(action_type, is_stale):
 
17
 
18
  return reward
19
 
20
+
21
  def normalize_episode_score(total_reward, max_steps=10):
 
22
  score = total_reward / max_steps
23
+ return clamp_unit_interval(score)
 
24
 
25
 
26
  def evaluate_episode(history):
 
31
  "is_stale": bool
32
  }
33
  """
 
34
  total_steps = len(history)
35
 
36
  if total_steps == 0:
37
+ return clamp_unit_interval(0.0)
38
 
39
  correct_decisions = 0
40
  unnecessary_invalidations = 0
 
46
  action = step["action"]
47
  is_stale = step["is_stale"]
48
 
49
+ if (is_stale and action in ["invalidate", "refresh"]) or (
50
+ not is_stale and action == "keep"
51
+ ):
52
  correct_decisions += 1
53
 
 
54
  if action == "invalidate" and not is_stale:
55
  unnecessary_invalidations += 1
56
 
 
57
  if last_action and last_action != action:
58
  oscillations += 1
59
 
60
  last_action = action
61
 
 
62
  freshness = correct_decisions / total_steps
 
63
  efficiency = 1 - (unnecessary_invalidations / total_steps)
 
64
  stability = 1 - (oscillations / total_steps)
65
 
66
+ score = 0.5 * freshness + 0.3 * efficiency + 0.2 * stability
 
 
 
 
 
67
 
68
+ return clamp_unit_interval(score)
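
To make the rubric concrete, here is a worked example of the weighted score on a four-step history. `score_episode` is a hypothetical stand-alone restatement of `evaluate_episode` as it appears in this diff; the counter initialisation sits in hunk lines not shown, so starting `last_action` at `None` is an assumption.

```python
from typing import Dict, List, Optional, Union

Step = Dict[str, Union[str, bool]]


def score_episode(history: List[Step]) -> float:
    """0.5 * freshness + 0.3 * efficiency + 0.2 * stability, clamped to [0, 1]."""
    total = len(history)
    if total == 0:
        return 0.0
    correct = unnecessary = oscillations = 0
    last_action: Optional[str] = None  # assumed initial value
    for step in history:
        action, is_stale = step["action"], step["is_stale"]
        # Correct: act on stale items, leave fresh ones alone.
        if (is_stale and action in ("invalidate", "refresh")) or (
            not is_stale and action == "keep"
        ):
            correct += 1
        if action == "invalidate" and not is_stale:
            unnecessary += 1
        if last_action and last_action != action:
            oscillations += 1
        last_action = action
    freshness = correct / total
    efficiency = 1 - unnecessary / total
    stability = 1 - oscillations / total
    score = 0.5 * freshness + 0.3 * efficiency + 0.2 * stability
    return max(0.0, min(1.0, score))


history = [
    {"action": "invalidate", "is_stale": True},   # correct
    {"action": "keep", "is_stale": False},        # correct, action changed
    {"action": "invalidate", "is_stale": False},  # unnecessary, action changed
    {"action": "keep", "is_stale": False},        # correct, action changed
]
example_score = score_episode(history)
```

For this history, freshness is 3/4, efficiency is 3/4, and stability is 1/4, so the score is 0.5 × 0.75 + 0.3 × 0.75 + 0.2 × 0.25 = 0.65.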
env/models.py CHANGED
@@ -1,16 +1,43 @@
1
- from pydantic import BaseModel
2
- from typing import List
 
 
 
 
 
 
 
3
 
4
  class CacheItem(BaseModel):
 
 
5
  key: str
6
- age: int
7
- access_count: int
8
- last_result: str # "hit" or "stale"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9
 
10
- class State(BaseModel):
11
- items: List[CacheItem]
12
- step: int
13
 
14
- class Action(BaseModel):
15
- type: str # invalidate | keep | refresh
16
- key: str
 
1
+ """Typed OpenEnv contracts: Action, Observation, State."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Literal
6
+
7
+ from openenv.core.env_server import Action, Observation, State
8
+ from pydantic import BaseModel, ConfigDict, Field
9
+
10
 
11
  class CacheItem(BaseModel):
12
+ model_config = ConfigDict(extra="allow")
13
+
14
  key: str
15
+ age: int = Field(ge=0)
16
+ access_count: int = Field(ge=0)
17
+ last_result: str
18
+
19
+
20
+ class CacheAction(Action):
21
+ """Per-step decision for one cache key."""
22
+
23
+ type: Literal["invalidate", "refresh", "keep"]
24
+ key: str
25
+
26
+
27
+ class CacheObservation(Observation):
28
+ """What the agent sees (no hidden TTL / true staleness)."""
29
+
30
+ items: list[CacheItem] = Field(default_factory=list)
31
+ step: int = Field(default=0, ge=0)
32
+ task_id: str = ""
33
+ final_score: float | None = Field(
34
+ default=None,
35
+ description="Episode grader output in [0,1] when done=True; else None.",
36
+ )
37
+
38
 
39
+ class CacheState(State):
40
+ """Server-visible state (no hidden dynamics)."""
 
41
 
42
+ task_id: str = ""
43
+ items: list[CacheItem] = Field(default_factory=list)
 
env/task_graders.py ADDED
@@ -0,0 +1,35 @@
1
+ """
2
+ Registered agent graders — one enabled grader per task (easy / medium / hard).
3
+
4
+ Automated checks count tasks that declare a grader and can run episode scoring.
5
+ All three share the same history-based rubric; difficulty is enforced by the
6
+ environment dynamics (items + volatility), not by different formulas.
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ from typing import Any, Callable, Dict, List
12
+
13
+ from env.grader import evaluate_episode
14
+
15
+ History = List[Dict[str, Any]]
16
+
17
+
18
+ def easy_agent_grader(history: History) -> float:
19
+ return evaluate_episode(history)
20
+
21
+
22
+ def medium_agent_grader(history: History) -> float:
23
+ return evaluate_episode(history)
24
+
25
+
26
+ def hard_agent_grader(history: History) -> float:
27
+ return evaluate_episode(history)
28
+
29
+
30
+ # Explicit registry (imported by server /tasks and static analysis)
31
+ TASK_AGENT_GRADERS: Dict[str, Callable[[History], float]] = {
32
+ "easy": easy_agent_grader,
33
+ "medium": medium_agent_grader,
34
+ "hard": hard_agent_grader,
35
+ }
env/tasks.py CHANGED
@@ -1,6 +1,8 @@
  import random

- # Declared for GET /tasks and openenv.yaml (submission validators count tasks with graders).
+ from env.task_graders import TASK_AGENT_GRADERS
+
+ # Declared for GET /tasks + openenv.yaml (Phase 1 task/grader discovery).
  TASK_MANIFEST = [
      {
          "name": "easy",
@@ -10,6 +12,8 @@ TASK_MANIFEST = [
          "difficulty": "easy",
          "max_steps": 10,
          "grader": True,
+         "grader_kind": "programmatic",
+         "grader_callable": "env.task_graders:easy_agent_grader",
          "score_range": [0.0, 1.0],
      },
      {
@@ -20,6 +24,8 @@ TASK_MANIFEST = [
          "difficulty": "medium",
          "max_steps": 10,
          "grader": True,
+         "grader_kind": "programmatic",
+         "grader_callable": "env.task_graders:medium_agent_grader",
          "score_range": [0.0, 1.0],
      },
      {
@@ -30,6 +36,8 @@ TASK_MANIFEST = [
          "difficulty": "hard",
          "max_steps": 10,
          "grader": True,
+         "grader_kind": "programmatic",
+         "grader_callable": "env.task_graders:hard_agent_grader",
          "score_range": [0.0, 1.0],
      },
  ]
@@ -47,8 +55,22 @@ def get_task(task_id):
      else:
          return {
              "num_items": 3,
-             "volatility": 0.3
+             "volatility": 0.3,
          }

- def sample_task():
-     return random.choice(["easy", "medium", "hard"])
+
+ def sample_task(rng=None):
+     r = rng if rng is not None else random
+     return r.choice(["easy", "medium", "hard"])
+
+
+ def list_graders():
+     """Return task ids that have an enabled agent grader."""
+     return [
+         {
+             "task": name,
+             "grader_enabled": fn is not None,
+             "callable": getattr(fn, "__name__", str(fn)),
+         }
+         for name, fn in TASK_AGENT_GRADERS.items()
+     ]
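The `grader_callable` values added above use a `module:function` string. As a minimal sketch of how a consumer might turn such a string into a callable, assuming the common entry-point convention applies: `resolve_callable` is a hypothetical helper that is not part of this commit, and the demo resolves a standard-library function so it runs outside the repo.

```python
import importlib
from typing import Any, Callable


def resolve_callable(spec: str) -> Callable[..., Any]:
    """Resolve a 'package.module:attribute' string to the object it names.

    Hypothetical helper for illustration only; the commit itself just stores
    these strings in TASK_MANIFEST and openenv.yaml.
    """
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not name a callable")
    return obj


if __name__ == "__main__":
    # Standard-library stand-in; inside the repo the spec would look like
    # "env.task_graders:easy_agent_grader".
    sqrt = resolve_callable("math:sqrt")
    print(sqrt(9.0))
```

Resolving lazily like this keeps the manifest serializable as plain JSON or YAML, while `TASK_AGENT_GRADERS` stays the in-process source of truth.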
inference.py CHANGED
@@ -3,14 +3,13 @@ import os
  import sys
  import textwrap
  from pathlib import Path
- from typing import List, Optional
+ from typing import Any, Dict, List, Optional

  import requests
  from openai import OpenAI

- from env.grader import clamp_strict_unit_interval
+ from env.grader import clamp_unit_interval

- # Load .env from repo root so HF_TOKEN / API_BASE_URL work when you run: python inference.py
  try:
      from dotenv import load_dotenv

@@ -18,33 +17,36 @@ try:
  except ImportError:
      pass

- # ---- Mandatory env (see hackathon spec) ----
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
  API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
- # HF deprecated api-inference.huggingface.co (410); router is the supported OpenAI-compatible host.
-
  MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
-
- ENV_URL = os.getenv("ENV_URL", "https://parvpareek-cache-env.hf.space")
+ ENV_URL = os.getenv(
+     "ENV_URL",
+     "http://127.0.0.1:7860",
+ ).rstrip("/")
  BENCHMARK = "cache_invalidation_env"

+ # Reproducibility (Phase 1 / baseline): fixed seed + task → deterministic heuristic run.
+ EPISODE_SEED = int(os.getenv("EPISODE_SEED", "42"))
+ TASK_ID = os.getenv("TASK_ID", "easy")
+
  if not API_KEY:
      print(
-         "WARNING: HF_TOKEN is not set. LLM calls will fail; the script will fall back to the "
-         "heuristic policy. Set HF_TOKEN in the environment or in a .env file next to inference.py.",
+         "WARNING: HF_TOKEN is not set. LLM calls will fail; the script will use the "
+         "heuristic policy only.",
          file=sys.stderr,
      )

  client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY or "hf-invalid")

- MEMORY = {}
- LAST_USED = None
+ MEMORY: Dict[str, Any] = {}
+ LAST_USED: Optional[str] = None

  SYSTEM_PROMPT = textwrap.dedent(
      """
-     You are a cache invalidation agent. Given the environment state (JSON), reply with exactly one JSON object
+     You are a cache invalidation agent. Given the environment observation (JSON), reply with exactly one JSON object
      on a single line, no markdown, with keys "type" and "key". type must be one of: invalidate, refresh, keep.
-     key must match one of the item keys in state["items"].
+     key must match one of the item keys in observation["items"].
      """
  ).strip()

@@ -72,11 +74,11 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
      )


- def select_item(state, step):
+ def select_item(obs: Dict[str, Any], step: int) -> Dict[str, Any]:
      global LAST_USED
-     items = state["items"]
+     items = obs["items"]

-     def score(item):
+     def score(item: Dict[str, Any]) -> int:
          s = 0
          if item["last_result"] == "stale":
              s += 3
@@ -98,7 +100,7 @@ def select_item(state, step):
      return best


- def decide(item, step):
+ def decide(item: Dict[str, Any], step: int) -> Dict[str, str]:
      key = item["key"]
      last_result = item["last_result"]
      age = item["age"]
@@ -123,8 +125,7 @@ def decide(item, step):
      return {"type": "keep", "key": key}


- def llm_action(state) -> Optional[dict]:
-     """Call HF OpenAI-compatible API; return None on any failure so caller can fall back."""
+ def llm_action(obs: Dict[str, Any]) -> Optional[dict]:
      try:
          completion = client.chat.completions.create(
              model=MODEL_NAME,
@@ -133,7 +134,7 @@ def llm_action(state) -> Optional[dict]:
                  {
                      "role": "user",
                      "content": (
-                         f"State:\n{json.dumps(state)}\n\n"
+                         f"Observation:\n{json.dumps(obs)}\n\n"
                          'Return JSON only: {"type": "...", "key": "..."}'
                      ),
                  },
@@ -156,7 +157,8 @@ def llm_action(state) -> Optional[dict]:
          return None


- def run() -> None:
+ def run_episode(*, env_url: str, task_id: str, seed: int, use_llm: bool) -> None:
+     """One episode over OpenEnv HTTP API (wrapped action + observation)."""
      global LAST_USED
      LAST_USED = None
      MEMORY.clear()
@@ -165,26 +167,28 @@ def run() -> None:
      steps_taken = 0
      episode_score = 0.0
      success = False
+     score_from_env = False

      try:
-         score_from_env = False
          res = requests.post(
-             f"{ENV_URL}/reset",
-             json={},
+             f"{env_url}/reset",
+             json={"seed": seed, "task_id": task_id},
              headers={"Content-Type": "application/json"},
              timeout=60,
          )
          res.raise_for_status()
          body = res.json()
-         state = body.get("state", body)
-         task_id = str(body.get("task_id", "unknown"))
+         obs = body.get("observation", body)
+         tid = str(obs.get("task_id", task_id))

-         log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+         log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)

          for step in range(1, 11):
-             item = select_item(state, step)
+             item = select_item(obs, step)

-             action = llm_action(state)
+             action: Optional[dict] = None
+             if use_llm:
+                 action = llm_action(obs)
              if action is None:
                  action = decide(item, step)

@@ -194,21 +198,22 @@ def run() -> None:
              }

              step_res = requests.post(
-                 f"{ENV_URL}/step",
-                 json=action,
+                 f"{env_url}/step",
+                 json={"action": action},
                  headers={"Content-Type": "application/json"},
                  timeout=60,
              )
              step_res.raise_for_status()
              data = step_res.json()

-             reward = float(data["reward"])
+             reward = float(data["reward"] if data["reward"] is not None else 0.0)
              done = bool(data["done"])
              rewards.append(reward)
              steps_taken = step

-             if data.get("final_score") is not None:
-                 episode_score = float(data["final_score"])
+             inner = data.get("observation", {})
+             if inner.get("final_score") is not None:
+                 episode_score = float(inner["final_score"])
                  score_from_env = True

              log_step(
@@ -219,8 +224,7 @@ def run() -> None:
                  error=None,
              )

-             state = data["state"]
-
+             obs = inner
              if done:
                  break

@@ -229,13 +233,13 @@ def run() -> None:
              success = avg_r > 0.3
          if not score_from_env and rewards:
              avg_r = sum(rewards) / len(rewards)
-             episode_score = max(0.0, min(1.0, (avg_r + 1.0) / 2.0))
+             episode_score = clamp_unit_interval((avg_r + 1.0) / 2.0)

      except Exception as exc:
          success = False
          print(f"[RUN] fatal: {exc}", file=sys.stderr)
      finally:
-         episode_score = clamp_strict_unit_interval(episode_score)
+         episode_score = clamp_unit_interval(episode_score)
          log_end(
              success=success,
              steps=steps_taken,
@@ -244,5 +248,24 @@ def run() -> None:
          )


+ def run() -> None:
+     use_llm = bool(API_KEY and API_KEY != "hf-invalid")
+     if os.getenv("RUN_ALL_TASKS", "").lower() in ("1", "true", "yes"):
+         for tid in ("easy", "medium", "hard"):
+             run_episode(
+                 env_url=ENV_URL,
+                 task_id=tid,
+                 seed=EPISODE_SEED,
+                 use_llm=use_llm,
+             )
+         return
+     run_episode(
+         env_url=ENV_URL,
+         task_id=TASK_ID,
+         seed=EPISODE_SEED,
+         use_llm=use_llm,
+     )
+
+
  if __name__ == "__main__":
      run()
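The system prompt in this file asks the model for a single-line JSON object with `type` and `key`, and the caller falls back to the heuristic when `llm_action` returns `None`. The body of `llm_action` is elided from the diff, so the following is only a sketch of how such a reply could be validated before it is wrapped as `{"action": ...}`; `parse_action` and `VALID_TYPES` are hypothetical names, not code from this commit.

```python
import json
from typing import Any, Dict, Iterable, Optional

# The three action types named in SYSTEM_PROMPT.
VALID_TYPES = {"invalidate", "refresh", "keep"}


def parse_action(reply: Any, valid_keys: Iterable[str]) -> Optional[Dict[str, str]]:
    """Return a validated action dict, or None if the reply is unusable.

    Hypothetical helper: returning None lets the caller fall back to the
    heuristic policy, mirroring the `if action is None` branch in the diff.
    """
    if not isinstance(reply, str):
        return None
    try:
        data = json.loads(reply.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = data.get("type")
    key = data.get("key")
    # Check types first so unhashable values never reach the set lookups.
    if not isinstance(action_type, str) or not isinstance(key, str):
        return None
    if action_type not in VALID_TYPES or key not in set(valid_keys):
        return None
    return {"type": action_type, "key": key}


if __name__ == "__main__":
    keys = ["item_0", "item_1"]
    print(parse_action('{"type": "refresh", "key": "item_1"}', keys))
    print(parse_action("I think we should refresh item_1", keys))
```

Rejecting unknown keys on the client side avoids spending one of the ten steps on a request the environment would likely refuse.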
openenv.yaml CHANGED
@@ -1,8 +1,14 @@
+ spec_version: 1
  name: cache_invalidation_env
  version: "1.0.0"
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 7860
  description: >
-   Decision-making environment for cache invalidation under uncertainty.
-   Three difficulty levels; each task has an episode grader (final_score on done).
+   Cache invalidation under uncertainty: agents choose invalidate, refresh, or keep per step
+   from noisy hit/stale observations. Three difficulty tasks (easy to hard), each with a
+   programmatic episode grader (final_score in [0,1]).

  tasks:
    - name: easy
@@ -10,6 +16,8 @@ tasks:
      difficulty: easy
      max_steps: 10
      grader: true
+     grader_kind: programmatic
+     grader_callable: env.task_graders:easy_agent_grader
      score_range: [0.0, 1.0]

    - name: medium
@@ -17,31 +25,24 @@ tasks:
      difficulty: medium
      max_steps: 10
      grader: true
+     grader_kind: programmatic
+     grader_callable: env.task_graders:medium_agent_grader
      score_range: [0.0, 1.0]

    - name: hard
-     description: "Most items and high volatility; staleness signal is noisy and costly mistakes are easier."
+     description: "Most items and high volatility; noisy staleness signal and harder tradeoffs."
      difficulty: hard
      max_steps: 10
      grader: true
+     grader_kind: programmatic
+     grader_callable: env.task_graders:hard_agent_grader
      score_range: [0.0, 1.0]

- actions:
-   type: object
-   properties:
-     type:
-       type: string
-     key:
-       type: string
-
- observations:
-   type: object
-
- reward:
-   type: float
-
  endpoints:
    reset: POST /reset
    step: POST /step
    state: GET /state
+   schema: GET /schema
+   metadata: GET /metadata
+   health: GET /health
    tasks: GET /tasks
pyproject.toml CHANGED
@@ -17,6 +17,9 @@ dependencies = [
      "python-dotenv>=1.0.0",
  ]

+ [project.optional-dependencies]
+ dev = ["pytest>=8.0"]
+
  [project.scripts]
  server = "server.app:main"

server/app.py CHANGED
@@ -1,12 +1,59 @@
- """OpenEnv entry: validator requires server/app.py with def main(...) and if __name__ + main()."""
+ """OpenEnv FastAPI server: full HTTPEnvServer + task/grader discovery routes."""
+
+ from __future__ import annotations
+
+ import os
+ from typing import Optional

  import uvicorn
+ from openenv.core.env_server import create_fastapi_app
+
+ from env.cache_environment import CacheInvalidationEnvironment
+ from env.models import CacheAction, CacheObservation
+ from env.task_graders import TASK_AGENT_GRADERS
+ from env.tasks import TASK_MANIFEST, list_graders
+
+ _singleton: CacheInvalidationEnvironment | None = None
+
+
+ def _env_factory() -> CacheInvalidationEnvironment:
+     global _singleton
+     if _singleton is None:
+         _singleton = CacheInvalidationEnvironment()
+     return _singleton
+
+
+ app = create_fastapi_app(
+     _env_factory,
+     CacheAction,
+     CacheObservation,
+     max_concurrent_envs=1,
+ )
+

+ @app.get(
+     "/tasks",
+     tags=["Environment Info"],
+     summary="List tasks and grader registration",
+ )
+ def http_list_tasks():
+     return {
+         "tasks": TASK_MANIFEST,
+         "graders": list_graders(),
+         "grader_registry": {
+             name: {
+                 "enabled": True,
+                 "qualified_name": f"{fn.__module__}:{fn.__name__}",
+             }
+             for name, fn in TASK_AGENT_GRADERS.items()
+         },
+     }

- def main(host: str = "0.0.0.0", port: int = 7860):
-     from app import app as fastapi_app

-     uvicorn.run(fastapi_app, host=host, port=port)
+ def main(host: Optional[str] = None, port: Optional[int] = None) -> None:
+     host = host or os.environ.get("HOST", "0.0.0.0")
+     port = int(port or os.environ.get("PORT", "7860"))
+     uvicorn.run(app, host=host, port=port)


  if __name__ == "__main__":
tests/conftest.py ADDED
@@ -0,0 +1,10 @@
+ import pytest
+
+
+ @pytest.fixture(autouse=True)
+ def reset_env_singleton():
+     import server.app as sa
+
+     sa._singleton = None
+     yield
+     sa._singleton = None
tests/test_phase1.py ADDED
@@ -0,0 +1,73 @@
+ """Phase 1 gates: OpenEnv HTTP, three tasks, graders in [0,1], reproducible seed."""
+
+ import pytest
+ from fastapi.testclient import TestClient
+
+ from env.grader import clamp_unit_interval, evaluate_episode
+ from env.task_graders import TASK_AGENT_GRADERS
+ from server.app import app
+
+
+ @pytest.fixture
+ def client():
+     return TestClient(app)
+
+
+ def test_tasks_endpoint_three_graders(client):
+     r = client.get("/tasks")
+     assert r.status_code == 200
+     data = r.json()
+     assert len(data["tasks"]) >= 3
+     enabled = [t for t in data["tasks"] if t.get("grader")]
+     assert len(enabled) >= 3
+     assert len(data["grader_registry"]) >= 3
+
+
+ def test_each_task_grader_returns_unit_interval():
+     history = [
+         {"action": "keep", "is_stale": False},
+         {"action": "invalidate", "is_stale": True},
+     ]
+     for name, fn in TASK_AGENT_GRADERS.items():
+         s = fn(history)
+         assert 0.0 <= s <= 1.0, (name, s)
+
+
+ def test_reset_step_openenv_shape(client):
+     r = client.post("/reset", json={"seed": 123, "task_id": "medium"})
+     assert r.status_code == 200
+     body = r.json()
+     assert set(body.keys()) >= {"observation", "reward", "done"}
+     obs = body["observation"]
+     assert obs["task_id"] == "medium"
+     key = obs["items"][0]["key"]
+     s = client.post("/step", json={"action": {"type": "keep", "key": key}})
+     assert s.status_code == 200
+     assert "observation" in s.json()
+
+
+ def test_reproducible_reset_seed(client):
+     a = client.post("/reset", json={"seed": 999, "task_id": "easy"}).json()["observation"]
+     b = client.post("/reset", json={"seed": 999, "task_id": "easy"}).json()["observation"]
+     assert a["items"] == b["items"]
+
+
+ def test_final_score_in_range(client):
+     r = client.post("/reset", json={"seed": 0, "task_id": "easy"})
+     obs = r.json()["observation"]
+     final = None
+     for _ in range(12):
+         k = obs["items"][0]["key"]
+         d = client.post("/step", json={"action": {"type": "keep", "key": k}}).json()
+         obs = d["observation"]
+         if obs.get("final_score") is not None:
+             final = obs["final_score"]
+             break
+     assert final is not None
+     assert 0.0 <= final <= 1.0
+
+
+ def test_clamp_unit_interval():
+     assert clamp_unit_interval(-1) == 0.0
+     assert clamp_unit_interval(2) == 1.0
+     assert evaluate_episode([]) == 0.0
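The tests above pin down two values of `clamp_unit_interval`, and `inference.py` uses it to derive a fallback score from `(avg_r + 1.0) / 2.0`. The real implementation lives in `env/grader.py`, which this diff does not show, so the sketch below is a stand-in written under two assumptions: that the clamp is a plain min/max, and that step rewards lie roughly in [-1, 1] (inferred from the formula). `fallback_score` is a hypothetical name.

```python
from typing import Sequence


def clamp_unit_interval(x: float) -> float:
    """Stand-in for env.grader.clamp_unit_interval (assumed min/max clamp)."""
    return max(0.0, min(1.0, float(x)))


def fallback_score(rewards: Sequence[float]) -> float:
    """Map the mean step reward, assumed to lie in [-1, 1], onto [0, 1].

    Used only when the environment never reported a final_score.
    """
    if not rewards:
        return 0.0
    avg = sum(rewards) / len(rewards)
    return clamp_unit_interval((avg + 1.0) / 2.0)


if __name__ == "__main__":
    print(fallback_score([1.0, 0.0, -1.0]))  # mean reward 0.0 maps to the midpoint
    print(fallback_score([-3.0]))            # out-of-range rewards are clamped
```

The affine map keeps a neutral policy (mean reward 0) at the midpoint of the score range, and the clamp guards against rewards outside the assumed bounds.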
uv.lock CHANGED
@@ -234,16 +234,23 @@ dependencies = [
      { name = "uvicorn", extra = ["standard"] },
  ]

+ [package.optional-dependencies]
+ dev = [
+     { name = "pytest" },
+ ]
+
  [package.metadata]
  requires-dist = [
      { name = "fastapi", specifier = ">=0.100.0" },
      { name = "openai", specifier = ">=1.0.0" },
      { name = "openenv-core", extras = ["core"], specifier = ">=0.2.2" },
      { name = "pydantic", specifier = ">=2.0.0" },
+     { name = "pytest", marker = "extra == 'dev'", specifier = ">=8.0" },
      { name = "python-dotenv", specifier = ">=1.0.0" },
      { name = "requests", specifier = ">=2.28.0" },
      { name = "uvicorn", extras = ["standard"], specifier = ">=0.22.0" },
  ]
+ provides-extras = ["dev"]

  [[package]]
  name = "cachetools"
@@ -956,6 +963,15 @@ wheels = [
      { url = "https://files.pythonhosted.org/packages/fa/5e/f8e9a1d23b9c20a551a8a02ea3637b4642e22c2626e3a13a9a29cdea99eb/importlib_metadata-8.7.1-py3-none-any.whl", hash = "sha256:5a1f80bf1daa489495071efbb095d75a634cf28a8bc299581244063b53176151", size = 27865, upload-time = "2025-12-21T10:00:18.329Z" },
  ]

+ [[package]]
+ name = "iniconfig"
+ version = "2.3.0"
+ source = { registry = "https://pypi.org/simple" }
+ sdist = { url = "https://files.pythonhosted.org/packages/72/34/14ca021ce8e5dfedc35312d08ba8bf51fdd999c576889fc2c24cb97f4f10/iniconfig-2.3.0.tar.gz", hash = "sha256:c76315c77db068650d49c5b56314774a7804df16fee4402c1f19d6d15d8c4730", size = 20503, upload-time = "2025-10-18T21:55:43.219Z" }
+ wheels = [
+     { url = "https://files.pythonhosted.org/packages/cb/b1/3846dd7f199d53cb17f49cba7e651e9ce294d8497c8c150530ed11865bb8/iniconfig-2.3.0-py3-none-any.whl", hash = "sha256:f631c04d2c48c52b84d0d0549c99ff3859c98df65b3101406327ecc7d53fbf12", size = 7484, upload-time = "2025-10-18T21:55:41.639Z" },
+ ]
+
  [[package]]
  name = "jaraco-classes"
  version = "3.4.0"
@@ -1893,6 +1909,15 @@ wheels = [
      { url = "https://files.pythonhosted.org/packages/63/d7/97f7e3a6abb67d8080dd406fd4df842c2be0efaf712d1c899c32a075027c/platformdirs-4.9.4-py3-none-any.whl", hash = "sha256:68a9a4619a666ea6439f2ff250c12a853cd1cbd5158d258bd824a7df6be2f868", size = 21216, upload-time = "2026-03-05T18:34:12.172Z" },
  ]

+ [[package]]
+ name = "pluggy"
+ version = "1.6.0"
+ source = { registry = "https://pypi.org/simple" }
+ sdist = { url = "https://files.pythonhosted.org/packages/f9/e2/3e91f31a7d2b083fe6ef3fa267035b518369d9511ffab804f839851d2779/pluggy-1.6.0.tar.gz", hash = "sha256:7dcc130b76258d33b90f61b658791dede3486c3e6bfb003ee5c9bfb396dd22f3", size = 69412, upload-time = "2025-05-15T12:30:07.975Z" }
+ wheels = [
+     { url = "https://files.pythonhosted.org/packages/54/20/4d324d65cc6d9205fabedc306948156824eb9f0ee1633355a8f7ec5c66bf/pluggy-1.6.0-py3-none-any.whl", hash = "sha256:e920276dd6813095e9377c0bc5566d94c932c33b27a3e3945d8389c374dd4746", size = 20538, upload-time = "2025-05-15T12:30:06.134Z" },
+ ]
+
  [[package]]
  name = "py-key-value-aio"
  version = "0.4.4"
@@ -2123,6 +2148,24 @@ wheels = [
      { url = "https://files.pythonhosted.org/packages/df/80/fc9d01d5ed37ba4c42ca2b55b4339ae6e200b456be3a1aaddf4a9fa99b8c/pyperclip-1.11.0-py3-none-any.whl", hash = "sha256:299403e9ff44581cb9ba2ffeed69c7aa96a008622ad0c46cb575ca75b5b84273", size = 11063, upload-time = "2025-09-26T14:40:36.069Z" },
  ]

+ [[package]]
+ name = "pytest"
+ version = "9.0.3"
+ source = { registry = "https://pypi.org/simple" }
+ dependencies = [
+     { name = "colorama", marker = "sys_platform == 'win32'" },
+     { name = "exceptiongroup", marker = "python_full_version < '3.11'" },
+     { name = "iniconfig" },
+     { name = "packaging" },
+     { name = "pluggy" },
+     { name = "pygments" },
+     { name = "tomli", marker = "python_full_version < '3.11'" },
+ ]
+ sdist = { url = "https://files.pythonhosted.org/packages/7d/0d/549bd94f1a0a402dc8cf64563a117c0f3765662e2e668477624baeec44d5/pytest-9.0.3.tar.gz", hash = "sha256:b86ada508af81d19edeb213c681b1d48246c1a91d304c6c81a427674c17eb91c", size = 1572165, upload-time = "2026-04-07T17:16:18.027Z" }
+ wheels = [
+     { url = "https://files.pythonhosted.org/packages/d4/24/a372aaf5c9b7208e7112038812994107bc65a84cd00e0354a88c2c77a617/pytest-9.0.3-py3-none-any.whl", hash = "sha256:2c5efc453d45394fdd706ade797c0a81091eccd1d6e4bccfcd476e2b8e0ab5d9", size = 375249, upload-time = "2026-04-07T17:16:16.13Z" },
+ ]
+
  [[package]]
  name = "python-dateutil"
  version = "2.9.0.post0"