---
title: Cache Env
emoji: 🏒
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
---

# Cache invalidation environment (OpenEnv)

## For judges: what this is

**Problem in one sentence:** Backends cache data to go fast; they must decide **when to invalidate, softly refresh, or leave cache alone** using **noisy clues** (like real monitoring), not the ground truth.

**Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.

**Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
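As a rough illustration of the decision the agent faces, the sketch below picks one action from the observable fields alone. It is not the policy shipped in `inference.py`: the thresholds `soft_age` and `hard_age`, the item field name `key`, and the string values `"hit"` / `"stale"` for `last_result` are all assumptions made for the example.

```python
from typing import Dict, List


def choose_action(items: List[dict], soft_age: int = 3, hard_age: int = 6) -> Dict[str, str]:
    """Pick one action for one key using only noisy observable fields.

    Thresholds are hypothetical; tune them against the real environment.
    """
    if not items:
        raise ValueError("observation contains no items")

    def suspicion(item: dict):
        # A stale read outranks age; age breaks ties between items.
        return (item.get("last_result") == "stale", item.get("age", 0))

    target = max(items, key=suspicion)
    saw_stale = target.get("last_result") == "stale"
    age = target.get("age", 0)

    if saw_stale and age >= hard_age:
        action_type = "invalidate"   # strong evidence: drop the entry
    elif saw_stale or age >= soft_age:
        action_type = "refresh"      # weak evidence: soft refresh
    else:
        action_type = "keep"         # no evidence: leave the cache alone
    return {"type": action_type, "key": target["key"]}
```

Because `last_result` is noisy, a single stale read only triggers a soft refresh here; invalidation needs the age signal to agree.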

**Tasks:** **easy → medium → hard**, with more items and higher volatility at each level; each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.

---

## OpenEnv spec compliance

- **Typed models:** `env/models.py` defines `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
- **Environment:** `env/cache_environment.py` provides `CacheInvalidationEnvironment`, which implements `reset` / `step` / `state` / `get_metadata`.
- **HTTP server:** `server/app.py` uses `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
- **Manifest:** `openenv.yaml` declares `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
- **Client (WebSocket):** `env/client.py` provides `CacheInvalidationEnvClient` for typed `EnvClient` usage.
- **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.

Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).

---

## Action & observation

**Action (POST `/step` body, OpenEnv wrapped form):**

```json
{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}
```

`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
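A client can build the wrapped body with a small helper, sketched below. The function name `build_step_payload` is hypothetical; only the payload shape and the three action types come from the description above.

```python
import json

VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str) -> dict:
    """Return the wrapped POST /step body, rejecting unknown action types early."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    return {"action": {"type": action_type, "key": key}}


if __name__ == "__main__":
    # Serialise the same body shown in the JSON example.
    print(json.dumps(build_step_payload("invalidate", "item_0")))
```

The helper does not check that `key` exists in the current observation; the server remains the authority on that.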

**Reset (POST `/reset`):**

```json
{
  "seed": 42,
  "task_id": "easy"
}
```

Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
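The reset body can be assembled the same way. In this sketch the helper name `build_reset_payload` is an assumption, and it only handles `task_id` (not `task_name`) to keep the example short.

```python
from typing import Optional

TASK_IDS = ("easy", "medium", "hard")


def build_reset_payload(seed: Optional[int] = None, task_id: Optional[str] = None) -> dict:
    """Return a POST /reset body; omitted fields are left out entirely."""
    payload: dict = {}
    if seed is not None:
        payload["seed"] = seed
    if task_id is not None:
        if task_id not in TASK_IDS:
            raise ValueError(f"unknown task_id: {task_id!r}")
        payload["task_id"] = task_id
    return payload
```

Calling `build_reset_payload()` with no arguments yields `{}`, which leaves task selection to the server.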

**Response shape (reset & step):**

```json
{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}
```

When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
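A client might read the score defensively, as in the sketch below. The helper name `extract_final_score` is hypothetical; the field names follow the response shape shown above.

```python
from typing import Optional


def extract_final_score(response: dict) -> Optional[float]:
    """Return final_score once the episode is done, else None."""
    if not response.get("done", False):
        return None  # mid-episode: final_score is expected to be null
    score = response.get("observation", {}).get("final_score")
    if score is None:
        return None
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"final_score out of range: {score}")
    return float(score)
```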

---

## Tasks and graders

- **Registry:** `env/task_graders.py` defines `TASK_AGENT_GRADERS`, which maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
- **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
- **Episode grader:** `env/grader.py` provides `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
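An automated check of the discovery response could look like the sketch below. It only assumes the three top-level fields named above and that `tasks` is a sized collection; the inner structure of each field is not specified here, so the function `check_tasks_response` deliberately does not inspect it.

```python
REQUIRED_KEYS = ("tasks", "graders", "grader_registry")


def check_tasks_response(payload: dict, min_tasks: int = 3) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in payload:
            problems.append(f"missing top-level field: {key}")
    tasks = payload.get("tasks")
    if tasks is not None and len(tasks) < min_tasks:
        problems.append(f"expected at least {min_tasks} tasks, found {len(tasks)}")
    return problems
```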

---

## Setup & run

**Install (dev):**

```bash
uv sync --extra dev
```

**Local server:**

```bash
uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860
```

**Health check:**

```bash
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'
```

Expect `200`.

**Docker:** `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port **7860**).

---

## Baseline inference (`inference.py`)

- Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
- **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
- **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).

```bash
export ENV_URL='http://127.0.0.1:7860'   # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py

# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
```
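The environment variables above could be resolved roughly as follows. This is a sketch rather than the code in `inference.py`: the function name `load_run_config` and the fallback value for `ENV_URL` are assumptions, while the defaults for `EPISODE_SEED` and `TASK_ID` follow the list above.

```python
import os
from typing import Mapping, Optional

ALL_TASKS = ["easy", "medium", "hard"]


def load_run_config(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve run settings from environment variables (or a supplied mapping)."""
    env = os.environ if environ is None else environ
    run_all = env.get("RUN_ALL_TASKS") == "1"
    return {
        # Assumed fallback; the README only shows this URL as an example.
        "env_url": env.get("ENV_URL", "http://127.0.0.1:7860"),
        "seed": int(env.get("EPISODE_SEED", "42")),
        # RUN_ALL_TASKS=1 overrides TASK_ID and runs the tasks in order.
        "task_ids": list(ALL_TASKS) if run_all else [env.get("TASK_ID", "easy")],
    }
```

Passing a plain dict instead of `os.environ` makes the resolution easy to test without touching the process environment.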

---

## Tests (Phase 1 checks)

```bash
uv run pytest tests/ -q
```

Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.

---

## Validation (pre-submission)

```bash
openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .
```

---

## Repository layout

| Path | Purpose |
|------|---------|
| `env/models.py` | Typed Action / Observation / State |
| `env/cache_environment.py` | `Environment` implementation |
| `env/grader.py` | Step rewards + episode `evaluate_episode` |
| `env/task_graders.py` | **Three named agent graders** (registry) |
| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
| `env/client.py` | Typed WebSocket `EnvClient` |
| `server/app.py` | `create_fastapi_app` + `/tasks` |
| `app.py` | Uvicorn entry shim |
| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
| `openenv.yaml` | Full OpenEnv manifest |
| `tests/` | Phase 1 pytest |

---

## Scoring

- **Per-step `reward`:** Shaped (can be negative mid-episode).
- **`final_score`:** In **[0.0, 1.0]** when `done`; combines correctness, unnecessary invalidations, and action stability.
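To make the shape of such a score concrete, here is a purely illustrative combination of the three signals. The weights, the input names, and the function `combine_score` are hypothetical and are not taken from `env/grader.py`; only the clamping to [0.0, 1.0] reflects the documented range.

```python
def combine_score(correctness: float, unnecessary_rate: float, oscillation_rate: float) -> float:
    """Blend three episode signals into one score in [0.0, 1.0].

    Hypothetical weights: 0.6 correctness, 0.2 for avoiding unnecessary
    invalidations, 0.2 for action stability (low oscillation).
    """
    raw = (
        0.6 * correctness
        + 0.2 * (1.0 - unnecessary_rate)
        + 0.2 * (1.0 - oscillation_rate)
    )
    # Clamp so rounding or out-of-range inputs never leave the documented range.
    return max(0.0, min(1.0, raw))
```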

---

## Resource notes

Inference and the env server are lightweight (short episodes, small JSON). Suitable for **2 vCPU / 8 GiB**; keep `RUN_ALL_TASKS` episodes bounded (fixed 10 steps per episode × 3 tasks).