---
title: Cache Env
emoji: π’
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
---
# Cache invalidation environment (OpenEnv)
## For judges: what this is
**Problem in one sentence:** Backends cache data to go fast; they must decide **when to invalidate, softly refresh, or leave cache alone** using **noisy clues** (like real monitoring), not the ground truth.
**Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
**Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
**Tasks:** **easy → medium → hard**, with more items and higher volatility at each level. Each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.
---
## OpenEnv spec compliance
- **Typed models:** `env/models.py` defines `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
- **Environment:** `env/cache_environment.py` defines `CacheInvalidationEnvironment`, which implements `reset` / `step` / `state` / `get_metadata`.
- **HTTP server:** `server/app.py` uses `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
- **Manifest:** `openenv.yaml` declares `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
- **Client (WebSocket):** `env/client.py` provides `CacheInvalidationEnvClient` for typed `EnvClient` usage.
- **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.
Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).
---
## Action & observation
**Action (POST `/step` body, OpenEnv wrapped form):**
```json
{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}
```
`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
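As a rough illustration, a client could build and check the wrapped body before sending it. The helper name `build_step_payload` and the constant `VALID_ACTION_TYPES` are hypothetical and not part of this repo; only the three action types and the wrapped `action` form come from the description above.

```python
# Action types listed in this README; the constant name itself is an assumption.
VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str) -> dict:
    """Return the wrapped body for POST /step, rejecting unknown action types."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    # OpenEnv wrapped form: the action sits under a top-level "action" field.
    return {"action": {"type": action_type, "key": key}}
```

This only guards against a misspelled `type`; whether `key` matches an item in the current observation still has to be checked against the latest reply from the server.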
**Reset (POST `/reset`):**
```json
{
  "seed": 42,
  "task_id": "easy"
}
```
Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
**Response shape (reset & step):**
```json
{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}
```
When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
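The reset/step/done cycle described above can be sketched as a small client loop. This is a minimal sketch using only the standard library, not the code in `inference.py`: the names `http_post`, `run_episode`, and `choose`, and the `max_steps` safety bound, are assumptions made for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional

PostFn = Callable[[str, dict], dict]


def http_post(base_url: str) -> PostFn:
    """Return a function that POSTs JSON to base_url + path and decodes the reply."""
    def post(path: str, body: dict) -> dict:
        req = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.loads(resp.read().decode("utf-8"))
    return post


def run_episode(post: PostFn, choose: Callable[[dict], dict],
                seed: int = 42, task_id: str = "easy",
                max_steps: int = 100) -> Optional[float]:
    """Reset, then step until `done`; return observation.final_score.

    Returns None if the episode did not finish within max_steps.
    """
    reply = post("/reset", {"seed": seed, "task_id": task_id})
    for _ in range(max_steps):  # safety bound in case `done` never arrives
        if reply["done"]:
            break
        # `choose` maps an observation to {"type": ..., "key": ...}.
        action = choose(reply["observation"])
        reply = post("/step", {"action": action})
    return reply["observation"].get("final_score")
```

Against a local server this could be called as `run_episode(http_post("http://127.0.0.1:7860"), my_policy)`, where `my_policy` is any function from an observation to an action. Passing `post` in as a parameter keeps the loop testable without a running server.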
---
## Tasks and graders
- **Registry:** `env/task_graders.py` defines `TASK_AGENT_GRADERS`, which maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
- **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
- **Episode grader:** `env/grader.py` provides `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
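The registry pattern above (one rubric, three distinct callables) might look roughly like the following. This is an illustrative sketch, not the contents of `env/task_graders.py` or `env/grader.py`: the `EpisodeStats` fields, the weights, and the helpers `_shared_rubric` and `_make_grader` are all assumptions, chosen only to show a score built from the three named criteria and clamped to [0.0, 1.0].

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EpisodeStats:
    """Hypothetical per-episode summary; field names are assumptions."""
    fresh_fraction: float            # share of steps where served data was fresh
    unnecessary_invalidations: int   # invalidations of items that were not stale
    oscillations: int                # action flips on the same key
    steps: int


def _shared_rubric(stats: EpisodeStats) -> float:
    """Illustrative rubric: reward freshness, penalize waste and flip-flopping."""
    n = max(stats.steps, 1)
    raw = (stats.fresh_fraction
           - 0.5 * stats.unnecessary_invalidations / n
           - 0.25 * stats.oscillations / n)
    return min(1.0, max(0.0, raw))   # clamp into the documented [0.0, 1.0] range


def _make_grader(task_id: str) -> Callable[[EpisodeStats], float]:
    """Wrap the shared rubric in a distinct, named callable per task."""
    def grader(stats: EpisodeStats) -> float:
        return _shared_rubric(stats)
    grader.__name__ = f"grade_{task_id}"
    return grader


# One distinct callable per task id, all backed by the same rubric.
TASK_AGENT_GRADERS: Dict[str, Callable[[EpisodeStats], float]] = {
    task_id: _make_grader(task_id) for task_id in ("easy", "medium", "hard")
}
```

Keeping the callables distinct while sharing the rubric means a validator can see one grader per task, and scores stay comparable across difficulty levels.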
---
## Setup & run
**Install (dev):**
```bash
uv sync --extra dev
```
**Local server:**
```bash
uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860
```
**Health check (via `POST /reset`):**
```bash
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'
```
Expect `200`.
**Docker:** `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port **7860**).
---
## Baseline inference (`inference.py`)
- Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
- **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
- **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).
```bash
export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py
# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
```
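To make the environment variables and the no-API-key fallback concrete, here is a minimal sketch of task selection plus one possible heuristic. It is not the policy shipped in `inference.py`: the function names, the `max_age` threshold, the item field `key`, and the literal `"stale"` value for `last_result` are assumptions based on the observable fields described earlier.

```python
from typing import List, Mapping

ALL_TASKS = ("easy", "medium", "hard")


def tasks_to_run(env: Mapping[str, str]) -> List[str]:
    """RUN_ALL_TASKS=1 selects all three tasks in order; otherwise use TASK_ID."""
    if env.get("RUN_ALL_TASKS") == "1":
        return list(ALL_TASKS)
    return [env.get("TASK_ID", "easy")]


def heuristic_action(items: List[dict], max_age: float = 5.0) -> dict:
    """Pick one action for one key from the noisy observable fields.

    Assumed item shape: {"key", "age", "access_count", "last_result"}.
    Expects at least one item.
    """
    # 1. A stale read is the strongest (though noisy) signal: invalidate the
    #    most-accessed item whose last result was "stale".
    stale = [it for it in items if it.get("last_result") == "stale"]
    if stale:
        target = max(stale, key=lambda it: it.get("access_count", 0))
        return {"type": "invalidate", "key": target["key"]}
    # 2. Otherwise softly refresh the oldest item once it exceeds max_age.
    oldest = max(items, key=lambda it: it.get("age", 0))
    if oldest.get("age", 0) > max_age:
        return {"type": "refresh", "key": oldest["key"]}
    # 3. Nothing looks suspicious: leave the cache alone.
    return {"type": "keep", "key": oldest["key"]}
```

A runner would call `tasks_to_run(os.environ)` once and then apply `heuristic_action(observation["items"])` at every step. The ordering (invalidate on a stale signal, refresh on age, otherwise keep) is one way to keep unnecessary invalidations low.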
---
## Tests (Phase 1 checks)
```bash
uv run pytest tests/ -q
```
Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.
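As an illustration of what a JSON-shape check of this kind could assert, the helper below validates a reset or step reply against the response shape documented earlier. The name `check_reply_shape` is hypothetical, and the actual tests in `tests/` may be structured differently.

```python
def check_reply_shape(reply: dict) -> None:
    """Raise AssertionError if a reset/step reply deviates from the documented shape."""
    assert set(reply) >= {"observation", "reward", "done"}, "missing top-level keys"
    obs = reply["observation"]
    assert isinstance(obs["items"], list), "items must be a list"
    assert isinstance(reply["done"], bool), "done must be a boolean"
    assert isinstance(reply["reward"], (int, float)), "reward must be numeric"
    # final_score is only required (and range-checked) once the episode is done.
    if reply["done"]:
        score = obs.get("final_score")
        assert score is not None and 0.0 <= score <= 1.0, "final_score out of range"
```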
---
## Validation (pre-submission)
```bash
openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .
```
---
## Repository layout
| Path | Purpose |
|------|---------|
| `env/models.py` | Typed Action / Observation / State |
| `env/cache_environment.py` | `Environment` implementation |
| `env/grader.py` | Step rewards + episode `evaluate_episode` |
| `env/task_graders.py` | **Three named agent graders** (registry) |
| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
| `env/client.py` | Typed WebSocket `EnvClient` |
| `server/app.py` | `create_fastapi_app` + `/tasks` |
| `app.py` | Uvicorn entry shim |
| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
| `openenv.yaml` | Full OpenEnv manifest |
| `tests/` | Phase 1 pytest |
---
## Scoring
- **Per-step `reward`:** Shaped (can be negative mid-episode).
- **`final_score`:** In **[0.0, 1.0]** when `done`; combines correctness, unnecessary invalidations, and action stability.
---
## Resource notes
Inference and the env server are lightweight (short episodes, small JSON). Suitable for **2 vCPU / 8 GiB**; keep `RUN_ALL_TASKS` episodes bounded (fixed 10 steps per episode × 3 tasks).