| title: Cache Env | |
| emoji: 🏢 | |
| colorFrom: green | |
| colorTo: pink | |
| sdk: docker | |
| pinned: false | |
| # Cache invalidation environment (OpenEnv) | |
| ## For judges — what this is | |
| **Problem in one sentence:** Backends cache data to go fast; they must decide **when to invalidate, softly refresh, or leave cache alone** using **noisy clues** (like real monitoring), not the ground truth. | |
| **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on. | |
| **Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**. | |
| **Tasks:** **easy → medium → hard** — more items and higher volatility; each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**. | |
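To make the "noisy clues" idea concrete, here is a minimal sketch of how a hidden item could be turned into an agent-visible observation. It is not taken from `env/cache_environment.py`: the `HiddenItem` class, the `observe` function, and the `noise` parameter are hypothetical names chosen for illustration, and the real environment may model noise differently.

```python
import random
from dataclasses import dataclass


@dataclass
class HiddenItem:
    """Ground truth for one cache item (hypothetical field names)."""
    key: str
    ttl: int           # hidden: steps until the cached copy goes stale
    age: int           # visible: steps since the last refresh/invalidation
    access_count: int  # visible: how often the item has been read


def observe(item: HiddenItem, noise: float, rng: random.Random) -> dict:
    """Return only the fields the agent is allowed to see.

    `last_result` reports the true hit/stale status, but is flipped with
    probability `noise` to mimic unreliable monitoring. The hidden `ttl`
    is never exposed.
    """
    truly_stale = item.age >= item.ttl
    flipped = rng.random() < noise
    reported_stale = (not truly_stale) if flipped else truly_stale
    return {
        "key": item.key,
        "age": item.age,
        "access_count": item.access_count,
        "last_result": "stale" if reported_stale else "hit",
    }
```

With `noise=0.0` the report always matches the ground truth; with higher values the agent has to weigh `last_result` against `age` and `access_count` instead of trusting it outright.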
| --- | |
| ## OpenEnv spec compliance | |
| - **Typed models:** `env/models.py` — `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases). | |
| - **Environment:** `env/cache_environment.py` — `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`. | |
| - **HTTP server:** `server/app.py` — `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery. | |
| - **Manifest:** `openenv.yaml` — `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`. | |
| - **Client (WebSocket):** `env/client.py` — `CacheInvalidationEnvClient` for typed `EnvClient` usage. | |
| - **Shim:** `app.py` re-exports `app` for `uvicorn app:app`. | |
| Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default). | |
| --- | |
| ## Action & observation | |
| **Action (POST `/step` body, OpenEnv wrapped form):** | |
| ```json | |
| { | |
| "action": { | |
| "type": "invalidate", | |
| "key": "item_0" | |
| } | |
| } | |
| ``` | |
| `type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation. | |
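The constraints above can be checked client-side before sending a request. The helper below is a small sketch, not part of the repository: `build_step_payload` and `observed_keys` are hypothetical names, and the server remains the authority on what it accepts.

```python
VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str, observed_keys: list[str]) -> dict:
    """Build the wrapped JSON body for POST /step, validating it first."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"type must be one of {VALID_ACTION_TYPES}, got {action_type!r}")
    if key not in observed_keys:
        raise ValueError(f"key {key!r} is not in the current observation")
    # OpenEnv wrapped form: the action is nested under "action".
    return {"action": {"type": action_type, "key": key}}
```

For example, `build_step_payload("invalidate", "item_0", ["item_0", "item_1"])` produces the same body as the JSON shown above.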
| **Reset (POST `/reset`):** | |
| ```json | |
| { | |
| "seed": 42, | |
| "task_id": "easy" | |
| } | |
| ``` | |
| Set `task_id` or `task_name` to one of `easy`, `medium`, or `hard`; omit both to have the environment sample a task. A fixed `seed` makes generation reproducible. | |
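As an illustration, a reset request can be assembled with the standard library alone. This is a sketch under assumptions: `build_reset_request` is a hypothetical helper, and it only uses the `seed` and `task_id` fields shown above.

```python
import json
import urllib.request
from typing import Optional

TASK_IDS = ("easy", "medium", "hard")


def build_reset_request(
    base_url: str,
    seed: Optional[int] = None,
    task_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /reset request."""
    body = {}
    if seed is not None:
        body["seed"] = seed
    if task_id is not None:
        if task_id not in TASK_IDS:
            raise ValueError(f"task_id must be one of {TASK_IDS}, got {task_id!r}")
        body["task_id"] = task_id
    # An empty body ({}) leaves task selection to the server.
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it to a running server; building the request separately keeps the payload easy to inspect and test.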
| **Response shape (reset & step):** | |
| ```json | |
| { | |
| "observation": { | |
| "items": [...], | |
| "step": 0, | |
| "task_id": "easy", | |
| "final_score": null, | |
| "done": false | |
| }, | |
| "reward": 0.0, | |
| "done": false | |
| } | |
| ``` | |
| When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**. | |
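A client might read the score from a decoded response along these lines. The function name `extract_final_score` is hypothetical; the field names follow the response shape shown above.

```python
from typing import Optional


def extract_final_score(response: dict) -> Optional[float]:
    """Return the episode score once `done` is true, otherwise None."""
    if not response.get("done", False):
        return None  # episode still running; final_score is null
    score = response["observation"].get("final_score")
    if score is None:
        raise ValueError("episode is done but final_score is missing")
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"final_score {score} is outside [0.0, 1.0]")
    return float(score)
```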
| --- | |
| ## Tasks and graders | |
| - **Registry:** `env/task_graders.py` — `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics). | |
| - **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation. | |
| - **Episode grader:** `env/grader.py` — `evaluate_episode` (freshness, unnecessary invalidations, oscillation). | |
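One way to get three distinct callables that share a rubric is a small factory, sketched below. This is an illustration of the pattern, not the contents of `env/task_graders.py`: `shared_rubric`, `make_grader`, and the per-step `"correct"` flag are hypothetical stand-ins for the real `evaluate_episode` logic.

```python
from typing import Callable, Dict, List

EpisodeLog = List[dict]  # hypothetical: one dict per step


def shared_rubric(log: EpisodeLog) -> float:
    """Placeholder rubric: fraction of steps flagged as correct."""
    if not log:
        return 0.0
    return sum(1 for step in log if step.get("correct")) / len(log)


def make_grader(task_id: str) -> Callable[[EpisodeLog], float]:
    """Create a separately named grader per task that applies the same rubric."""
    def grader(log: EpisodeLog) -> float:
        return shared_rubric(log)
    grader.__name__ = f"grade_{task_id}"
    return grader


# Each task gets its own callable object, so discovery can list three graders.
TASK_AGENT_GRADERS: Dict[str, Callable[[EpisodeLog], float]] = {
    task_id: make_grader(task_id) for task_id in ("easy", "medium", "hard")
}
```

Because every grader delegates to the same rubric, score differences between tasks come from the episodes themselves rather than from the grading rule.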
| --- | |
| ## Setup & run | |
| **Install (dev):** | |
| ```bash | |
| uv sync --extra dev | |
| ``` | |
| **Local server:** | |
| ```bash | |
| uv run server | |
| # or | |
| uvicorn app:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| **Health check:** | |
| ```bash | |
| curl -s -o /dev/null -w '%{http_code}\n' -X POST \ | |
| -H 'Content-Type: application/json' -d '{}' \ | |
| 'http://127.0.0.1:7860/reset' | |
| ``` | |
| Expect `200`: the command prints only the HTTP status code of the `POST /reset` call. | |
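| **Docker:** `docker build -t cache-env .`, then start a container from the image. It runs the `CMD` defined in the `Dockerfile` (`uvicorn app:app`, port **7860**), so publish that port, for example `docker run -p 7860:7860 cache-env`. | |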
| --- | |
| ## Baseline inference (`inference.py`) | |
| - Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses. | |
| - **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`). | |
| - **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes). | |
| - Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required). | |
| ```bash | |
| export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space | |
| export EPISODE_SEED=42 | |
| export TASK_ID=easy | |
| python inference.py | |
| # Phase-1 style: one process, three tasks | |
| RUN_ALL_TASKS=1 python inference.py | |
| ``` | |
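For readers who want a feel for what a heuristic policy over the observable fields can look like, here is a minimal sketch. It is not the policy in `inference.py`: `choose_action`, the `max_age` threshold, and the assumption that each item dict carries `key`, `age`, and `last_result` are all illustrative.

```python
def choose_action(items: list[dict], max_age: int = 5) -> dict:
    """Pick one wrapped action for this step from noisy observations.

    Hypothetical rule of thumb:
      - stale report on an old item   -> invalidate (two signals agree)
      - stale report OR an old item   -> refresh (one signal, softer action)
      - otherwise                     -> keep
    """
    if not items:
        raise ValueError("observation contains no items")
    # Focus on the most suspicious item: stale reports first, then oldest.
    target = max(items, key=lambda it: (it["last_result"] == "stale", it["age"]))
    reports_stale = target["last_result"] == "stale"
    is_old = target["age"] >= max_age
    if reports_stale and is_old:
        action_type = "invalidate"
    elif reports_stale or is_old:
        action_type = "refresh"
    else:
        action_type = "keep"
    return {"action": {"type": action_type, "key": target["key"]}}
```

Requiring two agreeing signals before invalidating is one way to avoid overreacting to a single noisy `last_result`.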
| --- | |
| ## Tests (Phase 1 checks) | |
| ```bash | |
| uv run pytest tests/ -q | |
| ``` | |
| Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`. | |
| --- | |
| ## Validation (pre-submission) | |
| ```bash | |
| openenv validate | |
| ./validate-submission.sh 'https://YOUR-SPACE.hf.space' . | |
| docker build . | |
| ``` | |
| --- | |
| ## Repository layout | |
| | Path | Purpose | | |
| |------|---------| | |
| | `env/models.py` | Typed Action / Observation / State | | |
| | `env/cache_environment.py` | `Environment` implementation | | |
| | `env/grader.py` | Step rewards + episode `evaluate_episode` | | |
| | `env/task_graders.py` | **Three named agent graders** (registry) | | |
| | `env/tasks.py` | Task configs + `TASK_MANIFEST` | | |
| | `env/client.py` | Typed WebSocket `EnvClient` | | |
| | `server/app.py` | `create_fastapi_app` + `/tasks` | | |
| | `app.py` | Uvicorn entry shim | | |
| | `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs | | |
| | `openenv.yaml` | Full OpenEnv manifest | | |
| | `tests/` | Phase 1 pytest | | |
| --- | |
| ## Scoring | |
| - **Per-step `reward`:** Shaped (can be negative mid-episode). | |
| - **`final_score`:** In **[0.0, 1.0]** when `done`; combines freshness (correctness), unnecessary invalidations, and action stability (oscillation), the three terms evaluated by `evaluate_episode`. | |
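The sketch below shows one plausible way to fold those three signals into a bounded score. The weights, the rate normalisation, and the function name `episode_score` are assumptions made for illustration; the actual formula lives in `env/grader.py` and may differ.

```python
def episode_score(
    freshness: float,
    unnecessary_invalidations: int,
    action_switches: int,
    total_steps: int,
    w_invalidate: float = 0.3,  # hypothetical penalty weight
    w_switch: float = 0.2,      # hypothetical penalty weight
) -> float:
    """Combine freshness with two penalties and clamp to [0.0, 1.0].

    `freshness` is assumed to already be a fraction in [0, 1];
    the two counts are normalised by the episode length.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    invalidation_rate = unnecessary_invalidations / total_steps
    switch_rate = action_switches / total_steps
    raw = freshness - w_invalidate * invalidation_rate - w_switch * switch_rate
    return min(1.0, max(0.0, raw))
```

Clamping keeps the result inside the advertised range even when the penalties exceed the freshness term.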
| --- | |
| ## Resource notes | |
| Inference and the env server are lightweight (short episodes, small JSON payloads) and suitable for **2 vCPU / 8 GiB**. A `RUN_ALL_TASKS` run is bounded: a fixed 10 steps per episode × 3 tasks, 30 steps in total. | |