Parv Pareek
---
title: Cache Env
emoji: 🏢
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
---
# Cache invalidation environment (OpenEnv)
## For judges — what this is
**Problem in one sentence:** Backends cache data for speed, so they must decide **when to invalidate, softly refresh, or leave the cache alone** using **noisy clues** (like real monitoring) rather than the ground truth.
**Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
**Our approach:** Several cache **items** per episode with hidden staleness (TTL, update rate). The API exposes only **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks **one action per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
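To make the observation-to-action step concrete, here is a minimal heuristic sketch. It reads only the observable fields named above (`age`, `access_count`, `last_result`). The `key` field on each item, the `"stale"` string value, and both thresholds are illustrative assumptions, not the environment's actual schema or the baseline policy in `inference.py`.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen for illustration only.
OLD_AGE = 8
HOT_ACCESS_COUNT = 5


def choose_action(items: List[Dict]) -> Dict[str, str]:
    """Pick one action for one key using only noisy observable fields."""

    def suspicion(item: Dict) -> float:
        # A stale reading is noisy evidence; age adds weight to it.
        stale_hint = 1.0 if item.get("last_result") == "stale" else 0.0
        return stale_hint * 10 + item.get("age", 0)

    # Act on the item that looks most likely to be stale.
    target = max(items, key=suspicion)
    looks_stale = target.get("last_result") == "stale"
    is_old = target.get("age", 0) >= OLD_AGE
    is_hot = target.get("access_count", 0) >= HOT_ACCESS_COUNT

    if looks_stale and is_old:
        action = "invalidate"  # two independent clues agree
    elif looks_stale or (is_old and is_hot):
        action = "refresh"     # softer response to weaker evidence
    else:
        action = "keep"        # no clue worth acting on
    return {"type": action, "key": target["key"]}
```

The returned dictionary has the same `type` / `key` shape as the action body used by `/step`.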
**Tasks:** **easy → medium → hard**, with more items and higher volatility at each level. Each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.
---
## OpenEnv spec compliance
- **Typed models:** `env/models.py`: `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
- **Environment:** `env/cache_environment.py`: `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`.
- **HTTP server:** `server/app.py`: `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
- **Manifest:** `openenv.yaml`: `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
- **Client (WebSocket):** `env/client.py`: `CacheInvalidationEnvClient` for typed `EnvClient` usage.
- **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.
Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).
---
## Action & observation
**Action (POST `/step` body, OpenEnv wrapped form):**
```json
{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}
```
`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
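A client can guard against malformed actions before sending them. The helper below is a small sketch that builds the wrapped `/step` body shown above; the function name `build_step_payload` is hypothetical and not part of the repository.

```python
import json

# The three action types listed in this README.
VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str) -> str:
    """Return the JSON body for POST /step in the OpenEnv wrapped form."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    return json.dumps({"action": {"type": action_type, "key": key}})
```

This checks only the action type; whether `key` matches an item in the current observation is left to the server.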
**Reset (POST `/reset`):**
```json
{
  "seed": 42,
  "task_id": "easy"
}
```
Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
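The optional fields can be assembled as follows. This is a sketch only: `build_reset_payload` is a hypothetical helper, and it uses `task_id` rather than `task_name` for simplicity.

```python
from typing import Any, Dict, Optional

TASK_IDS = ("easy", "medium", "hard")


def build_reset_payload(
    seed: Optional[int] = None, task_id: Optional[str] = None
) -> Dict[str, Any]:
    """Build a POST /reset body, leaving out any field that was not given."""
    if task_id is not None and task_id not in TASK_IDS:
        raise ValueError(f"unknown task_id: {task_id!r}")
    payload: Dict[str, Any] = {}
    if seed is not None:
        payload["seed"] = seed        # reproducible item generation
    if task_id is not None:
        payload["task_id"] = task_id  # omitted: the env samples a task
    return payload
```

Calling it with no arguments yields `{}`, which leaves the task choice to the environment.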
**Response shape (reset & step):**
```json
{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}
```
When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
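A client might unpack this response as sketched below. The `StepResult` dataclass and `parse_step_response` function are hypothetical names for illustration; the typed client in `env/client.py` may expose this differently.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StepResult:
    reward: float
    done: bool
    final_score: Optional[float]


def parse_step_response(body: Dict[str, Any]) -> StepResult:
    """Extract reward, done, and (only when done) the final score."""
    observation = body.get("observation", {})
    done = bool(body.get("done", False))
    # final_score is only meaningful once the episode has ended.
    final_score = observation.get("final_score") if done else None
    if final_score is not None and not 0.0 <= final_score <= 1.0:
        raise ValueError(f"final_score out of range: {final_score}")
    reward = body.get("reward")
    return StepResult(
        reward=float(reward) if reward is not None else 0.0,
        done=done,
        final_score=final_score,
    )
```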
---
## Tasks and graders
- **Registry:** `env/task_graders.py`: `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
- **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
- **Episode grader:** `env/grader.py`: `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
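To illustrate how three such signals could fold into a single score in [0.0, 1.0], here is a hedged sketch. It is not the logic of `evaluate_episode`: the function name, its inputs, the weights `0.3` and `0.2`, and the definition of oscillation as "action changed between consecutive steps" are all assumptions made for this example.

```python
from typing import Sequence


def sketch_episode_score(
    fresh_flags: Sequence[bool],
    actions: Sequence[str],
    needed_flags: Sequence[bool],
) -> float:
    """Combine freshness, unnecessary invalidations, and oscillation."""
    steps = len(actions)
    if steps == 0:
        return 0.0
    # Fraction of observations where the served item was fresh.
    freshness = sum(fresh_flags) / len(fresh_flags) if fresh_flags else 0.0
    # Fraction of steps that invalidated an item that was not stale.
    unnecessary = sum(
        1
        for action, needed in zip(actions, needed_flags)
        if action == "invalidate" and not needed
    ) / steps
    # Fraction of consecutive step pairs where the action changed.
    flips = sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)
    oscillation = flips / (steps - 1) if steps > 1 else 0.0
    # Hypothetical weights; clamp so the result stays in [0.0, 1.0].
    raw = freshness - 0.3 * unnecessary - 0.2 * oscillation
    return max(0.0, min(1.0, raw))
```

Clamping at the end is what guarantees the documented score range regardless of how the penalties are weighted.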
---
## Setup & run
**Install (dev):**
```bash
uv sync --extra dev
```
**Local server:**
```bash
uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860
```
**Health check:**
```bash
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'
```
Expect `200`.
**Docker:** `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port **7860**).
---
## Baseline inference (`inference.py`)
- Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
- **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
- **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).
```bash
export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py
# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
```
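The control flow of one episode over the HTTP wire format can be sketched as below. This is not the code in `inference.py`: `run_episode`, the injected `post` callable, and the `policy` signature are assumptions for illustration, and the 10-step cap follows the fixed episode length mentioned under resource notes.

```python
from typing import Any, Callable, Dict, List, Optional

# post(path, json_body) -> parsed JSON response; the transport is injected
# so the loop can be exercised without a running server.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]
Policy = Callable[[List[Dict[str, Any]]], Dict[str, str]]


def run_episode(
    post: Post,
    policy: Policy,
    seed: int = 42,
    task_id: str = "easy",
    max_steps: int = 10,
) -> Optional[float]:
    """Reset, step until done (or the cap), and return final_score."""
    body = post("/reset", {"seed": seed, "task_id": task_id})
    for _ in range(max_steps):
        if body.get("done"):
            break
        # The policy sees only the observable items and returns one action.
        action = policy(body["observation"]["items"])
        body = post("/step", {"action": action})
    return body["observation"].get("final_score")
```

In practice `post` would wrap an HTTP client pointed at `ENV_URL`; passing a stub instead makes the loop easy to unit-test.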
---
## Tests (Phase 1 checks)
```bash
uv run pytest tests/ -q
```
Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.
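As an illustration of the first check, the sketch below validates a `/tasks`-style payload. The payload shape is assumed: a `tasks` list whose entries carry an `id` field (hypothetical) and the `grader` flag described for `openenv.yaml`. The real tests in `tests/` may assert this differently.

```python
from typing import Any, Dict, List


def check_tasks_payload(payload: Dict[str, Any], minimum: int = 3) -> List[str]:
    """Return ids of tasks that declare a grader; fail if there are too few."""
    tasks = payload.get("tasks", [])
    # Keep only tasks whose grader flag is truthy.
    graded = [task["id"] for task in tasks if task.get("grader")]
    if len(graded) < minimum:
        raise AssertionError(
            f"expected at least {minimum} graded tasks, found {len(graded)}"
        )
    return graded
```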
---
## Validation (pre-submission)
```bash
openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .
```
---
## Repository layout
| Path | Purpose |
|------|---------|
| `env/models.py` | Typed Action / Observation / State |
| `env/cache_environment.py` | `Environment` implementation |
| `env/grader.py` | Step rewards + episode `evaluate_episode` |
| `env/task_graders.py` | **Three named agent graders** (registry) |
| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
| `env/client.py` | Typed WebSocket `EnvClient` |
| `server/app.py` | `create_fastapi_app` + `/tasks` |
| `app.py` | Uvicorn entry shim |
| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
| `openenv.yaml` | Full OpenEnv manifest |
| `tests/` | Phase 1 pytest |
---
## Scoring
- **Per-step `reward`:** Shaped (can be negative mid-episode).
- **`final_score`:** In **[0.0, 1.0]** when `done`; combines freshness (correctness), a penalty for unnecessary invalidations, and action stability (low oscillation).
---
## Resource notes
Inference and the env server are lightweight (short episodes, small JSON) and suitable for **2 vCPU / 8 GiB**. A `RUN_ALL_TASKS` run is bounded: a fixed 10 steps per episode × 3 tasks = 30 steps in total.