	---
	title: Cache Env
	emoji: 🏢
	colorFrom: green
	colorTo: pink
	sdk: docker
	pinned: false
	---

	# Cache invalidation environment (OpenEnv)

	## For judges — what this is

	Problem in one sentence: Backends cache data for speed, so they must decide when to invalidate, softly refresh, or leave a cache entry alone, using only noisy signals (like real monitoring) rather than ground truth.

	Why it matters: Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a short episode an agent can be scored on.

	Our approach: Each episode contains several cache items whose staleness dynamics (TTL, update rate) are hidden. The API exposes only observable fields (`age`, `access_count`, and `last_result`, a noisy hit/stale reading). At each step the agent picks one action for one key: `invalidate`, `refresh`, or `keep`. Step rewards give partial credit; at episode end a programmatic grader sets `final_score` in [0.0, 1.0].
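To make the observation-to-action loop concrete, here is a minimal sketch of a threshold policy that reads only the observable fields. The thresholds, the `key` field on each item, and the name `choose_action` are assumptions made for illustration; the baseline in `inference.py` may work differently.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
STALE_AGE = 8      # age at which a stale-looking item is invalidated
REFRESH_AGE = 4    # age at which a busy item is softly refreshed
HOT_ACCESSES = 5   # access_count at or above which an item counts as "hot"


def choose_action(items: List[Dict]) -> Dict[str, str]:
    """Pick one action for one key using only the observable fields."""

    def urgency(item: Dict) -> float:
        # A (noisy) "stale" reading dominates; age and traffic break ties.
        stale_bonus = 10.0 if item["last_result"] == "stale" else 0.0
        return stale_bonus + item["age"] + 0.1 * item["access_count"]

    target = max(items, key=urgency)
    looks_stale = target["last_result"] == "stale"
    if looks_stale and target["age"] >= STALE_AGE:
        action_type = "invalidate"
    elif looks_stale or (
        target["age"] >= REFRESH_AGE and target["access_count"] >= HOT_ACCESSES
    ):
        action_type = "refresh"
    else:
        action_type = "keep"
    return {"type": action_type, "key": target["key"]}
```

Because `last_result` is noisy, a single stale reading on a young item only triggers a soft `refresh` in this sketch; a full `invalidate` requires the item to be old as well.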

	Tasks: `easy`, `medium`, and `hard`, with more items and higher volatility at each level. Each task registers a dedicated agent grader (`env/task_graders.py`) and is listed in `openenv.yaml` and `GET /tasks`.

	---

	## OpenEnv spec compliance

	- Typed models: `env/models.py` — `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
	- Environment: `env/cache_environment.py` — `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`.
	- HTTP server: `server/app.py` — `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus `GET /tasks` for task + grader discovery.
	- Manifest: `openenv.yaml` — `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
	- Client (WebSocket): `env/client.py` — `CacheInvalidationEnvClient` for typed `EnvClient` usage.
	- Shim: `app.py` re-exports `app` for `uvicorn app:app`.

	Standard routes include `/reset`, `/step`, `/state`, `/schema`, `/metadata`, `/health`, `/openapi.json`, `/mcp` (OpenEnv default).

	---

	## Action & observation

	Action (POST `/step` body, OpenEnv wrapped form):

	```json
	{
	  "action": {
	    "type": "invalidate",
	    "key": "item_0"
	  }
	}
	```

	`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
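A client can check both constraints before sending anything. The sketch below builds the wrapped `/step` body and rejects unknown action types or keys; the helper name `build_step_body` and the `known_keys` argument are hypothetical and not part of the repository.

```python
import json
from typing import Iterable

VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_body(action_type: str, key: str, known_keys: Iterable[str]) -> str:
    """Return the JSON body for POST /step in the OpenEnv wrapped form."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    if key not in set(known_keys):
        raise ValueError(f"key {key!r} is not in the current observation")
    # The action object is nested under the top-level "action" field.
    return json.dumps({"action": {"type": action_type, "key": key}})
```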

	Reset (POST `/reset`):

	```json
	{
	  "seed": 42,
	  "task_id": "easy"
	}
	```

	Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.
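Since every reset field is optional, a small builder that leaves out unset fields covers all cases, including the empty `{}` body. This is a sketch only; `build_reset_body` is a hypothetical helper, and it uses `task_id` rather than `task_name`.

```python
import json
from typing import Optional

TASK_IDS = ("easy", "medium", "hard")


def build_reset_body(seed: Optional[int] = None, task_id: Optional[str] = None) -> str:
    """Return the JSON body for POST /reset, omitting fields that are not set."""
    body = {}
    if seed is not None:
        body["seed"] = seed  # fixes generation for reproducible episodes
    if task_id is not None:
        if task_id not in TASK_IDS:
            raise ValueError(f"unknown task_id: {task_id!r}")
        body["task_id"] = task_id
    # With no task_id, the server samples a task.
    return json.dumps(body)
```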

	Response shape (reset & step):

	```json
	{
	  "observation": {
	    "items": [...],
	    "step": 0,
	    "task_id": "easy",
	    "final_score": null,
	    "done": false
	  },
	  "reward": 0.0,
	  "done": false
	}
	```

	When `done` is `true`, `observation.final_score` is the episode grader output in [0.0, 1.0].
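The response shape suggests a simple driver loop: act until `done`, sum the step rewards, then read `final_score`. The sketch below takes the transport as an injected `step_fn` callable so it runs without a server; `run_episode`, `step_fn`, `policy`, and `max_steps` are illustrative names, not part of the repository.

```python
from typing import Callable, Dict, Optional, Tuple


def run_episode(
    first_response: Dict,
    step_fn: Callable[[Dict], Dict],
    policy: Callable[[Dict], Dict],
    max_steps: int = 100,
) -> Tuple[float, Optional[float]]:
    """Drive one episode; return (sum of step rewards, final_score or None)."""
    response = first_response  # the decoded JSON returned by POST /reset
    total_reward = 0.0
    for _ in range(max_steps):
        if response["done"]:
            break
        action = policy(response["observation"])
        # step_fn stands in for POST /step and receives the wrapped body.
        response = step_fn({"action": action})
        total_reward += response["reward"]
    final = response["observation"]["final_score"] if response["done"] else None
    return total_reward, final
```

The `max_steps` guard is a safety net for the sketch; a well-behaved server ends the episode on its own by setting `done`.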

	---

	## Tasks and graders

	- Registry: `env/task_graders.py` — `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
	- Discovery: `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
	- Episode grader: `env/grader.py` — `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
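One way to get distinct callables that share a rubric is a small factory, sketched below. Only the name `TASK_AGENT_GRADERS` comes from this README; the factory, the `episode_summary` argument, and the placeholder rubric (which just clamps a precomputed score) are assumptions and do not reproduce `evaluate_episode`.

```python
from typing import Callable, Dict

Grader = Callable[[Dict[str, float]], float]


def _shared_rubric(episode_summary: Dict[str, float]) -> float:
    # Placeholder rubric: clamp a precomputed score into [0.0, 1.0].
    return min(1.0, max(0.0, episode_summary["score"]))


def _make_grader(task_id: str) -> Grader:
    # Each call creates a new function object, so every task gets its own
    # callable even though all of them apply the same rubric.
    def grader(episode_summary: Dict[str, float]) -> float:
        return _shared_rubric(episode_summary)

    grader.__name__ = f"grade_{task_id}"
    return grader


TASK_AGENT_GRADERS: Dict[str, Grader] = {
    task_id: _make_grader(task_id) for task_id in ("easy", "medium", "hard")
}
```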

	---

	## Setup & run

	Install (dev):

	```bash
	uv sync --extra dev
	```

	Local server:

	```bash
	uv run server
	# or
	uvicorn app:app --host 0.0.0.0 --port 7860
	```

	Health check:

	```bash
	curl -s -o /dev/null -w '%{http_code}\n' -X POST \
	  -H 'Content-Type: application/json' -d '{}' \
	  'http://127.0.0.1:7860/reset'
	```

	Expect `200`.
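The same check can be scripted with the Python standard library. This is a sketch under the assumption that the server is reachable at the given base URL; `build_reset_request` and `reset_status` are hypothetical helper names.

```python
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the same POST /reset request as the curl command, with an empty JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_status(base_url: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    normal return here means the server answered successfully.
    """
    with urllib.request.urlopen(build_reset_request(base_url), timeout=timeout) as resp:
        return resp.status
```

With the local server running, `reset_status("http://127.0.0.1:7860")` should return `200`.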

	Docker: `docker build -t cache-env .` then run with the same `CMD` as in the `Dockerfile` (`uvicorn app:app`, port 7860).

	---

	## Baseline inference (`inference.py`)

	- Uses OpenEnv HTTP wire format: wrapped `action`, `observation` in responses.
	- Reproducibility: `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
	- All three tasks: `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
	- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the heuristic policy runs (no API key required).

	```bash
	export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space
	export EPISODE_SEED=42
	export TASK_ID=easy
	python inference.py

	# Phase-1 style: one process, three tasks
	RUN_ALL_TASKS=1 python inference.py
	```
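For readers adapting the baseline, the environment variables above could be parsed roughly as follows. The variable names and the `EPISODE_SEED` / `TASK_ID` defaults come from this README; the `InferenceConfig` class, the `ENV_URL` fallback, and the rule that the LLM path needs all three LLM variables are assumptions, and `inference.py` may handle them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class InferenceConfig:
    env_url: str
    episode_seed: int
    task_ids: Tuple[str, ...]
    use_llm: bool


def load_config(environ: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read inference settings from environment variables."""
    run_all = environ.get("RUN_ALL_TASKS") == "1"
    task_ids = ("easy", "medium", "hard") if run_all else (environ.get("TASK_ID", "easy"),)
    # Assumption: the LLM path is enabled only when all three variables are set.
    use_llm = all(environ.get(name) for name in ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME"))
    return InferenceConfig(
        env_url=environ.get("ENV_URL", "http://127.0.0.1:7860"),  # assumed fallback
        episode_seed=int(environ.get("EPISODE_SEED", "42")),
        task_ids=task_ids,
        use_llm=use_llm,
    )
```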

	---

	## Tests (Phase 1 checks)

	```bash
	uv run pytest tests/ -q
	```

	Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.
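As an illustration of the reproducible-seed check, the sketch below compares two resets made with the same seed. It is not taken from `tests/`; `is_reproducible` and the stand-in `fake_reset` (which mimics a seeded `POST /reset` without a server) are hypothetical.

```python
import random
from typing import Callable, Dict


def is_reproducible(reset_fn: Callable[[int], Dict], seed: int = 42) -> bool:
    """Return True if two resets with the same seed yield equal observations."""
    return reset_fn(seed) == reset_fn(seed)


def fake_reset(seed: int) -> Dict:
    # Stand-in for POST /reset: all randomness flows from the seed.
    rng = random.Random(seed)
    return {
        "items": [{"key": f"item_{i}", "age": rng.randint(0, 9)} for i in range(3)],
        "step": 0,
    }
```

In a real test, `reset_fn` would call the server and return the `observation` object from the response.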

	---

	## Validation (pre-submission)

	```bash
	openenv validate
	./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
	docker build .
	```

	---

	## Repository layout

	| Path | Purpose |
	|------|---------|
	| `env/models.py` | Typed Action / Observation / State |
	| `env/cache_environment.py` | `Environment` implementation |
	| `env/grader.py` | Step rewards + episode `evaluate_episode` |
	| `env/task_graders.py` | Three named agent graders (registry) |
	| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
	| `env/client.py` | Typed WebSocket `EnvClient` |
	| `server/app.py` | `create_fastapi_app` + `/tasks` |
	| `app.py` | Uvicorn entry shim |
	| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
	| `openenv.yaml` | Full OpenEnv manifest |
	| `tests/` | Phase 1 pytest |

	---

	## Scoring

	- Per-step `reward`: shaped, so it can be negative mid-episode.
	- `final_score`: in [0.0, 1.0] once `done` is `true`; combines correctness (freshness), a penalty for unnecessary invalidations, and action stability (low oscillation).
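To show how three such components can yield a bounded score, here is a minimal sketch of a weighted combination clamped to [0.0, 1.0]. The weights, the argument names, and the formula itself are assumptions for illustration; the actual rubric lives in `env/grader.py` and may differ.

```python
# Hypothetical weights; they sum to 1.0 so a perfect episode scores 1.0.
W_CORRECT = 0.6
W_INVALIDATION = 0.25
W_STABILITY = 0.15


def final_score(
    correct_fraction: float,
    unnecessary_invalidations: int,
    action_flips: int,
    total_steps: int,
) -> float:
    """Combine correctness, invalidation discipline, and stability into [0.0, 1.0]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of steps that were NOT wasted on unnecessary invalidations.
    invalidation_term = 1.0 - unnecessary_invalidations / total_steps
    # A flip is a change of action between consecutive steps (at most steps - 1).
    stability_term = 1.0 - action_flips / max(total_steps - 1, 1)
    raw = (
        W_CORRECT * correct_fraction
        + W_INVALIDATION * invalidation_term
        + W_STABILITY * stability_term
    )
    return min(1.0, max(0.0, raw))
```

Clamping at the end keeps the result in range even if a component is computed from noisy or out-of-range inputs.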

	---

	## Resource notes

	Inference and the env server are lightweight (short episodes, small JSON payloads) and suitable for 2 vCPU / 8 GiB. A `RUN_ALL_TASKS` run stays bounded at a fixed 10 steps per episode × 3 tasks (30 steps in total).