Parv Pareek


title: Cache Env
emoji: 🏢
colorFrom: green
colorTo: pink
sdk: docker
pinned: false

Cache invalidation environment (OpenEnv)

For judges — what this is

Problem in one sentence: Backends cache data to go fast; they must decide when to invalidate, softly refresh, or leave cache alone using noisy clues (like real monitoring), not the ground truth.

Why it matters: Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a short episode an agent can be scored on.

Our approach: Each episode holds several cache items with hidden staleness parameters (TTL, update rate). The API exposes only observable fields: age, access_count, and last_result (a noisy hit/stale signal). On each step the agent picks one action for one key: invalidate, refresh, or keep. Step rewards give partial credit; at episode end a programmatic grader sets final_score in [0.0, 1.0].
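
To make the observable-fields-only decision concrete, here is a minimal sketch of a threshold policy. It is not the baseline shipped in inference.py: the item shape (a dict with key, age, access_count, last_result) is assumed from the field names above, and STALE_AGE and HOT_ACCESS_COUNT are hypothetical thresholds picked only for illustration.

```python
from typing import Literal, TypedDict

Action = Literal["invalidate", "refresh", "keep"]


class Item(TypedDict):
    # Assumed item shape; only the field names come from the README.
    key: str
    age: float
    access_count: int
    last_result: str  # "hit" or "stale", reported with noise


# Hypothetical thresholds, not taken from the environment.
STALE_AGE = 5.0
HOT_ACCESS_COUNT = 10


def choose_action(item: Item) -> Action:
    """Pick an action for one item using observable fields only."""
    looks_stale = item["last_result"] == "stale"
    old = item["age"] >= STALE_AGE
    if looks_stale and old:
        return "invalidate"  # two independent clues agree
    if looks_stale or (old and item["access_count"] >= HOT_ACCESS_COUNT):
        return "refresh"  # a single clue, or an old and frequently read key
    return "keep"


def pick_step(items: list[Item]) -> tuple[str, Action]:
    """One key per step: act on the most urgent item, oldest first on ties."""
    rank = {"invalidate": 0, "refresh": 1, "keep": 2}
    best = min(items, key=lambda it: (rank[choose_action(it)], -it["age"]))
    return best["key"], choose_action(best)
```

Requiring two clues before invalidating is one way to stay robust to the noise in last_result; a single noisy "stale" reading only triggers the cheaper refresh.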

Tasks: easy → medium → hard, with more items and higher volatility at each level. Each task registers a dedicated agent grader (env/task_graders.py) and is listed in openenv.yaml and GET /tasks.

OpenEnv spec compliance

Typed models: env/models.py — CacheAction, CacheObservation, CacheState (Pydantic, openenv.core.env_server bases).
Environment: env/cache_environment.py — CacheInvalidationEnvironment implements reset / step / state / get_metadata.
HTTP server: server/app.py — create_fastapi_app(...) from openenv-core (singleton env instance for stateful HTTP), plus GET /tasks for task + grader discovery.
Manifest: openenv.yaml — spec_version, tasks (each with grader: true, grader_callable, score_range), endpoints, app: server.app:app, port: 7860.
Client (WebSocket): env/client.py — CacheInvalidationEnvClient for typed EnvClient usage.
Shim: app.py re-exports app for uvicorn app:app.

Standard routes include /reset, /step, /state, /schema, /metadata, /health, /openapi.json, /mcp (OpenEnv default).

Action & observation

Action (POST /step body, OpenEnv wrapped form):

{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}

type is one of: invalidate, refresh, keep. key must match an item in the current observation.
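
A client can enforce both rules before sending a request. The helper below is a sketch: build_step_payload is a hypothetical name, and it assumes each observation item is a dict carrying a "key" field.

```python
VALID_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str, items: list[dict]) -> dict:
    """Return the wrapped POST /step body, or raise ValueError on bad input."""
    if action_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {VALID_TYPES}, got {action_type!r}")
    # Assumption: every item in the current observation exposes a "key" field.
    known_keys = {item["key"] for item in items}
    if key not in known_keys:
        raise ValueError(f"key {key!r} is not in the current observation")
    return {"action": {"type": action_type, "key": key}}
```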

Reset (POST /reset):

{
  "seed": 42,
  "task_id": "easy"
}

Pass task_id or task_name with one of easy | medium | hard; omit both and the environment samples a task. seed makes episode generation reproducible.
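
The reset call can be sketched with the standard library alone. build_reset_payload and post_json are hypothetical helper names; the sketch covers task_id only (task_name is left out) and assumes the server accepts a plain JSON POST as in the example above.

```python
import json
import urllib.request
from typing import Optional

TASK_IDS = ("easy", "medium", "hard")


def build_reset_payload(seed: Optional[int] = None,
                        task_id: Optional[str] = None) -> dict:
    """Build the POST /reset body; omitted fields are left to the server."""
    payload: dict = {}
    if seed is not None:
        payload["seed"] = seed
    if task_id is not None:
        if task_id not in TASK_IDS:
            raise ValueError(f"task_id must be one of {TASK_IDS}, got {task_id!r}")
        payload["task_id"] = task_id
    return payload


def post_json(base_url: str, route: str, payload: dict,
              timeout: float = 10.0) -> dict:
    """POST a JSON body to base_url + route and decode the JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, post_json("http://127.0.0.1:7860", "/reset", build_reset_payload(42, "easy")) would start a seeded easy episode.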

Response shape (reset & step):

{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}

When done is true, observation.final_score is the episode grader output in [0.0, 1.0].
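
Putting reset, step, and the done flag together, an episode loop might look like the sketch below. The post argument is a hypothetical transport (any callable that sends a JSON body to a route and returns the decoded response), the always-keep policy is a placeholder, and the default of 10 steps mirrors the fixed episode length mentioned under Resource notes.

```python
from typing import Callable, Optional

# (route, JSON body) -> decoded JSON response; the transport is an assumption.
Post = Callable[[str, dict], dict]


def run_episode(post: Post, seed: int, task_id: str,
                max_steps: int = 10) -> tuple[float, Optional[float]]:
    """Drive one episode; return (sum of step rewards, final_score or None)."""
    response = post("/reset", {"seed": seed, "task_id": task_id})
    total_reward = 0.0
    for _ in range(max_steps):
        if response["done"]:
            break
        items = response["observation"]["items"]
        # Placeholder policy: always keep the first item (assumes a "key" field).
        action = {"type": "keep", "key": items[0]["key"]}
        response = post("/step", {"action": action})
        total_reward += response.get("reward") or 0.0
    # final_score is only meaningful once the episode has finished.
    final_score = response["observation"]["final_score"] if response["done"] else None
    return total_reward, final_score
```

Injecting the transport keeps the loop testable without a running server; in practice post would wrap an HTTP client pointed at the environment URL.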

Tasks and graders

Registry: env/task_graders.py — TASK_AGENT_GRADERS maps easy / medium / hard to distinct callables (same rubric; difficulty comes from env dynamics).
Discovery: GET /tasks returns tasks, graders, and grader_registry for automated validation.
Episode grader: env/grader.py — evaluate_episode (freshness, unnecessary invalidations, oscillation).
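
As an illustration of how the three rubric terms could combine into one bounded score, here is a sketch. It is not the evaluate_episode in env/grader.py: the StepRecord shape, the weights, and the exact definition of each term are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    # Hypothetical per-step log entry; the real grader's inputs may differ.
    key: str
    action: str      # "invalidate", "refresh", or "keep"
    was_stale: bool  # ground truth, visible to the grader but not the agent


# Hypothetical weights; the README does not state the real ones.
W_FRESH, W_WASTE, W_OSC = 0.6, 0.2, 0.2


def sketch_evaluate_episode(history: list[StepRecord]) -> float:
    """Combine freshness, wasted invalidations, and oscillation into [0, 1]."""
    if not history:
        return 0.0
    n = len(history)
    # Freshness: acted on stale items, left fresh items alone.
    correct = sum((r.action != "keep") == r.was_stale for r in history)
    freshness = correct / n
    # Unnecessary invalidations: invalidated an item that was still fresh.
    wasted = sum(r.action == "invalidate" and not r.was_stale for r in history) / n
    # Oscillation: the action on a key differs from the previous action on it.
    flips = 0
    last_action: dict[str, str] = {}
    for r in history:
        if r.key in last_action and last_action[r.key] != r.action:
            flips += 1
        last_action[r.key] = r.action
    oscillation = flips / max(n - 1, 1)
    score = W_FRESH * freshness + W_WASTE * (1 - wasted) + W_OSC * (1 - oscillation)
    return min(1.0, max(0.0, score))  # clamp to the documented score range
```

Since the registry maps all three tasks to the same rubric, a sketch like this would be shared across easy, medium, and hard, with difficulty coming from the episode history itself.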

Setup & run

Install (dev):

uv sync --extra dev

Local server:

uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860

Health check (POST /reset with an empty body):

curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'

Expect 200.

Docker: build the image with docker build -t cache-env . and run it with the Dockerfile's own CMD (uvicorn app:app on port 7860), publishing that port, for example docker run -p 7860:7860 cache-env.

Baseline inference (`inference.py`)

Uses OpenEnv HTTP wire format: wrapped action, observation in responses.
Reproducibility: EPISODE_SEED (default 42) and TASK_ID (default easy).
All three tasks: RUN_ALL_TASKS=1 runs easy, then medium, then hard with the same seed (fast on CPU; well under 20 minutes).
Optional LLM path: set HF_TOKEN, API_BASE_URL, MODEL_NAME; otherwise the heuristic policy runs (no API key required).

export ENV_URL='http://127.0.0.1:7860'   # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py

# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
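
The environment variables above could be resolved along these lines. This is a sketch, not the parsing code in inference.py: InferenceConfig and load_config are hypothetical names, the ENV_URL fallback is assumed, and so is the rule that the LLM path needs all three of HF_TOKEN, API_BASE_URL, and MODEL_NAME.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    env_url: str
    seed: int
    task_ids: tuple[str, ...]
    use_llm: bool


def load_config(env: Optional[Mapping[str, str]] = None) -> InferenceConfig:
    """Resolve settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # RUN_ALL_TASKS=1 runs easy, then medium, then hard with the same seed.
    if env.get("RUN_ALL_TASKS") == "1":
        task_ids: tuple[str, ...] = ("easy", "medium", "hard")
    else:
        task_ids = (env.get("TASK_ID", "easy"),)
    # Assumption: the LLM path is used only when all three variables are set.
    use_llm = all(env.get(name) for name in ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME"))
    return InferenceConfig(
        # Assumption: fall back to the local server when ENV_URL is unset.
        env_url=env.get("ENV_URL", "http://127.0.0.1:7860").rstrip("/"),
        seed=int(env.get("EPISODE_SEED", "42")),
        task_ids=task_ids,
        use_llm=use_llm,
    )
```

Passing the mapping explicitly makes the defaults easy to check without touching the real process environment.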

Tests (Phase 1 checks)

uv run pytest tests/ -q

Covers: GET /tasks (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode final_score.
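
The first two checks can be expressed as a pure function over the decoded GET /tasks response. The sketch below is not taken from tests/: it assumes each task entry mirrors the manifest fields (grader, score_range), and the "id" field used for messages is a hypothetical name.

```python
def check_tasks_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    problems: list[str] = []
    tasks = payload.get("tasks", [])
    if len(tasks) < 3:
        problems.append(f"expected at least 3 tasks, found {len(tasks)}")
    for task in tasks:
        name = task.get("id", "<unnamed>")  # "id" is an assumed field name
        if not task.get("grader"):
            problems.append(f"{name}: no grader registered")
        # Assumption: score_range is a two-element sequence [low, high].
        if list(task.get("score_range", [])) != [0.0, 1.0]:
            problems.append(f"{name}: score_range is not [0.0, 1.0]")
    return problems
```

Returning every problem at once, rather than failing on the first, gives a fuller report when a submission is validated automatically.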

Validation (pre-submission)

openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .

Repository layout

| Path | Purpose |
| --- | --- |
| `env/models.py` | Typed Action / Observation / State |
| `env/cache_environment.py` | `Environment` implementation |
| `env/grader.py` | Step rewards + episode `evaluate_episode` |
| `env/task_graders.py` | Three named agent graders (registry) |
| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
| `env/client.py` | Typed WebSocket `EnvClient` |
| `server/app.py` | `create_fastapi_app` + `/tasks` |
| `app.py` | Uvicorn entry shim |
| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
| `openenv.yaml` | Full OpenEnv manifest |
| `tests/` | Phase 1 pytest |

Scoring

Per-step reward: Shaped (can be negative mid-episode).
final_score: In [0.0, 1.0], set when done is true; combines correctness (freshness), unnecessary invalidations, and action stability (oscillation).

Resource notes

Inference and the env server are lightweight (short episodes, small JSON payloads) and suit 2 vCPU / 8 GiB. A RUN_ALL_TASKS run stays bounded: a fixed 10 steps per episode × 3 tasks, 30 steps in total.