title: Cache Env
emoji: π’
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
Cache invalidation environment (OpenEnv)
For judges: what this is
Problem in one sentence: Backends cache data to go fast; they must decide when to invalidate, softly refresh, or leave cache alone using noisy clues (like real monitoring), not the ground truth.
Why it matters: Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a short episode an agent can be scored on.
Our approach: Several cache items per episode with hidden staleness (TTL, update rate). The API exposes only observable fields (age, access_count, last_result as hit/stale with noise). The agent picks one action per step for one key: invalidate, refresh, or keep. Step rewards give partial credit; at episode end a programmatic grader sets final_score in [0.0, 1.0].
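The split between hidden state and noisy observation described above can be sketched roughly as follows. This is an illustration only, not the code in env/cache_environment.py: the HiddenItem class, the observe function, and the noise probability are hypothetical names and values chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class HiddenItem:
    """Ground truth the agent never sees (hypothetical fields)."""
    key: str
    ttl: int               # steps until the cached copy goes stale
    age: int = 0           # steps since the last refresh or invalidate
    access_count: int = 0

    @property
    def is_stale(self) -> bool:
        return self.age >= self.ttl


def observe(item: HiddenItem, rng: random.Random, noise: float = 0.2) -> dict:
    """Expose only observable fields; flip last_result with probability `noise`."""
    result = "stale" if item.is_stale else "hit"
    if rng.random() < noise:
        # Noisy signal: report the opposite of the truth.
        result = "hit" if result == "stale" else "stale"
    return {
        "key": item.key,
        "age": item.age,
        "access_count": item.access_count,
        "last_result": result,
    }
```

The agent only ever receives the dictionary returned by observe, so it has to infer staleness from age and a possibly wrong last_result.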
Tasks: easy → medium → hard, with more items and higher volatility at each level; each task registers a dedicated agent grader (env/task_graders.py) and is listed in openenv.yaml and GET /tasks.
OpenEnv spec compliance
- Typed models: env/models.py defines CacheAction, CacheObservation, CacheState (Pydantic, openenv.core.env_server bases).
- Environment: env/cache_environment.py, where CacheInvalidationEnvironment implements reset / step / state / get_metadata.
- HTTP server: server/app.py uses create_fastapi_app(...) from openenv-core (singleton env instance for stateful HTTP), plus GET /tasks for task + grader discovery.
- Manifest: openenv.yaml declares spec_version, tasks (each with grader: true, grader_callable, score_range), endpoints, app: server.app:app, port: 7860.
- Client (WebSocket): env/client.py provides CacheInvalidationEnvClient for typed EnvClient usage.
- Shim: app.py re-exports app for uvicorn app:app.
Standard routes include /reset, /step, /state, /schema, /metadata, /health, /openapi.json, /mcp (OpenEnv default).
Action & observation
Action (POST /step body, OpenEnv wrapped form):
{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}
type is one of: invalidate, refresh, keep. key must match an item in the current observation.
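As a small sketch of the rules above, a client could validate an action before sending it. The helper name build_step_payload and the known_keys parameter are hypothetical; only the wrapped "action" shape and the three action types come from this README.

```python
import json
from typing import Iterable

VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str, known_keys: Iterable[str]) -> str:
    """Return the JSON body for POST /step, or raise ValueError on bad input."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"type must be one of {VALID_ACTION_TYPES}, got {action_type!r}")
    if key not in set(known_keys):
        raise ValueError(f"key {key!r} is not in the current observation")
    # OpenEnv wrapped form: the action sits under the "action" field.
    return json.dumps({"action": {"type": action_type, "key": key}})
```

For example, build_step_payload("invalidate", "item_0", ["item_0", "item_1"]) produces the same body as the JSON shown above.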
Reset (POST /reset):
{
  "seed": 42,
  "task_id": "easy"
}
Use task_id or task_name with easy | medium | hard. Omit both to sample a task. seed makes generation reproducible.
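The effect of seed can be illustrated with a seeded generator: the same seed and task yield the same episode configuration. This is a sketch under assumptions, not the generator in env/tasks.py; sample_episode_config, the item counts, and the TTL range are hypothetical.

```python
import random
from typing import Optional

TASK_IDS = ("easy", "medium", "hard")
# Hypothetical item counts per task, for illustration only.
ITEMS_PER_TASK = {"easy": 3, "medium": 5, "hard": 8}


def sample_episode_config(seed: Optional[int] = None, task_id: Optional[str] = None) -> dict:
    """Draw a task (if none is given) and hidden TTLs from a seeded RNG."""
    rng = random.Random(seed)
    chosen = task_id if task_id is not None else rng.choice(TASK_IDS)
    if chosen not in TASK_IDS:
        raise ValueError(f"unknown task: {chosen!r}")
    ttls = [rng.randint(2, 10) for _ in range(ITEMS_PER_TASK[chosen])]
    return {"task_id": chosen, "ttls": ttls}
```

Calling sample_episode_config(42, "easy") twice returns identical dictionaries, which is the property a reproducibility test would check.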
Response shape (reset & step):
{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}
When done is true, observation.final_score is the episode grader output in [0.0, 1.0].
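A client loop over this response shape might look like the sketch below. The transport is passed in as a callable so the example stays independent of any HTTP library; run_episode, the post callable, the choose callable, and max_steps are hypothetical names, not part of the repository.

```python
from typing import Callable, Optional

# (path, json_body) -> parsed JSON response; supplied by the caller.
Post = Callable[[str, dict], dict]


def run_episode(
    post: Post,
    choose: Callable[[dict], dict],
    seed: int = 42,
    task_id: str = "easy",
    max_steps: int = 100,
) -> Optional[float]:
    """Reset, step until done, and return observation.final_score (None if never done)."""
    result = post("/reset", {"seed": seed, "task_id": task_id})
    for _ in range(max_steps):
        if result["done"]:
            # final_score is only populated once the episode has ended.
            return result["observation"]["final_score"]
        action = choose(result["observation"])
        result = post("/step", {"action": action})
    return result["observation"]["final_score"] if result["done"] else None
```

The max_steps guard is a safety net for the sketch; it stops the loop if a server never reports done.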
Tasks and graders
- Registry: env/task_graders.py, where TASK_AGENT_GRADERS maps easy / medium / hard to distinct callables (same rubric; difficulty comes from env dynamics).
- Discovery: GET /tasks returns tasks, graders, and grader_registry for automated validation.
- Episode grader: env/grader.py provides evaluate_episode (freshness, unnecessary invalidations, oscillation).
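The registry idea (distinct callables per task that share one rubric) could be sketched as below, together with the kind of check an automated validator might run. Only the name TASK_AGENT_GRADERS comes from this README; the grader signatures, the episode dictionary, and validate_registry are assumptions made for the example.

```python
from typing import Callable, Dict

Grader = Callable[[dict], float]


def _rubric(episode: dict) -> float:
    """Shared rubric stand-in: clamp a precomputed score into [0.0, 1.0]."""
    return max(0.0, min(1.0, float(episode["score"])))


# Distinct callables per task, all delegating to the same rubric.
def grade_easy(episode: dict) -> float:
    return _rubric(episode)


def grade_medium(episode: dict) -> float:
    return _rubric(episode)


def grade_hard(episode: dict) -> float:
    return _rubric(episode)


TASK_AGENT_GRADERS: Dict[str, Grader] = {
    "easy": grade_easy,
    "medium": grade_medium,
    "hard": grade_hard,
}


def validate_registry(registry: Dict[str, Grader], sample_episode: dict) -> None:
    """Raise ValueError unless graders are distinct and score within [0.0, 1.0]."""
    if len({id(fn) for fn in registry.values()}) != len(registry):
        raise ValueError("each task needs its own grader callable")
    for task_id, grader in registry.items():
        score = grader(sample_episode)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{task_id}: score {score} outside [0.0, 1.0]")
```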
Setup & run
Install (dev):
uv sync --extra dev
Local server:
uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860
Health check:
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
-H 'Content-Type: application/json' -d '{}' \
'http://127.0.0.1:7860/reset'
Expect 200.
Docker: docker build -t cache-env . then run with the same CMD as in the Dockerfile (uvicorn app:app, port 7860).
Baseline inference (inference.py)
- Uses OpenEnv HTTP wire format: wrapped action, observation in responses.
- Reproducibility: EPISODE_SEED (default 42) and TASK_ID (default easy).
- All three tasks: RUN_ALL_TASKS=1 runs easy, then medium, then hard with the same seed (fast on CPU; well under 20 minutes).
- Optional LLM path: set HF_TOKEN, API_BASE_URL, MODEL_NAME; otherwise the heuristic policy runs (no API key required).
export ENV_URL='http://127.0.0.1:7860' # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py
# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
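For orientation, a heuristic policy over the observable fields could look like the sketch below. It is not the policy shipped in inference.py: the function name, the age thresholds, and the assumption that each item carries key, age, and last_result fields are all illustrative.

```python
def heuristic_action(observation: dict, refresh_age: int = 3, invalidate_age: int = 6) -> dict:
    """Pick one key and one action from noisy observable fields (hypothetical thresholds)."""
    items = observation["items"]
    # Prefer items whose last read looked stale; break ties by age.
    target = max(items, key=lambda it: (it["last_result"] == "stale", it["age"]))
    looks_stale = target["last_result"] == "stale"
    if looks_stale and target["age"] >= invalidate_age:
        kind = "invalidate"   # old and reported stale: drop it
    elif looks_stale or target["age"] >= refresh_age:
        kind = "refresh"      # one weak signal: soft refresh
    else:
        kind = "keep"         # no evidence of staleness
    return {"type": kind, "key": target["key"]}
```

Requiring two signals before invalidating is one way to limit unnecessary invalidations when last_result is noisy.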
Tests (Phase 1 checks)
uv run pytest tests/ -q
Covers: GET /tasks (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode final_score.
Validation (pre-submission)
openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .
Repository layout
| Path | Purpose |
|---|---|
| env/models.py | Typed Action / Observation / State |
| env/cache_environment.py | Environment implementation |
| env/grader.py | Step rewards + episode evaluate_episode |
| env/task_graders.py | Three named agent graders (registry) |
| env/tasks.py | Task configs + TASK_MANIFEST |
| env/client.py | Typed WebSocket EnvClient |
| server/app.py | create_fastapi_app + /tasks |
| app.py | Uvicorn entry shim |
| inference.py | Baseline + [START]/[STEP]/[END] logs |
| openenv.yaml | Full OpenEnv manifest |
| tests/ | Phase 1 pytest |
Scoring
- Per-step reward: shaped (can be negative mid-episode).
- final_score: in [0.0, 1.0] when done; combines correctness, unnecessary invalidations, and action stability.
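One way to combine the three components into a bounded score is a clamped weighted sum, sketched below. The weights, the normalisation, and the function names are assumptions for illustration; the actual rubric lives in env/grader.py and may differ.

```python
from typing import Sequence, Tuple


def count_oscillations(actions: Sequence[str]) -> int:
    """Count switches between consecutive actions (a hypothetical stability measure)."""
    return sum(1 for prev, cur in zip(actions, actions[1:]) if prev != cur)


def final_score(
    correct_steps: int,
    unnecessary_invalidations: int,
    oscillations: int,
    total_steps: int,
    weights: Tuple[float, float, float] = (0.6, 0.25, 0.15),  # hypothetical weights
) -> float:
    """Blend correctness, waste, and churn into a score clamped to [0.0, 1.0]."""
    if total_steps <= 0:
        return 0.0
    correctness = correct_steps / total_steps
    waste = unnecessary_invalidations / total_steps
    churn = oscillations / max(1, total_steps - 1)
    w_correct, w_waste, w_churn = weights
    raw = w_correct * correctness + w_waste * (1.0 - waste) + w_churn * (1.0 - churn)
    return max(0.0, min(1.0, raw))
```

The clamp keeps the result inside the stated range even if the inputs are inconsistent (for example, more invalidations than steps).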
Resource notes
Inference and the env server are lightweight (short episodes, small JSON). Suitable for 2 vCPU / 8 GiB; keep RUN_ALL_TASKS episodes bounded (fixed 10 steps per episode × 3 tasks).