Parv Pareek
metadata
title: Cache Env
emoji: 🏒
colorFrom: green
colorTo: pink
sdk: docker
pinned: false

Cache invalidation environment (OpenEnv)

For judges: what this is

Problem in one sentence: Backends cache data to go fast; they must decide when to invalidate, softly refresh, or leave cache alone using noisy clues (like real monitoring), not the ground truth.

Why it matters: Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a short episode an agent can be scored on.

Our approach: Several cache items per episode with hidden staleness (TTL, update rate). The API exposes only observable fields (age, access_count, last_result as hit/stale with noise). The agent picks one action per step for one key: invalidate, refresh, or keep. Step rewards give partial credit; at episode end a programmatic grader sets final_score in [0.0, 1.0].
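
To make the split between hidden state and noisy observation concrete, here is a minimal sketch. The class name HiddenItem, the function observe, the noise parameter, and the "key" field are illustrative assumptions and are not taken from env/cache_environment.py; only age, access_count, and last_result come from the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class HiddenItem:
    """Ground truth the agent never sees (names are hypothetical)."""
    key: str
    ttl: int               # steps until the cached value goes stale
    age: int = 0           # steps since the last invalidate/refresh
    access_count: int = 0

    @property
    def is_stale(self) -> bool:
        return self.age >= self.ttl


def observe(item: HiddenItem, rng: random.Random, noise: float = 0.1) -> dict:
    """Expose only observable fields; last_result is flipped with probability `noise`."""
    result = "stale" if item.is_stale else "hit"
    if rng.random() < noise:
        # Simulate an unreliable monitoring signal by reporting the opposite result.
        result = "hit" if result == "stale" else "stale"
    return {
        "key": item.key,
        "age": item.age,
        "access_count": item.access_count,
        "last_result": result,
    }
```

The ttl field never appears in the returned dict, so an agent has to infer staleness from age, access_count, and a signal that is sometimes wrong.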

Tasks: easy → medium → hard, with more items and higher volatility; each task registers a dedicated agent grader (env/task_graders.py) and is listed in openenv.yaml and GET /tasks.


OpenEnv spec compliance

  • Typed models (env/models.py): CacheAction, CacheObservation, CacheState (Pydantic, openenv.core.env_server bases).
  • Environment (env/cache_environment.py): CacheInvalidationEnvironment implements reset / step / state / get_metadata.
  • HTTP server (server/app.py): create_fastapi_app(...) from openenv-core (singleton env instance for stateful HTTP), plus GET /tasks for task + grader discovery.
  • Manifest (openenv.yaml): spec_version, tasks (each with grader: true, grader_callable, score_range), endpoints, app: server.app:app, port: 7860.
  • WebSocket client (env/client.py): CacheInvalidationEnvClient for typed EnvClient usage.
  • Shim: app.py re-exports app for uvicorn app:app.

Standard routes include /reset, /step, /state, /schema, /metadata, /health, /openapi.json, /mcp (OpenEnv default).


Action & observation

Action (POST /step body, OpenEnv wrapped form):

{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}

type is one of: invalidate, refresh, keep. key must match an item in the current observation.
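
A small client-side helper can enforce both rules before a request is sent. This is a sketch, not part of the repository: build_step_payload is a hypothetical name, and it assumes each entry in observation["items"] carries a "key" field.

```python
VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_payload(action_type: str, key: str, observation: dict) -> dict:
    """Return the wrapped POST /step body, or raise ValueError on a bad action."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    # Assumption: items in the observation are dicts with a "key" field.
    known_keys = {item["key"] for item in observation.get("items", [])}
    if key not in known_keys:
        raise ValueError(f"key {key!r} is not in the current observation")
    return {"action": {"type": action_type, "key": key}}
```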

Reset (POST /reset):

{
  "seed": 42,
  "task_id": "easy"
}

Use task_id or task_name with easy | medium | hard. Omit both to sample a task. seed makes generation reproducible.

Response shape (reset & step):

{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}

When done is true, observation.final_score is the episode grader output in [0.0, 1.0].
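
Put together, the reset and step calls form a short episode loop. The sketch below uses only the Python standard library; http_post, run_episode, and the max_steps guard are hypothetical names chosen for illustration, and the transport is passed in as a callable so the loop can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable


def http_post(base_url: str, path: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(
    post: Callable[[str, dict], dict],
    choose_action: Callable[[dict], dict],
    task_id: str = "easy",
    seed: int = 42,
    max_steps: int = 100,
) -> dict:
    """Reset, then step until done; return final_score and the summed step rewards."""
    result = post("/reset", {"seed": seed, "task_id": task_id})
    total_reward = 0.0
    for _ in range(max_steps):  # safety bound in case done is never reported
        if result["done"]:
            break
        action = choose_action(result["observation"])
        result = post("/step", {"action": action})  # wrapped action form
        total_reward += result.get("reward") or 0.0
    return {
        "final_score": result["observation"]["final_score"],
        "total_reward": total_reward,
    }
```

Against a local server, the transport could be bound with `lambda path, body: http_post("http://127.0.0.1:7860", path, body)`.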


Tasks and graders

  • Registry (env/task_graders.py): TASK_AGENT_GRADERS maps easy / medium / hard to distinct callables (same rubric; difficulty comes from env dynamics).
  • Discovery: GET /tasks returns tasks, graders, and grader_registry for automated validation.
  • Episode grader (env/grader.py): evaluate_episode (freshness, unnecessary invalidations, oscillation).
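
One way the three criteria could be folded into a single score in [0.0, 1.0] is a weighted sum. The sketch below is hypothetical: the function name score_episode, the weights, the input format, and the definition of oscillation are all assumptions, not the contents of env/grader.py.

```python
from collections import defaultdict
from typing import Sequence, Tuple

# Each action record: (key, action_type, item_was_stale_at_that_step)
ActionRecord = Tuple[str, str, bool]


def score_episode(
    fresh_fraction: float,
    actions: Sequence[ActionRecord],
    w_fresh: float = 0.6,   # assumed weight for serving fresh data
    w_waste: float = 0.25,  # assumed weight for avoiding needless invalidations
    w_osc: float = 0.15,    # assumed weight for stable per-key behaviour
) -> float:
    """Combine freshness, wasted invalidations, and oscillation into [0.0, 1.0]."""
    n = len(actions)
    # Wasted work: invalidating an item that was still fresh.
    wasted = sum(1 for _, kind, stale in actions if kind == "invalidate" and not stale)
    waste_rate = wasted / n if n else 0.0

    # Oscillation: how often consecutive actions on the same key change type.
    per_key = defaultdict(list)
    for key, kind, _ in actions:
        per_key[key].append(kind)
    flips = sum(
        1
        for kinds in per_key.values()
        for prev, cur in zip(kinds, kinds[1:])
        if prev != cur
    )
    osc_rate = flips / max(1, n - 1)

    raw = w_fresh * fresh_fraction + w_waste * (1 - waste_rate) + w_osc * (1 - osc_rate)
    return min(1.0, max(0.0, raw))
```

Because the weights sum to 1 and each component lies in [0, 1], the clamp only guards against floating-point drift and out-of-range inputs.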

Setup & run

Install (dev):

uv sync --extra dev

Local server:

uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860

Health check:

curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'

Expect 200.

Docker: build the image with docker build -t cache-env . and then run it with the same CMD as in the Dockerfile (uvicorn app:app, port 7860).


Baseline inference (inference.py)

  • Uses OpenEnv HTTP wire format: wrapped action, observation in responses.
  • Reproducibility: EPISODE_SEED (default 42) and TASK_ID (default easy).
  • All three tasks: RUN_ALL_TASKS=1 runs easy, then medium, then hard with the same seed (fast on CPU; well under 20 minutes).
  • Optional LLM path: set HF_TOKEN, API_BASE_URL, MODEL_NAME; otherwise the heuristic policy runs (no API key required).

export ENV_URL='http://127.0.0.1:7860'   # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py

# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
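
For readers wondering what a no-API-key policy might look like, here is a minimal sketch. It is not the heuristic shipped in inference.py: the function name heuristic_action, the age_threshold value, and the item fields "key", "age", and "last_result" as dict entries are assumptions made for illustration.

```python
def heuristic_action(observation: dict, age_threshold: int = 5) -> dict:
    """Pick one action for one key using only observable fields."""
    items = observation.get("items", [])
    if not items:
        raise ValueError("observation contains no items")

    # Prefer items whose last (noisy) result was reported as stale.
    stale = [it for it in items if it.get("last_result") == "stale"]
    if stale:
        target = max(stale, key=lambda it: it.get("age", 0))
        # Old and reportedly stale: invalidate. Young: a soft refresh is cheaper.
        kind = "invalidate" if target.get("age", 0) >= age_threshold else "refresh"
    else:
        # Nothing looks stale, so leave the oldest item alone.
        target = max(items, key=lambda it: it.get("age", 0))
        kind = "keep"
    return {"type": kind, "key": target["key"]}
```

The returned dict is the inner action; it still has to be wrapped as {"action": ...} before being posted to /step.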

Tests (Phase 1 checks)

uv run pytest tests/ -q

Covers: GET /tasks (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode final_score.
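
As an illustration of the JSON-shape check, the helper below validates a reset or step response against the fields documented in this README. It is a sketch and not a copy of anything in tests/; check_response_shape is a hypothetical name.

```python
from typing import List

OBSERVATION_FIELDS = ("items", "step", "task_id", "final_score", "done")


def check_response_shape(resp: dict) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for field in ("observation", "reward", "done"):
        if field not in resp:
            problems.append(f"missing top-level field: {field}")

    obs = resp.get("observation")
    if not isinstance(obs, dict):
        problems.append("observation is not an object")
        return problems

    for field in OBSERVATION_FIELDS:
        if field not in obs:
            problems.append(f"missing observation field: {field}")
    if not isinstance(obs.get("items"), list):
        problems.append("observation.items is not a list")

    if resp.get("done"):
        score = obs.get("final_score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append("final_score must be in [0.0, 1.0] when done is true")
    return problems
```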


Validation (pre-submission)

openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .

Repository layout

| Path | Purpose |
| --- | --- |
| env/models.py | Typed Action / Observation / State |
| env/cache_environment.py | Environment implementation |
| env/grader.py | Step rewards + episode evaluate_episode |
| env/task_graders.py | Three named agent graders (registry) |
| env/tasks.py | Task configs + TASK_MANIFEST |
| env/client.py | Typed WebSocket EnvClient |
| server/app.py | create_fastapi_app + /tasks |
| app.py | Uvicorn entry shim |
| inference.py | Baseline + [START]/[STEP]/[END] logs |
| openenv.yaml | Full OpenEnv manifest |
| tests/ | Phase 1 pytest |

Scoring

  • Per-step reward: Shaped (can be negative mid-episode).
  • final_score: In [0.0, 1.0] when done; combines correctness, unnecessary invalidations, and action stability.

Resource notes

Inference and the env server are lightweight (short episodes, small JSON). Suitable for 2 vCPU / 8 GiB; keep RUN_ALL_TASKS episodes bounded (fixed 10 steps per episode × 3 tasks).