---
title: Cache Env
emoji: 🏢
colorFrom: green
colorTo: pink
sdk: docker
pinned: false
---

# Cache invalidation environment (OpenEnv)

## For judges — what this is

**Problem in one sentence:** Backends cache data for speed, so they must decide **when to invalidate, softly refresh, or leave the cache alone** using **noisy clues** (like real monitoring) rather than the ground truth.

**Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.

**Our approach:** Each episode contains several cache **items** whose staleness parameters (TTL, update rate) are hidden. The API exposes only **observable** fields: `age`, `access_count`, and a noisy `last_result` (hit or stale). At each step the agent picks **one action** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **programmatic grader** sets **`final_score` in [0.0, 1.0]**.
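As a rough illustration of deciding from the observable fields alone, the sketch below picks one action for one key. The item layout (a dict with `key`, `age`, `access_count`, `last_result`), the `"hit"` / `"stale"` string values, the thresholds, and the name `choose_action` are all assumptions made for this example; it is not the policy shipped in `inference.py`.

```python
from typing import Dict, List

# Hypothetical thresholds, chosen only for illustration.
STALE_AGE = 8    # age at which a reported "stale" is trusted enough to invalidate
REFRESH_AGE = 5  # age at which a soft refresh is considered even after a hit


def choose_action(items: List[Dict]) -> Dict[str, str]:
    """Pick one action for one key using only observable fields.

    Assumes each item is a dict with `key`, `age`, `access_count`, and
    `last_result` ("hit" or "stale"); the real observation schema may differ.
    """
    if not items:
        raise ValueError("observation has no items")

    # Rank items by suspicion: stale reports first, then age, then traffic.
    def suspicion(item: Dict) -> tuple:
        return (item["last_result"] == "stale", item["age"], item["access_count"])

    target = max(items, key=suspicion)

    if target["last_result"] == "stale" and target["age"] >= STALE_AGE:
        action_type = "invalidate"  # old and reported stale: strong evidence
    elif target["last_result"] == "stale" or target["age"] >= REFRESH_AGE:
        action_type = "refresh"     # weaker evidence: prefer the softer action
    else:
        action_type = "keep"        # nothing suggests staleness
    return {"type": action_type, "key": target["key"]}
```

Because `last_result` is noisy, a single stale report on a young item only triggers a `refresh` here; invalidation is reserved for items that are both old and reported stale.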

**Tasks:** **easy → medium → hard** — more items and higher volatility; each task registers a dedicated **agent grader** (`env/task_graders.py`) and is listed in `openenv.yaml` and **`GET /tasks`**.

---

## OpenEnv spec compliance

- **Typed models:** `env/models.py` — `CacheAction`, `CacheObservation`, `CacheState` (Pydantic, `openenv.core.env_server` bases).
- **Environment:** `env/cache_environment.py` — `CacheInvalidationEnvironment` implements `reset` / `step` / `state` / `get_metadata`.
- **HTTP server:** `server/app.py` — `create_fastapi_app(...)` from `openenv-core` (singleton env instance for stateful HTTP), plus **`GET /tasks`** for task + grader discovery.
- **Manifest:** `openenv.yaml` — `spec_version`, `tasks` (each with `grader: true`, `grader_callable`, `score_range`), `endpoints`, `app: server.app:app`, `port: 7860`.
- **Client (WebSocket):** `env/client.py` — `CacheInvalidationEnvClient` for typed `EnvClient` usage.
- **Shim:** `app.py` re-exports `app` for `uvicorn app:app`.

Standard routes include **`/reset`**, **`/step`**, **`/state`**, **`/schema`**, **`/metadata`**, **`/health`**, **`/openapi.json`**, **`/mcp`** (OpenEnv default).

---

## Action & observation

**Action (POST `/step` body, OpenEnv wrapped form):**

```json
{
  "action": {
    "type": "invalidate",
    "key": "item_0"
  }
}
```

`type` is one of: `invalidate`, `refresh`, `keep`. `key` must match an item in the current observation.
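A client could build and check this body before sending it. The helper below is a minimal sketch: the names `build_step_body` and `known_keys` are hypothetical, and the checks simply mirror the two rules stated above.

```python
import json
from typing import Iterable, Optional

VALID_ACTION_TYPES = ("invalidate", "refresh", "keep")


def build_step_body(action_type: str, key: str,
                    known_keys: Optional[Iterable[str]] = None) -> str:
    """Return the JSON body for POST /step in the wrapped `action` form."""
    if action_type not in VALID_ACTION_TYPES:
        raise ValueError(f"unknown action type: {action_type!r}")
    # Optionally verify the key against the keys seen in the last observation.
    if known_keys is not None and key not in set(known_keys):
        raise ValueError(f"key {key!r} is not in the current observation")
    return json.dumps({"action": {"type": action_type, "key": key}})
```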

**Reset (POST `/reset`):**

```json
{
  "seed": 42,
  "task_id": "easy"
}
```

Use `task_id` or `task_name` with `easy` | `medium` | `hard`. Omit both to sample a task. `seed` makes generation reproducible.

**Response shape (reset & step):**

```json
{
  "observation": {
    "items": [...],
    "step": 0,
    "task_id": "easy",
    "final_score": null,
    "done": false
  },
  "reward": 0.0,
  "done": false
}
```

When `done` is `true`, `observation.final_score` is the episode grader output in **[0.0, 1.0]**.
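Putting the reset, step, and response shapes together, an episode loop might look like the sketch below. The `post(path, body)` callable is a hypothetical transport (for example, a thin wrapper around an HTTP client pointed at the server), and `run_episode`, `policy`, and `max_steps` are names invented for this example.

```python
from typing import Callable, Dict, Optional

# `post(path, body) -> dict` is an assumed transport that sends a JSON body
# to the given route and returns the decoded JSON reply.
Post = Callable[[str, Dict], Dict]
Policy = Callable[[Dict], Dict]


def run_episode(post: Post, policy: Policy, seed: int = 42,
                task_id: str = "easy", max_steps: int = 100) -> Optional[float]:
    """Reset, step until `done`, and return `observation.final_score`."""
    reply = post("/reset", {"seed": seed, "task_id": task_id})
    for _ in range(max_steps):
        if reply["done"]:
            return reply["observation"]["final_score"]
        # The policy sees only the observation and returns {"type", "key"}.
        action = policy(reply["observation"])
        reply = post("/step", {"action": action})
    # Safety cap reached: report a score only if the episode actually ended.
    return reply["observation"]["final_score"] if reply["done"] else None
```

Injecting the transport keeps the loop independent of any particular HTTP library and makes it easy to exercise with a fake server in tests.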

---

## Tasks and graders

- **Registry:** `env/task_graders.py` — `TASK_AGENT_GRADERS` maps `easy` / `medium` / `hard` to distinct callables (same rubric; difficulty comes from env dynamics).
- **Discovery:** `GET /tasks` returns `tasks`, `graders`, and `grader_registry` for automated validation.
- **Episode grader:** `env/grader.py` — `evaluate_episode` (freshness, unnecessary invalidations, oscillation).
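The registry idea (distinct callables per task, one shared rubric) can be sketched as follows. Only the name `TASK_AGENT_GRADERS` and the task IDs come from this README; the function names, the `episode` dict with a precomputed `score`, and the clamping rubric are placeholders, not the contents of `env/task_graders.py`.

```python
from typing import Callable, Dict

Grader = Callable[[Dict], float]


def _shared_rubric(episode: Dict) -> float:
    """Stand-in rubric: clamp a precomputed `score` into [0.0, 1.0]."""
    return min(1.0, max(0.0, float(episode["score"])))


# One named callable per task, all delegating to the same rubric.
def grade_easy(episode: Dict) -> float:
    return _shared_rubric(episode)


def grade_medium(episode: Dict) -> float:
    return _shared_rubric(episode)


def grade_hard(episode: Dict) -> float:
    return _shared_rubric(episode)


TASK_AGENT_GRADERS: Dict[str, Grader] = {
    "easy": grade_easy,
    "medium": grade_medium,
    "hard": grade_hard,
}


def grade(task_id: str, episode: Dict) -> float:
    """Look up the grader for `task_id` and score the episode."""
    try:
        grader = TASK_AGENT_GRADERS[task_id]
    except KeyError:
        raise ValueError(f"unknown task_id: {task_id!r}") from None
    return grader(episode)
```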

---

## Setup & run

**Install (dev):**

```bash
uv sync --extra dev
```

**Local server:**

```bash
uv run server
# or
uvicorn app:app --host 0.0.0.0 --port 7860
```

**Health check (via `POST /reset`):**

```bash
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d '{}' \
  'http://127.0.0.1:7860/reset'
```

Expect `200`.

**Docker:** build the image with `docker build -t cache-env .`, then start a container from it. The `CMD` in the `Dockerfile` launches `uvicorn app:app` on port **7860**, so publish that port when running the container.

---

## Baseline inference (`inference.py`)

- Uses **OpenEnv HTTP** wire format: wrapped `action`, `observation` in responses.
- **Reproducibility:** `EPISODE_SEED` (default `42`) and `TASK_ID` (default `easy`).
- **All three tasks:** `RUN_ALL_TASKS=1` runs `easy`, then `medium`, then `hard` with the same seed (fast on CPU; well under 20 minutes).
- Optional LLM path: set `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`; otherwise the **heuristic** policy runs (no API key required).

```bash
export ENV_URL='http://127.0.0.1:7860'   # or your Space https://....hf.space
export EPISODE_SEED=42
export TASK_ID=easy
python inference.py

# Phase-1 style: one process, three tasks
RUN_ALL_TASKS=1 python inference.py
```
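The environment variables above could be turned into a run plan roughly as follows. This is a sketch only: `plan_runs` is a hypothetical name, and treating exactly `RUN_ALL_TASKS=1` as the switch is an assumption about how `inference.py` reads it.

```python
import os
from typing import List, Mapping, Tuple

ALL_TASKS = ("easy", "medium", "hard")


def plan_runs(environ: Mapping[str, str] = os.environ) -> List[Tuple[str, int]]:
    """Return (task_id, seed) pairs derived from the documented variables."""
    seed = int(environ.get("EPISODE_SEED", "42"))  # documented default: 42
    if environ.get("RUN_ALL_TASKS") == "1":
        # Same seed for every task, in the documented order.
        return [(task, seed) for task in ALL_TASKS]
    return [(environ.get("TASK_ID", "easy"), seed)]  # documented default: easy
```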

---

## Tests (Phase 1 checks)

```bash
uv run pytest tests/ -q
```

Covers: `GET /tasks` (≥3 tasks with graders), grader outputs in [0,1], OpenEnv reset/step JSON shape, reproducible seed, full episode `final_score`.

---

## Validation (pre-submission)

```bash
openenv validate
./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
docker build .
```

---

## Repository layout

| Path | Purpose |
|------|---------|
| `env/models.py` | Typed Action / Observation / State |
| `env/cache_environment.py` | `Environment` implementation |
| `env/grader.py` | Step rewards + episode `evaluate_episode` |
| `env/task_graders.py` | **Three named agent graders** (registry) |
| `env/tasks.py` | Task configs + `TASK_MANIFEST` |
| `env/client.py` | Typed WebSocket `EnvClient` |
| `server/app.py` | `create_fastapi_app` + `/tasks` |
| `app.py` | Uvicorn entry shim |
| `inference.py` | Baseline + `[START]`/`[STEP]`/`[END]` logs |
| `openenv.yaml` | Full OpenEnv manifest |
| `tests/` | Phase 1 pytest |

---

## Scoring

- **Per-step `reward`:** Shaped (can be negative mid-episode).
- **`final_score`:** In **[0.0, 1.0]** when `done`; combines freshness (correctness), a penalty for unnecessary invalidations, and action stability (low oscillation).
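One way three such ingredients could be folded into a bounded score is a clamped weighted sum. The sketch below is purely illustrative: the weights, the input names, and the function `combine_score` are assumptions, and the actual formula lives in `evaluate_episode` in `env/grader.py`.

```python
def combine_score(freshness: float, unnecessary_invalidations: int,
                  action_flips: int, total_steps: int,
                  w_fresh: float = 0.6, w_waste: float = 0.25,
                  w_stable: float = 0.15) -> float:
    """Combine three episode statistics into a score in [0.0, 1.0].

    `freshness` is assumed to be a fraction in [0, 1]; the two counts are
    normalized by `total_steps`. The weights are hypothetical.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    waste_rate = min(1.0, unnecessary_invalidations / total_steps)
    flip_rate = min(1.0, action_flips / total_steps)
    raw = (w_fresh * freshness
           + w_waste * (1.0 - waste_rate)     # fewer wasted invalidations is better
           + w_stable * (1.0 - flip_rate))    # less oscillation is better
    return min(1.0, max(0.0, raw))            # clamp into the documented range
```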

---

## Resource notes

Inference and the env server are lightweight (short episodes, small JSON payloads) and suitable for **2 vCPU / 8 GiB**. A `RUN_ALL_TASKS` run stays bounded: a fixed 10 steps per episode × 3 tasks, so 30 steps in total.