Space: parvpareek/cache-env


Parv Pareek committed on Apr 5

Commit

6c66cc1

1 Parent(s): 32ec139

update: add readme


Files changed (1)

README.md +54 -257


@@ -7,303 +7,100 @@ sdk: docker
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-# 🧠 Cache Invalidation Environment (OpenEnv)
-## 📌 Overview
-This project implements a **real-world cache invalidation decision environment** using the OpenEnv specification.
-Cache invalidation is a fundamental systems problem: deciding **when to refresh cached data vs reuse it**. Acting too early wastes resources, while acting too late serves stale data.
-This environment simulates that tradeoff under **uncertainty and noisy signals**, allowing evaluation of agent decision-making.
----
-## 🎯 Motivation
-Cache invalidation is widely used in:
-* Distributed systems
-* Web backends
-* CDNs and edge caching
-* Databases
-This environment models a **practical decision problem engineers face daily**, making it useful for evaluating reasoning-based agents.
----
-## 🧩 Environment Design
-### State (Observation)
-Each step returns:
-```json
-{
-  "items": [
-    {
-      "key": "item_0",
-      "age": 5,
-      "access_count": 12,
-      "last_result": "hit"
-    }
-  ],
-  "step": 3,
-  "task_id": "medium"
-}
-```
-#### Field meanings:
-* `age`: time since last refresh
-* `access_count`: usage frequency
-* `last_result`: "hit" or "stale" (noisy signal)
-* `task_id`: difficulty level
----
-### Actions
-Agent must return:
-```json
-{
-  "type": "invalidate | refresh | keep",
-  "key": "item_id"
-}
-```
-#### Action meanings:
-* `invalidate`: reset cache (high cost, correct if stale)
-* `refresh`: partial reset (safe but weaker)
-* `keep`: do nothing (efficient if data is fresh)
----
-### Hidden Dynamics
-The true cache state is **not directly observable**.
-Staleness depends on:
-* base TTL
-* update frequency
-* time since last update
-Observations are **noisy**, requiring inference.
----
-## 🎯 Tasks
-Three tasks with increasing difficulty:
-### 🟢 Easy
-* Few items
-* Low volatility
-* Clear signals
-### 🟡 Medium
-* Moderate noise
-* Conflicting signals
-* Requires reasoning
-### 🔴 Hard
-* High volatility
-* Frequent updates
-* Misleading signals
----
-## 🏆 Reward Function
-Reward is given at every step:
-| Action     | Correct Case | Reward |
-| ---------- | ------------ | ------ |
-| invalidate | stale        | +1.0   |
-| invalidate | fresh        | -0.5   |
-| keep       | fresh        | +0.8   |
-| keep       | stale        | -0.6   |
-| refresh    | stale        | +0.6   |
-| refresh    | fresh        | +0.2   |
-This provides:
-* dense feedback
-* partial credit
-* penalty for poor decisions
 ---
-## 📊 Episode
-* Fixed length: 10 steps
-* Final score: average reward (normalized to [0,1])
----
-## 🤖 Baseline Agent
-The baseline agent uses:
-* heuristic decision policy
-* short-term memory (to avoid repeated mistakes)
-* optional LLM reasoning
-### Example score
-| Task   | Score    |
-| ------ | -------- |
-| Easy   | ~4.5–6.5 |
-| Medium | ~3.5–5.5 |
-| Hard   | ~2.5–4.5 |
----
-## 🚀 Running the Environment
-### 1. Local
-```bash
-pip install -r requirements.txt
-uvicorn app:app --reload
-```
----
-### 2. API Endpoints
-#### Reset
-```bash
-curl -X POST http://localhost:8000/reset
-```
-#### Step
 ```bash
-curl -X POST http://localhost:8000/step \
-  -H "Content-Type: application/json" \
-  -d '{"type":"keep","key":"item_0"}'
 ```
-#### State
-```bash
-curl http://localhost:8000/state
-```
 ---
-## 🤗 Hugging Face Deployment
-Live endpoint:
-```
-https://parvpareek-cache-env.hf.space
-```
-Test:
 ```bash
-curl -X POST https://parvpareek-cache-env.hf.space/reset
 ```
 ---
-## 🐳 Docker
-```bash
-docker build -t cache-env .
-docker run -p 7860:7860 cache-env
-```
----
-## ⚙️ Environment Variables
-Required for inference:
 ```bash
-API_BASE_URL=<llm_endpoint>
-MODEL_NAME=<model_name>
-HF_TOKEN=<api_key>
 ```
 ---
-## 📁 Project Structure
-```
-.
-├── app.py
-├── env/
-│   ├── core.py
-│   ├── generator.py
-│   ├── grader.py
-│   ├── models.py
-│   └── tasks.py
-├── inference.py
-├── openenv.yaml
-├── Dockerfile
-└── README.md
-```
 ---
-## ✅ OpenEnv Compliance
-* ✔ step / reset / state API
-* ✔ typed models (Pydantic)
-* ✔ openenv.yaml included
-* ✔ 3 tasks with graders
-* ✔ reward ∈ [0,1]
-* ✔ deterministic evaluation
 ---
-## 💡 Key Insight
-This environment models:
-> Decision-making under uncertainty with partial observability
-Agents must infer:
-* when data is stale
-* when to act vs wait
----
-## 🧠 Why This Matters
-Cache invalidation is considered one of the hardest problems in computer science.
-This environment provides:
-* a controlled simulation
-* measurable evaluation
-* realistic constraints
----
-## 📌 Summary
-* Real-world system problem ✔
-* Multi-step decision making ✔
-* Partial observability ✔
-* Non-trivial reward shaping ✔
----
-## 👤 Author
-Built for OpenEnv evaluation challenge.

 pinned: false
 ---
+# Cache invalidation environment (OpenEnv)
+## For judges — what this is
+**Problem in one sentence:** Backends cache data to go fast; they must decide **when to invalidate, softly refresh, or leave cache alone** using **noisy clues** (like real monitoring), not the ground truth.
+**Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
+**Our approach:** We simulate several cache **items** per episode. Each item has hidden staleness dynamics (TTL, update rate). The API only exposes **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks an action **per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **grader** produces a **final score in [0, 1]** from correctness, wasted invalidations, and stability.
+**Tasks:** Three difficulties — **easy**, **medium**, **hard** — differ by number of items and how volatile hidden state is, so the same policy can be compared across noise levels.
 ---
+## API (OpenEnv-style HTTP)
+| Method | Path | Role |
+|--------|------|------|
+| POST | `/reset` | New episode; returns `state` and `task_id` |
+| POST | `/step` | JSON body `{"type":"keep\|refresh\|invalidate","key":"item_0"}`; returns `state`, `reward`, `done`, optional `final_score` when episode ends |
+| GET | `/state` | Current observation |
+**Deployed Space (example):** `https://parvpareek-cache-env.hf.space` — ping with:
 ```bash
+curl -s -o /dev/null -w '%{http_code}\n' -X POST \
+  -H 'Content-Type: application/json' -d '{}' \
+  'https://parvpareek-cache-env.hf.space/reset'
 ```
+Expect `200`.
+**Local run:** `pip install -r requirements.txt` then `uvicorn app:app --host 0.0.0.0 --port 7860` (or use the Dockerfile).
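The endpoint table above can be exercised from Python with only the standard library. The following is a minimal sketch, not code from this repository: `build_action`, `post_json`, and `run_episode` are hypothetical helper names, and the assumption that `state` contains an `items` list whose entries have a `key` field comes from the observation example in the previous README, so the live response shape may differ.

```python
import json
import urllib.request

# Action types listed in the /step row of the API table.
VALID_ACTIONS = {"keep", "refresh", "invalidate"}


def build_action(action_type: str, key: str) -> dict:
    """Build the JSON body for POST /step, rejecting unknown action types."""
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action type: {action_type!r}")
    return {"type": action_type, "key": key}


def post_json(base_url: str, path: str, payload: dict) -> dict:
    """POST a JSON payload to base_url + path and decode the JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def run_episode(base_url: str, max_steps: int = 50) -> dict:
    """Play one episode with a trivial always-keep policy (illustration only)."""
    reply = post_json(base_url, "/reset", {})
    state = reply["state"]
    rewards = []
    for _ in range(max_steps):
        # Assumption: state["items"][i]["key"] exists, as in the older README.
        key = state["items"][0]["key"]
        reply = post_json(base_url, "/step", build_action("keep", key))
        rewards.append(reply["reward"])
        state = reply["state"]
        if reply["done"]:
            break
    # final_score is optional and only present once the episode has ended.
    return {"rewards": rewards, "final_score": reply.get("final_score")}
```

Calling `run_episode("http://localhost:7860")` against a local server started with the `uvicorn` command above should return the per-step rewards and, if the episode finished within `max_steps`, the `final_score`. The always-keep policy is only a placeholder for a real agent.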
 ---
+## Baseline inference (`inference.py`)
+- Uses the **OpenAI Python client** with **`API_BASE_URL`**, **`MODEL_NAME`**, and **`HF_TOKEN`** (set as environment variables or in a local `.env` loaded by `inference.py`; never commit tokens).
+- Talks to the **Space URL** above (override with `ENV_URL` if needed).
+- Prints exactly **`[START]`**, one **`[STEP]`** per env step, and **`[END]`** with `score` and `rewards` as required by the challenge spec.
+Run:
 ```bash
+export API_BASE_URL='https://router.huggingface.co/v1'
+export MODEL_NAME='<model your account can call>'
+export HF_TOKEN='hf_...'
+python inference.py
 ```
 ---
+## Validation (pre-submission)
+From the repo root:
 ```bash
+openenv validate
+./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
+docker build .
 ```
 ---
+## Repository layout (high level)
+| Path | Purpose |
+|------|---------|
+| `app.py` | FastAPI app: `/reset`, `/step`, `/state` |
+| `env/` | Environment logic, tasks, grading, generation |
+| `openenv.yaml` | OpenEnv metadata |
+| `inference.py` | Baseline agent + structured logs |
+| `Dockerfile` | Space / CI image |
+| `pyproject.toml`, `uv.lock`, `server/app.py` | `openenv validate` / multi-mode layout |
 ---
+## Scoring (short)
+- **Per-step reward:** Shaped table (e.g. invalidate when stale is good; invalidate when fresh is penalized). Values can be negative in the middle of an episode.
+- **Episode `final_score` (when `done`):** Normalized grader in **[0, 1]** combining decision quality, unnecessary invalidations, and oscillation.
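To make the shaped per-step table concrete, here is a small Python sketch that encodes the reward values listed in the previous version of this README (the table removed in this commit). It is an illustration under assumptions: the current implementation in `env/` may use different values, the names `STEP_REWARDS`, `step_reward`, and `mean_reward` are hypothetical, and the episode grader (which also weighs unnecessary invalidations and oscillation) is not reproduced here.

```python
# (action, item_is_stale) -> reward, copied from the earlier README's table.
# Assumption: these values may no longer match the code in env/.
STEP_REWARDS = {
    ("invalidate", True): 1.0,    # correct: stale data was dropped
    ("invalidate", False): -0.5,  # wasted invalidation of fresh data
    ("keep", False): 0.8,         # correct: fresh data was reused
    ("keep", True): -0.6,         # stale data was served
    ("refresh", True): 0.6,       # partial credit: safe but weaker
    ("refresh", False): 0.2,      # mildly wasteful but harmless
}


def step_reward(action: str, is_stale: bool) -> float:
    """Look up the shaped reward for one action on one item."""
    try:
        return STEP_REWARDS[(action, is_stale)]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None


def mean_reward(decisions: list[tuple[str, bool]]) -> float:
    """Average step reward over a sequence of (action, is_stale) decisions."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(step_reward(a, s) for a, s in decisions) / len(decisions)
```

For example, `mean_reward([("keep", False), ("keep", True)])` averages one correct keep and one stale keep, which shows how a single late decision can cancel most of the credit from a correct one.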
 ---
+## Summary
+| Criterion | Status |
+|-----------|--------|
+| Real-world task (not a toy game) | Cache invalidation under uncertainty |
+| `reset` / `step` / `state` | Implemented |
+| `openenv.yaml` | Present |
+| 3 tasks + grader | `easy` / `medium` / `hard` |
+| Meaningful rewards | Dense step reward + episode score in [0, 1] |
+| Baseline | `inference.py` + OpenAI client + stdout format |
+If anything fails in automated checks, compare your **Space app URL** (`*.hf.space`) and **pushed commit** to what you submit.