Parv Pareek committed on
Commit
6c66cc1
·
1 Parent(s): 32ec139

update: add readme

Files changed (1)
  1. README.md +54 -257
README.md CHANGED
@@ -7,303 +7,100 @@ sdk: docker
7
  pinned: false
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
11
 
 
12
 
13
- # 🧠 Cache Invalidation Environment (OpenEnv)
14
 
15
- ## 📌 Overview
16
 
17
- This project implements a **real-world cache invalidation decision environment** using the OpenEnv specification.
18
 
19
- Cache invalidation is a fundamental systems problem: deciding **when to refresh cached data vs reuse it**. Acting too early wastes resources, while acting too late serves stale data.
20
-
21
- This environment simulates that tradeoff under **uncertainty and noisy signals**, allowing evaluation of agent decision-making.
22
-
23
- ---
24
-
25
- ## 🎯 Motivation
26
-
27
- Cache invalidation is widely used in:
28
-
29
- * Distributed systems
30
- * Web backends
31
- * CDNs and edge caching
32
- * Databases
33
-
34
- This environment models a **practical decision problem engineers face daily**, making it useful for evaluating reasoning-based agents.
35
-
36
- ---
37
-
38
- ## 🧩 Environment Design
39
-
40
- ### State (Observation)
41
-
42
- Each step returns:
43
-
44
- ```json
45
- {
46
- "items": [
47
- {
48
- "key": "item_0",
49
- "age": 5,
50
- "access_count": 12,
51
- "last_result": "hit"
52
- }
53
- ],
54
- "step": 3,
55
- "task_id": "medium"
56
- }
57
- ```
58
-
59
- #### Field meanings:
60
-
61
- * `age`: time since last refresh
62
- * `access_count`: usage frequency
63
- * `last_result`: "hit" or "stale" (noisy signal)
64
- * `task_id`: difficulty level
65
-
66
- ---
67
-
68
- ### Actions
69
-
70
- Agent must return:
71
-
72
- ```json
73
- {
74
- "type": "invalidate | refresh | keep",
75
- "key": "item_id"
76
- }
77
- ```
78
-
79
- #### Action meanings:
80
-
81
- * `invalidate`: reset cache (high cost, correct if stale)
82
- * `refresh`: partial reset (safe but weaker)
83
- * `keep`: do nothing (efficient if data is fresh)
84
-
85
- ---
86
-
87
- ### Hidden Dynamics
88
-
89
- The true cache state is **not directly observable**.
90
-
91
- Staleness depends on:
92
-
93
- * base TTL
94
- * update frequency
95
- * time since last update
96
-
97
- Observations are **noisy**, requiring inference.
98
-
99
- ---
100
-
101
- ## 🎯 Tasks
102
-
103
- Three tasks with increasing difficulty:
104
-
105
- ### 🟢 Easy
106
-
107
- * Few items
108
- * Low volatility
109
- * Clear signals
110
-
111
- ### 🟡 Medium
112
-
113
- * Moderate noise
114
- * Conflicting signals
115
- * Requires reasoning
116
-
117
- ### 🔴 Hard
118
-
119
- * High volatility
120
- * Frequent updates
121
- * Misleading signals
122
-
123
- ---
124
-
125
- ## 🏆 Reward Function
126
-
127
- Reward is given at every step:
128
-
129
- | Action | Correct Case | Reward |
130
- | ---------- | ------------ | ------ |
131
- | invalidate | stale | +1.0 |
132
- | invalidate | fresh | -0.5 |
133
- | keep | fresh | +0.8 |
134
- | keep | stale | -0.6 |
135
- | refresh | stale | +0.6 |
136
- | refresh | fresh | +0.2 |
137
-
138
- This provides:
139
-
140
- * dense feedback
141
- * partial credit
142
- * penalty for poor decisions
143
 
144
  ---
145
 
146
- ## 📊 Episode
147
-
148
- * Fixed length: 10 steps
149
- * Final score: average reward (normalized to [0,1])
150
-
151
- ---
152
 
153
- ## 🤖 Baseline Agent
154
 
155
- The baseline agent uses:
156
-
157
- * heuristic decision policy
158
- * short-term memory (to avoid repeated mistakes)
159
- * optional LLM reasoning
160
-
161
- ### Example score
162
-
163
- | Task | Score |
164
- | ------ | -------- |
165
- | Easy | ~4.5–6.5 |
166
- | Medium | ~3.5–5.5 |
167
- | Hard | ~2.5–4.5 |
168
-
169
- ---
170
-
171
- ## 🚀 Running the Environment
172
-
173
- ### 1. Local
174
-
175
- ```bash
176
- pip install -r requirements.txt
177
- uvicorn app:app --reload
178
- ```
179
-
180
- ---
181
-
182
- ### 2. API Endpoints
183
-
184
- #### Reset
185
-
186
- ```bash
187
- curl -X POST http://localhost:8000/reset
188
- ```
189
-
190
- #### Step
191
 
192
  ```bash
193
- curl -X POST http://localhost:8000/step \
194
- -H "Content-Type: application/json" \
195
- -d '{"type":"keep","key":"item_0"}'
196
  ```
197
 
198
- #### State
199
 
200
- ```bash
201
- curl http://localhost:8000/state
202
- ```
203
 
204
  ---
205
 
206
- ## 🤗 Hugging Face Deployment
207
-
208
- Live endpoint:
209
 
210
- ```
211
- https://parvpareek-cache-env.hf.space
212
- ```
213
 
214
- Test:
215
 
216
  ```bash
217
- curl -X POST https://parvpareek-cache-env.hf.space/reset
218
  ```
219
 
220
  ---
221
 
222
- ## 🐳 Docker
223
 
224
- ```bash
225
- docker build -t cache-env .
226
- docker run -p 7860:7860 cache-env
227
- ```
228
-
229
- ---
230
-
231
- ## ⚙️ Environment Variables
232
-
233
- Required for inference:
234
 
235
  ```bash
236
- API_BASE_URL=<llm_endpoint>
237
- MODEL_NAME=<model_name>
238
- HF_TOKEN=<api_key>
239
  ```
240
 
241
  ---
242
 
243
- ## 📁 Project Structure
244
 
245
- ```
246
- .
247
- ├── app.py
248
- ├── env/
249
- │ ├── core.py
250
- │ ├── generator.py
251
- │ ├── grader.py
252
- │ ├── models.py
253
- │ └── tasks.py
254
- ├── inference.py
255
- ├── openenv.yaml
256
- ├── Dockerfile
257
- └── README.md
258
- ```
259
 
260
  ---
261
 
262
- ## OpenEnv Compliance
263
 
264
- * step / reset / state API
265
- * typed models (Pydantic)
266
- * ✔ openenv.yaml included
267
- * ✔ 3 tasks with graders
268
- * ✔ reward ∈ [0,1]
269
- * ✔ deterministic evaluation
270
 
271
  ---
272
 
273
- ## 💡 Key Insight
274
-
275
- This environment models:
276
-
277
- > Decision-making under uncertainty with partial observability
278
-
279
- Agents must infer:
280
-
281
- * when data is stale
282
- * when to act vs wait
283
-
284
- ---
285
-
286
- ## 🧠 Why This Matters
287
-
288
- Cache invalidation is considered one of the hardest problems in computer science.
289
-
290
- This environment provides:
291
-
292
- * a controlled simulation
293
- * measurable evaluation
294
- * realistic constraints
295
-
296
- ---
297
-
298
- ## 📌 Summary
299
-
300
- * Real-world system problem ✔
301
- * Multi-step decision making ✔
302
- * Partial observability ✔
303
- * Non-trivial reward shaping ✔
304
-
305
- ---
306
 
307
- ## 👤 Author
308
 
309
- Built for OpenEnv evaluation challenge.
 
7
  pinned: false
8
  ---
9
 
10
+ # Cache invalidation environment (OpenEnv)
11
 
12
+ ## For judges — what this is
13
 
14
+ **Problem in one sentence:** Backends cache data to go fast; they must decide **when to invalidate, softly refresh, or leave cache alone** using **noisy clues** (like real monitoring), not the ground truth.
15
 
16
+ **Why it matters:** Cache invalidation is a daily systems tradeoff: act too often and you burn CPU and churn storage; act too late and users see stale data. This env turns that into a **short episode** an agent can be scored on.
17
 
18
+ **Our approach:** We simulate several cache **items** per episode. Each item has hidden staleness dynamics (TTL, update rate). The API only exposes **observable** fields (`age`, `access_count`, `last_result` as hit/stale with noise). The agent picks an action **per step** for one key: `invalidate`, `refresh`, or `keep`. Step rewards give **partial credit**; at episode end a **grader** produces a **final score in [0, 1]** from correctness, wasted invalidations, and stability.
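To make the observation and action shapes concrete, here is a minimal Python sketch. The field names follow the JSON from the earlier README revision (`key`, `age`, `access_count`, `last_result`). The `naive_policy` function and its `age_threshold` parameter are hypothetical and are not the baseline shipped in `inference.py`.

```python
from dataclasses import dataclass
from typing import Literal

ActionType = Literal["invalidate", "refresh", "keep"]


@dataclass(frozen=True)
class Item:
    key: str
    age: int           # time since last refresh
    access_count: int  # usage frequency
    last_result: str   # "hit" or "stale" (noisy signal)


@dataclass(frozen=True)
class Action:
    type: ActionType
    key: str


def naive_policy(item: Item, age_threshold: int = 8) -> Action:
    """Hypothetical threshold rule, for illustration only."""
    looks_stale = item.last_result == "stale"
    is_old = item.age >= age_threshold
    if looks_stale and is_old:
        # Both observable signals agree: pay the full invalidation cost.
        return Action("invalidate", item.key)
    if looks_stale or is_old:
        # Signals conflict: hedge with the softer action.
        return Action("refresh", item.key)
    return Action("keep", item.key)
```

Because `last_result` is noisy, a rule like this will sometimes act on a false alarm, which is exactly the kind of mistake the step rewards and the grader are meant to expose.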
19
 
20
+ **Tasks:** Three difficulty levels (**easy**, **medium**, **hard**) differ in the number of items and in how volatile the hidden state is, so the same policy can be compared across noise levels.
21
 
22
  ---
23
 
24
+ ## API (OpenEnv-style HTTP)
25
 
26
+ | Method | Path | Role |
27
+ |--------|------|------|
28
+ | POST | `/reset` | New episode; returns `state` and `task_id` |
29
+ | POST | `/step` | JSON body `{"type":"keep\|refresh\|invalidate","key":"item_0"}`; returns `state`, `reward`, `done`, optional `final_score` when episode ends |
30
+ | GET | `/state` | Current observation |
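The endpoints in the table can be exercised from Python with only the standard library. This is a sketch under assumptions: the helper names (`build_step_payload`, `post_json`, `run_one_step`) are hypothetical, and the response is assumed to be a JSON object carrying the keys listed in the table (`state`, `reward`, `done`).

```python
import json
import urllib.request

VALID_ACTIONS = ("keep", "refresh", "invalidate")


def build_step_payload(action_type: str, key: str) -> dict:
    """Build the JSON body for POST /step, rejecting unknown action types."""
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action type: {action_type!r}")
    return {"type": action_type, "key": key}


def post_json(base_url: str, path: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_one_step(base_url: str, key: str = "item_0") -> dict:
    """Start a new episode, then send a single `keep` action for one key."""
    post_json(base_url, "/reset", {})
    return post_json(base_url, "/step", build_step_payload("keep", key))
```

For a local server started on port 7860, `run_one_step("http://localhost:7860")` would issue one `/reset` followed by one `/step`.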
31
 
32
+ **Deployed Space (example):** `https://parvpareek-cache-env.hf.space`. To check that it is up, send a `POST /reset`:
33
 
34
  ```bash
35
+ curl -s -o /dev/null -w '%{http_code}\n' -X POST \
36
+ -H 'Content-Type: application/json' -d '{}' \
37
+ 'https://parvpareek-cache-env.hf.space/reset'
38
  ```
39
 
40
+ Expect `200`.
41
 
42
+ **Local run:** `pip install -r requirements.txt` then `uvicorn app:app --host 0.0.0.0 --port 7860` (or use the Dockerfile).
43
 
44
  ---
45
 
46
+ ## Baseline inference (`inference.py`)
47
 
48
+ - Uses the **OpenAI Python client** with **`API_BASE_URL`**, **`MODEL_NAME`**, and **`HF_TOKEN`** (set as environment variables or in a local `.env` loaded by `inference.py`; never commit tokens).
49
+ - Talks to the **Space URL** above (override with `ENV_URL` if needed).
50
+ - Prints exactly **`[START]`**, one **`[STEP]`** per env step, and **`[END]`** with `score` and `rewards` as required by the challenge spec.
51
 
52
+ Run:
53
 
54
  ```bash
55
+ export API_BASE_URL='https://router.huggingface.co/v1'
56
+ export MODEL_NAME='<model your account can call>'
57
+ export HF_TOKEN='hf_...'
58
+ python inference.py
59
  ```
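A script that depends on these variables can fail early with a clear message when one is missing. The sketch below is an assumption about how such a check might look, not a copy of `inference.py`; `load_config`, `REQUIRED_VARS`, and `DEFAULT_ENV_URL` are hypothetical names, and the default for `ENV_URL` is the Space URL given earlier.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_ENV_URL = "https://parvpareek-cache-env.hf.space"


def load_config(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the required settings, raising if any is unset or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # ENV_URL is optional and falls back to the deployed Space.
    config["ENV_URL"] = env.get("ENV_URL", DEFAULT_ENV_URL)
    return config
```

Passing a mapping explicitly keeps the function easy to test without touching the real process environment.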
60
 
61
  ---
62
 
63
+ ## Validation (pre-submission)
64
 
65
+ From the repo root:
66
 
67
  ```bash
68
+ openenv validate
69
+ ./validate-submission.sh 'https://YOUR-SPACE.hf.space' .
70
+ docker build .
71
  ```
72
 
73
  ---
74
 
75
+ ## Repository layout (high level)
76
 
77
+ | Path | Purpose |
78
+ |------|---------|
79
+ | `app.py` | FastAPI app: `/reset`, `/step`, `/state` |
80
+ | `env/` | Environment logic, tasks, grading, generation |
81
+ | `openenv.yaml` | OpenEnv metadata |
82
+ | `inference.py` | Baseline agent + structured logs |
83
+ | `Dockerfile` | Space / CI image |
84
+ | `pyproject.toml`, `uv.lock`, `server/app.py` | `openenv validate` / multi-mode layout |
85
 
86
  ---
87
 
88
+ ## Scoring (short)
89
 
90
+ - **Per-step reward:** Shaped table (e.g. invalidate when stale is good; invalidate when fresh is penalized). Values can be negative in the middle of an episode.
91
+ - **Episode `final_score` (when `done`):** Normalized grader in **[0, 1]** combining decision quality, unnecessary invalidations, and oscillation.
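As an illustration of the shaped per-step reward, the sketch below encodes the table from the earlier README revision as a lookup. Those values may not match the current code. The `normalized_mean` helper is a hypothetical min-max rescaling shown only to explain how negative step rewards can still map into [0, 1]; it is not the grader, which also accounts for unnecessary invalidations and oscillation.

```python
from typing import Sequence

# (action, true hidden state) -> reward, as listed in the earlier README revision.
STEP_REWARDS = {
    ("invalidate", "stale"): 1.0,
    ("invalidate", "fresh"): -0.5,
    ("keep", "fresh"): 0.8,
    ("keep", "stale"): -0.6,
    ("refresh", "stale"): 0.6,
    ("refresh", "fresh"): 0.2,
}


def step_reward(action: str, true_state: str) -> float:
    """Look up the shaped reward for one decision."""
    return STEP_REWARDS[(action, true_state)]


def normalized_mean(rewards: Sequence[float]) -> float:
    """Hypothetical: rescale the mean step reward into [0, 1] by min-max."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    lowest = min(STEP_REWARDS.values())
    highest = max(STEP_REWARDS.values())
    mean = sum(rewards) / len(rewards)
    return (mean - lowest) / (highest - lowest)
```

Under this table `refresh` never scores below zero, which is why it acts as the safe but weaker choice when the signals disagree.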
92
 
93
  ---
94
 
95
+ ## Summary
96
 
97
+ | Criterion | Status |
98
+ |-----------|--------|
99
+ | Real-world task (not a toy game) | Cache invalidation under uncertainty |
100
+ | `reset` / `step` / `state` | Implemented |
101
+ | `openenv.yaml` | Present |
102
+ | 3 tasks + grader | `easy` / `medium` / `hard` |
103
+ | Meaningful rewards | Dense step reward + episode score in [0, 1] |
104
+ | Baseline | `inference.py` + OpenAI client + stdout format |
105
 
106
+ If an automated check fails, confirm that the **Space app URL** (`*.hf.space`) and the **pushed commit** match what you submitted.