immortalindeed committed on
Commit
cff7056
·
1 Parent(s): cfda61e

docs: clean up README for public hackathon submission (hide internal scoring formulas)

Files changed (1)
  1. README.md +147 -268
README.md CHANGED
@@ -7,41 +7,44 @@ sdk: docker
7
  app_port: 7860
8
  ---
9
 
10
- # 🛠️ EntropyEnv: Multi-Agent Dev Tools Environment
11
 
12
  > A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
13
  > Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
14
 
15
  ---
16
 
17
  ## 💡 Why This Environment?
18
 
19
- Most existing RL benchmarks test agents on **static, single-turn tasks**: classify this image, answer this question. But real developer workflows are **multi-turn, iterative, and require revision**:
20
 
21
  - A security reviewer doesn't just find a bug; they **identify → propose a fix → revise after feedback**
22
  - A DevOps engineer doesn't just flag outdated packages; they **resolve version conflicts across an entire dependency graph**
23
  - A clinical coordinator doesn't just spot missing steps; they **prioritize by urgency and plan a dependency-safe recovery**
24
 
25
- **No existing RL environment tests agents on this full identify → act → revise cycle.** This environment fills that gap by providing 9 tasks across 3 real-world domains with progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
26
-
27
- **Who would use this?** Teams training AI coding assistants (code review bots), dependency management agents (Dependabot-like systems), and clinical decision support systems.
28
 
29
  ---
30
 
31
  ## 🎯 What Is This?
32
 
33
- ![Gradio UI Run History](docs/screenshot.png)
34
 
35
- This is a **training gym for AI agents**, not the agent itself.
36
- Think of it like a driving test course: you build the course, and different AI "drivers" take the test.
37
 
38
- An AI agent connects to this environment via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** (0.0 – 1.0) based on how good the answer is.
39
 
40
  ```
41
  POST /reset
42
- AI Agent ────────────────────────► This Environment
43
  │
44
- ├── Picks a task case
45
  ├── Returns: observation (the problem)
46
  ◄──────────────────────── │
47
  │
@@ -60,331 +63,207 @@ AI Agent ───────────────────────
60
 
61
  ### 🔒 Domain 1: MCP Security Auditing
62
 
63
- Agents must identify vulnerabilities in code snippets, propose fixes, and iteratively revise based on reviewer feedback.
64
 
65
- | Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
66
- |------|-----------|---------|-----------|-----------|---------|
67
- | `sec_easy` | Easy | `single` | 4 | 0.80 | `identify_vulnerability` |
68
- | `sec_medium` | Medium | `multi` | 6 | 0.75 | `identify` → `propose_fix` → `revise_fix` |
69
- | `sec_hard` | Hard | `adversarial` | 8 | 0.70 | `identify` → `propose_fix` → `revise_fix` (reviewer) |
70
 
71
- **Dataset:** 13 ground-truth cases covering SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE.
72
 
73
  ### 📦 Domain 2: PyTorch Migration Time-Machine
74
 
75
- Agents must detect deprecated APIs, resolve version conflicts, and fix `torch.compile` graph-break patterns.
76
 
77
- | Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
78
- |------|-----------|---------|-----------|-----------|---------|
79
- | `dep_easy` | Easy | `flag` | 4 | 0.80 | `flag_outdated` |
80
- | `dep_medium` | Medium | `resolve` | 6 | 0.75 | `resolve_conflict` |
81
- | `dep_hard` | Hard | `migrate` | 8 | 0.70 | `migrate_api` / `validate_tree` |
82
 
83
- **Dataset:** 13 ground-truth cases covering Variable, cuda(), DataParallel, ONNX export, torch.compile graph-breaks.
84
 
85
  ### 🏥 Domain 3: Clinical Workflow Chaos Simulator
86
 
87
- Agents must detect missing steps in hospital workflows, rank them by priority, and plan dependency-ordered recovery sequences.
88
 
89
- | Task | Difficulty | Max Steps | Threshold | Actions |
90
- |------|-----------|-----------|-----------|---------|
91
- | `cli_easy` | Easy | 4 | 0.80 | `detect_gap` |
92
- | `cli_medium` | Medium | 6 | 0.75 | `detect_gap` → `rank_issues` |
93
- | `cli_hard` | Hard | 6 | 0.70 | `detect_gap` → `rank_issues` → `order_steps` |
94
 
95
- **Dataset:** 13 ground-truth cases covering surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code.
96
 
97
  ---
98
 
99
- ## 📊 Observation & Action Spaces
100
-
101
- ### Observation Space
102
-
103
- Every observation includes these core fields:
104
-
105
- | Field | Type | Description |
106
- |-------|------|-------------|
107
- | `task_type` | `str` | Domain: `security`, `dependency`, or `clinical` |
108
- | `task_id` | `str` | Task identifier (e.g., `sec_easy`) |
109
- | `task_subtype` | `str` | Variant: `single`, `multi`, `flag`, `resolve`, `migrate` |
110
- | `task_description` | `str` | Human-readable problem description |
111
- | `available_actions` | `list[dict]` | Valid actions with parameter specs |
112
- | `turn` | `int` | Current step number |
113
- | `done` | `bool` | Whether episode has ended |
114
 
115
- Domain-specific fields are added (e.g., `code_snippet` for security, `compatibility_matrix` for dependency, `events` and `dependency_graph` for clinical).
116
-
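As a rough illustration of the table above, the core observation fields could be modelled as a typed dictionary. The names `CoreObservation` and `missing_core_fields` are hypothetical and do not come from the repository; only the field names and types are taken from the table.

```python
from typing import Any, TypedDict


class CoreObservation(TypedDict):
    """Core observation fields from the table (hypothetical type name)."""
    task_type: str            # "security", "dependency", or "clinical"
    task_id: str              # e.g. "sec_easy"
    task_subtype: str         # e.g. "single", "multi", "flag", "resolve", "migrate"
    task_description: str
    available_actions: list[dict[str, Any]]
    turn: int
    done: bool


def missing_core_fields(observation: dict[str, Any]) -> list[str]:
    """Return the core field names that are absent from an observation dict."""
    return [name for name in CoreObservation.__annotations__ if name not in observation]
```

A client could call `missing_core_fields(observation)` after `/reset` to check that the server response carries every core field before reading the domain-specific ones.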
117
- ### Action Space
118
-
119
- Actions are JSON objects with `action_type` and domain-specific parameters:
120
-
121
- ```json
122
- {"action_type": "identify_vulnerability", "vuln_type": "sql_injection", "cvss_score": 8.5, "severity": "critical", "affected_line": 3}
123
- {"action_type": "propose_fix", "fix_code": "db.execute(query, (param,))", "explanation": "Use parameterized queries"}
124
- {"action_type": "flag_outdated", "packages": {"torch": "1.9.0"}, "deprecated_api": "torch.autograd.Variable", "replacement": "plain tensor"}
125
- {"action_type": "detect_gap", "missing_steps": ["pre_op_consent"], "risk_level": "critical"}
126
- ```
127
 
128
  ---
129
 
130
- ## 📊 Scoring System
131
 
132
- ### Two-Layer Grading Architecture
133
-
134
- **Layer 1: `base_grader.py`** (universal reward pipeline applied to ALL domains):
135
-
136
- ```
137
- reward = safe_score(correctness + repetition_penalty + harmful_penalty + efficiency_bonus)
138
- ```
139
-
140
- | Component | Formula | Range |
141
- |-----------|---------|-------|
142
- | `compute_correctness()` | Domain-specific (see below) | 0.0 – 1.0 |
143
- | `repetition_penalty` | −0.15 × count(same action in last 3 turns) | −0.45 – 0.0 |
144
- | `harmful_output_penalty` | −0.30 if forbidden pattern detected | −0.30 – 0.0 |
145
- | `efficiency_bonus` | +0.10 if `correctness >= 0.8` and early finish | 0.0 – 0.10 |
146
- | `safe_score()` | `clamp(score, 0.0, 1.0)` | 0.0 – 1.0 |
147
-
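The Layer 1 formula and its component table can be sketched as below. This is a minimal reading of the table, not the contents of `base_grader.py`: the function `base_reward` and its parameters are hypothetical, "same action" is assumed to mean the same `action_type`, and harmful-pattern detection and early finish are passed in as booleans because the README does not say how they are computed.

```python
def safe_score(score: float) -> float:
    """Clamp a raw score into [0.0, 1.0], as the table describes."""
    return max(0.0, min(1.0, score))


def base_reward(correctness: float, action_type: str, recent_actions: list[str],
                harmful: bool, finished_early: bool) -> float:
    """Combine correctness with the penalties and bonus from the table (assumed wiring)."""
    # -0.15 for each occurrence of the same action type in the last 3 turns
    repeats = recent_actions[-3:].count(action_type)
    repetition_penalty = -0.15 * repeats
    # -0.30 when a forbidden pattern was detected
    harmful_penalty = -0.30 if harmful else 0.0
    # +0.10 only for a high-correctness answer that also finishes early
    efficiency_bonus = 0.10 if (correctness >= 0.8 and finished_early) else 0.0
    return safe_score(correctness + repetition_penalty + harmful_penalty + efficiency_bonus)
```

Under these assumptions, a correctness of 0.5 with two repeats in the last three turns comes out near 0.2, and any sum below zero is clamped to 0.0.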
148
- **Layer 2: Domain-specific graders:**
149
-
150
- #### Security Grader
151
- | Action | Component | Weight |
152
- |--------|-----------|--------|
153
- | `identify_vulnerability` | vuln_type match | ×0.45 |
154
- | `identify_vulnerability` | CVSS in range (partial: ±3.0) | ×0.30 |
155
- | `identify_vulnerability` | severity match (adjacent: ×0.40) | ×0.25 |
156
- | `propose_fix` | token coverage + identifier preserved (floor: 0.25) | up to 1.15 |
157
- | `revise_fix` | feedback keyword coverage − regression (floor: 0.20) | 0.0 – 1.0 |
158
-
159
- #### Dependency Grader
160
- | Action | Formula |
161
- |--------|---------|
162
- | `flag_outdated` | F1 × 0.55 + deprecated_api_match × 0.45 |
163
- | `resolve_conflict` | valid_pkgs / conflict_count + tree_bonus(0.15) − downgrade(0.10) |
164
- | `migrate_api` | order_score × 0.30 + completeness × 0.40 + fix_quality × 0.30 |
165
-
166
- #### Clinical Grader
167
- | Action | Formula |
168
- |--------|---------|
169
- | `detect_gap` | F1(predicted, expected) × 0.65 + risk_match × 0.35 |
170
- | `rank_issues` | completeness × 0.40 + NDCG@k × 0.60 |
171
- | `order_steps` | order_violations × 0.40 + completeness × 0.40 + efficiency × 0.20 |
172
-
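To make the `detect_gap` row concrete, here is a small sketch of a set-based F1 combined with a risk match. The function names are hypothetical, `risk_match` is assumed to be an exact string comparison (the README does not say whether adjacent risk levels earn partial credit), and `allergy_check` in the usage note is an invented step ID.

```python
def f1_score(predicted: set[str], expected: set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    overlap = len(predicted & expected)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def detect_gap_score(predicted_steps: list[str], expected_steps: list[str],
                     predicted_risk: str, expected_risk: str) -> float:
    """F1(predicted, expected) x 0.65 + risk_match x 0.35, per the table row."""
    risk_match = 1.0 if predicted_risk == expected_risk else 0.0
    f1 = f1_score(set(predicted_steps), set(expected_steps))
    return 0.65 * f1 + 0.35 * risk_match
```

For example, predicting only `pre_op_consent` when the expected set is `pre_op_consent` plus `allergy_check`, with the correct risk level, gives precision 1.0, recall 0.5, F1 of 2/3, and a score of roughly 0.78 rather than zero.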
173
- ### GRPO Training Signal Quality
174
-
175
- This environment is specifically designed for **Group Relative Policy Optimization**:
176
-
177
- - **Smooth reward ramp**: Scores transition smoothly from 0.0 → 1.0, never binary
178
- - **Partial credit everywhere**: F1 scoring, NDCG ranking, adjacent-severity credit
179
- - **Progressive penalty learning**: Schema penalty (−0.20), repetition (−0.15), harmful (−0.30)
180
- - **Efficiency bonus**: Agents learn to solve faster by finishing early
181
- - **Floor scores**: Valid workflow attempts always get minimum credit (0.20–0.25)
182
 
183
- ---
184
 
185
- ## 🔍 Validation (3 Stages)
186
 
187
- Every action goes through 3-stage validation before reaching the grader:
188
 
189
- 1. **Schema**: Required fields present? Correct types? (Auto-casts `"8.5"` → `8.5`)
190
- 2. **Domain**: Is `vuln_type` in the valid set? Is `cvss_score` in [0, 10]?
191
- 3. **Consistency**: Is `revise_fix` called after `propose_fix`? No identical repeats?
192
 
193
- If validation fails, the agent gets a **rich feedback observation** (not just 0.0):
194
- ```json
195
- {
196
- "validation_failed": true,
197
- "error_type": "domain_error",
198
- "message": "cvss_score 12.5 out of range",
199
- "hint": "cvss_score must be a float between 0.0 and 10.0",
200
- "available_actions": ["identify_vulnerability", "propose_fix", "revise_fix"]
201
  }
202
- ```
203
-
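The three stages and the feedback object above could be wired together roughly as follows. This is a sketch under stated assumptions, not the code in `validator.py`: the function names are hypothetical, the `schema_error` and `consistency_error` labels are invented (only `domain_error` appears in the example), and the set of valid `vuln_type` strings is an assumed subset.

```python
from typing import Optional

# Assumed subset of identifiers; only "sql_injection" appears in the README examples.
VALID_VULN_TYPES = {"sql_injection", "xss", "idor", "ssrf", "xxe"}


def _fail(error_type: str, message: str, hint: str) -> dict:
    """Build a feedback observation shaped like the JSON example above."""
    return {"validation_failed": True, "error_type": error_type,
            "message": message, "hint": hint}


def validate_identify(action: dict, previous_actions: list) -> Optional[dict]:
    """Return None when the action passes all three stages, else a feedback dict."""
    # Stage 1 (schema): field must be present and castable to float ("8.5" -> 8.5).
    if "cvss_score" not in action:
        return _fail("schema_error", "cvss_score is missing", "include a numeric cvss_score")
    try:
        cvss = float(action["cvss_score"])
    except (TypeError, ValueError):
        return _fail("schema_error", "cvss_score is not numeric", "cvss_score must be a float")
    # Stage 2 (domain): values must fall inside the allowed set and range.
    if action.get("vuln_type") not in VALID_VULN_TYPES:
        return _fail("domain_error", "unknown vuln_type", "choose a supported vuln_type")
    if not 0.0 <= cvss <= 10.0:
        return _fail("domain_error", f"cvss_score {cvss} out of range",
                     "cvss_score must be a float between 0.0 and 10.0")
    # Stage 3 (consistency): reject an exact repeat of the previous action.
    if previous_actions and previous_actions[-1] == action:
        return _fail("consistency_error", "identical to previous action", "change the answer")
    return None
```

Returning a structured dict instead of raising keeps the failure inside the observation channel, which matches the README's point that the agent receives a hint rather than a bare 0.0.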
204
- ---
205
-
206
- ## 🏛️ Architecture
207
 
208
- ```
209
- project-root/
210
- ├── inference.py # Baseline agent (OpenAI-compatible, spec-compliant logs)
211
- ├── openenv.yaml # OpenEnv manifest (9 tasks declared)
212
- ├── pyproject.toml # Python package config with openenv-core dependency
213
- ├── Dockerfile # Docker build for HF Spaces (port 7860)
214
- ├── server/
215
- │   ├── app.py # FastAPI endpoints: /, /reset, /step, /state, /debug
216
- │   ├── router.py # Central dispatcher: observations, done conditions, score_details
217
- │   ├── session.py # In-memory session state management
218
- │   ├── benchmark_store.py # Persistent JSON results store (survives restarts)
219
- │   ├── demo_agent.py # Rule-based demo agent for Gradio UI
220
- │   ├── web_ui.py # Gradio UI with task runner and history
221
- │   ├── debug_panel.html # Interactive HTML debug panel
222
- │   ├── validation/
223
- │   │   └── validator.py # 3-stage validation: Schema → Domain → Consistency
224
- │   ├── graders/
225
- │   │   ├── base_grader.py # safe_score, grade_dynamic, penalties, bonuses
226
- │   │   ├── security_grader.py # Vuln detection, fix quality, feedback coverage
227
- │   │   ├── dependency_grader.py # F1 scoring, version checking, graph ordering
228
- │   │   └── clinical_grader.py # F1, NDCG ranking, dependency-violation counting
229
- │   └── datasets/
230
- │       ├── security_cases.py # 13 cases: SQL injection, XSS, IDOR, SSRF, XXE, etc.
231
- │       ├── dependency_cases.py # 13 cases: Variable, cuda(), DataParallel, graph-breaks
232
- │       └── clinical_cases.py # 13 cases: surgery prep, ER triage, chemo, cardiac, transplant
233
- └── results/
234
-     └── run_history.json # Persistent benchmark results (auto-created)
235
  ```
236
 
237
  ---
238
 
239
- ## 📡 API Endpoints
240
 
241
- | Endpoint | Purpose | Details |
242
- |--------|------|-------------|
243
- | `GET /` | Health check | Returns status, task list, spec version |
244
- | `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
245
- | `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
246
- | `GET /state` | Query state | `?episode_id=xxx` → `{step_count, done, reward_acc}` |
247
- | `GET /debug` | Debug panel | Interactive HTML benchmark runner |
248
- | `GET /web` | Gradio UI | Full task browser with run history |
249
 
250
- ### Step Response Format
251
-
252
- ```json
253
- {
254
- "episode_id": "uuid-string",
255
- "step_count": 2,
256
- "reward": 0.75,
257
- "done": false,
258
- "observation": {
259
- "task_type": "security",
260
- "task_id": "sec_easy",
261
- "task_subtype": "single",
262
- "task_description": "Identify the SQL injection vulnerability...",
263
- "turn": 1,
264
- "done": false,
265
- "available_actions": [...]
266
- },
267
- "score_details": {
268
- "vuln_type_match": 1.0,
269
- "cvss_in_range": 1.0,
270
- "severity_match": 0.0
271
- }
272
- }
273
- ```
274
 
275
- ---
276
 
277
- ## 🚀 Setup & Running
278
 
279
- ### Prerequisites
280
- - Python 3.10+
281
- - `pip install fastapi uvicorn openai requests packaging gradio python-dotenv`
282
 
283
- ### Running Locally
284
 
285
  ```bash
286
- # 1. Start the environment server
287
- cd multi-agent-dev-tools-env
288
- uvicorn server.app:app --host 0.0.0.0 --port 7860
289
-
290
- # 2. Run baseline inference (in another terminal)
291
  export API_BASE_URL="https://router.huggingface.co/v1"
292
  export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
293
  export HF_TOKEN="your_token_here"
294
  export ENV_URL="http://localhost:7860"
295
- python inference.py
296
- ```
297
-
298
- ### Docker
299
 
300
- ```bash
301
- docker build -t multi-agent-dev-tools-env .
302
- docker run -p 7860:7860 multi-agent-dev-tools-env
303
  ```
304
 
305
  ### Deploy to Hugging Face Spaces
306
 
307
  ```bash
308
  huggingface-cli login
309
- openenv push --repo-id <username>/multi-agent-dev-tools-env
310
  ```
311
 
312
  ---
313
 
314
- ## 📝 Mandatory Log Format
315
-
316
- The `inference.py` emits structured stdout logs matching the spec exactly:
317
 
318
  ```
319
- [START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
320
- [STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
321
- [STEP] step=2 action=propose_fix reward=1.00 done=true error=null
322
- [END] success=true steps=2 score=1.00 rewards=0.85,1.00
323
  ```
324
 
325
- ### Environment Variables (Required)
326
 
327
- | Variable | Description | Example |
328
- |----------|-------------|---------|
329
- | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
330
- | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
331
- | `HF_TOKEN` | API key / HF token | `hf_xxxxx` or `gsk_xxxxx` |
332
- | `ENV_URL` | Environment URL | `http://localhost:7860` |
333
 
334
- ---
335
 
336
- ## 📈 Baseline Scores
337
 
338
- Tested with multiple model families for universal compatibility:
339
 
340
- | Model | Family | Parameters | Average Score |
341
- |-------|--------|------------|---------------|
342
- | Llama 3.3 70B | Meta | 70B | **0.87** |
343
- | Qwen3-32B | Alibaba | 32B | **0.89** |
344
- | DeepSeek V3.2 | DeepSeek | MoE | **0.86** |
345
 
346
- The environment provides smooth reward gradients that enable GRPO training of smaller models (8B+).
347
 
348
  ---
349
 
350
- ## 🔧 Key Design Decisions
351
 
352
- 1. **Data-driven done conditions**: `completion_threshold` and `required_sequence` stored per case
353
- 2. **Universal model compatibility**: Strips `<think>`, `<reasoning>`, `<antThinking>` etc.
354
- 3. **Type-casting validator**: Auto-converts `"8.5"` → `8.5` before rejecting
355
- 4. **Floor scores**: Valid workflow attempts always get minimum credit
356
- 5. **Deterministic case selection**: `hash(episode_id) % len(cases)` for reproducibility
357
- 6. **Compatibility matrix separation**: Prevents context truncation for large observations
358
- 7. **Patch-level fuzzy version matching**: `2.1.1` matches `2.1.0` by major.minor
359
- 8. **Hallucination filter**: `_score_rank` filters step IDs not in `available_steps`
360
- 9. **Persistent results**: `benchmark_store.py` writes to disk, survives restarts
361
- 10. **Robust dependency fallback**: Works without `packaging` module via manual version parsing
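Two of the decisions above (deterministic case selection and major.minor version matching) can be sketched as follows. The function names are hypothetical. One caveat worth stating: Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, so this sketch swaps in SHA-256 to stay stable across restarts. Whether the repository uses the built-in or its own hash is not shown in this diff.

```python
import hashlib


def pick_case_index(episode_id: str, num_cases: int) -> int:
    """Map an episode ID to a stable case index (SHA-256 instead of built-in hash)."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cases


def same_major_minor(a: str, b: str) -> bool:
    """Treat versions as equal when major.minor match, ignoring the patch level."""
    return a.split(".")[:2] == b.split(".")[:2]
```

With 13 cases per domain, `pick_case_index(episode_id, 13)` returns the same index for the same ID on every run, and `same_major_minor("2.1.1", "2.1.0")` is true while `same_major_minor("2.1.1", "2.2.0")` is false.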
362
 
363
  ---
364
 
365
- ## ☑️ Compliance Checklist
366
-
367
- ### Phase 1: Automated Validation (Pass/Fail)
368
- - [x] HF Space deploys and responds to `GET /`
369
- - [x] `openenv.yaml` present with all 9 task IDs
370
- - [x] `POST /reset` returns `episode_id` + `observation` for all 9 tasks
371
- [x] `POST /step` returns `reward` (float, 0.0–1.0) + `done` (bool) + `observation`
372
- - [x] `GET /state` returns episode state
373
- - [x] All endpoints return HTTP 200 (never 500)
374
- - [x] `Dockerfile` at project root, builds cleanly
375
- - [x] `inference.py` at project root, runs under 20 min
376
- - [x] `openenv validate` passes
377
-
378
- ### Phase 2: Agentic Evaluation (Scored)
379
- - [x] Observations include `task_type`, `task_subtype`, `task_description`, `available_actions`
380
- [x] Partial credit graders (F1, NDCG, weighted sub-scores), not binary
381
- - [x] Score variance across 9 tasks (varied difficulty = varied scores)
382
- - [x] `score_details` in step response for grading transparency
383
- - [x] `safe_score()` clamps all rewards to [0.0, 1.0]
384
-
385
- ### Phase 3: Human Review
386
- - [x] 3 real-world domains (security, dependency, clinical)
387
- [x] Multi-turn iterative workflows (identify → fix → revise)
388
- - [x] Rich validation hints for agent learning
389
- - [x] Debug panel with benchmark runner UI
390
- - [x] GRPO-compatible reward shaping
 
7
  app_port: 7860
8
  ---
9
 
10
+ # 🌀 EntropyEnv - Multi-Agent Dev Tools Environment
11
 
12
  > A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
13
  > Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
14
 
15
+ [![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-v1-blue)](https://huggingface.co/docs/openenv)
16
+ [![Tasks](https://img.shields.io/badge/Tasks-9-green)](https://huggingface.co/spaces/immortalindeed/EntropyEnv)
17
+ [![Domains](https://img.shields.io/badge/Domains-3-purple)]()
18
+ [![Cases](https://img.shields.io/badge/Ground--Truth%20Cases-39-orange)]()
19
+
20
  ---
21
 
22
  ## 💡 Why This Environment?
23
 
24
+ Most RL benchmarks test agents on **static, single-turn tasks**: classify this image, answer this question. But real developer workflows are **multi-turn, iterative, and require revision**:
25
 
26
  - A security reviewer doesn't just find a bug; they **identify → propose a fix → revise after feedback**
27
  - A DevOps engineer doesn't just flag outdated packages; they **resolve version conflicts across an entire dependency graph**
28
  - A clinical coordinator doesn't just spot missing steps; they **prioritize by urgency and plan a dependency-safe recovery**
29
 
30
+ **No existing RL environment tests agents on this full identify → act → revise cycle.** EntropyEnv fills that gap with 9 tasks across 3 real-world domains, progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
31
 
32
  ---
33
 
34
  ## 🎯 What Is This?
35
 
36
+ ![EntropyEnv Gradio UI](docs/screenshot.png)
37
 
38
+ EntropyEnv is a **training gym for AI agents**, not the agent itself.
39
+ Think of it like a driving test course: we build the course, and different AI "drivers" take the test.
40
 
41
+ An AI agent connects via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** based on how good the answer is.
42
 
43
  ```
44
  POST /reset
45
+ AI Agent ────────────────────────► EntropyEnv
46
  │
47
+ ├── Picks a task case from the dataset
48
  ├── Returns: observation (the problem)
49
  ◄──────────────────────── │
50
  │
 
63
 
64
  ### 🔒 Domain 1: MCP Security Auditing
65
 
66
+ Agents identify vulnerabilities in code snippets, propose secure fixes, and iteratively revise based on adversarial reviewer feedback.
67
 
68
+ | Task | Difficulty | What the Agent Does |
69
+ |------|-----------|---------------------|
70
+ | `sec_easy` | 🟢 Easy | Classify a single vulnerability (type, CVSS, severity) |
71
+ | `sec_medium` | 🟡 Medium | Identify → propose a code fix |
72
+ | `sec_hard` | 🔴 Hard | Identify → fix → revise with adversarial reviewer feedback |
73
 
74
+ **Coverage:** SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE
75
 
76
  ### 📦 Domain 2: PyTorch Migration Time-Machine
77
 
78
+ Agents detect deprecated APIs, resolve version conflicts using compatibility matrices, and fix `torch.compile` graph-break patterns in dependency order.
79
 
80
+ | Task | Difficulty | What the Agent Does |
81
+ |------|-----------|---------------------|
82
+ | `dep_easy` | 🟢 Easy | Flag outdated packages and deprecated API usage |
83
+ | `dep_medium` | 🟡 Medium | Resolve version conflicts across package constraints |
84
+ | `dep_hard` | 🔴 Hard | Fix `torch.compile` graph-breaks in correct dependency order |
85
 
86
+ **Coverage:** Variable, cuda(), DataParallel, ONNX export, torch.compile, vmap, torch.export
87
 
88
  ### 🏥 Domain 3: Clinical Workflow Chaos Simulator
89
 
90
+ Agents detect missing steps in hospital workflows, rank them by clinical priority, and plan dependency-ordered recovery sequences.
91
 
92
+ | Task | Difficulty | What the Agent Does |
93
+ |------|-----------|---------------------|
94
+ | `cli_easy` | 🟢 Easy | Detect missing workflow steps and assess risk |
95
+ | `cli_medium` | 🟡 Medium | Detect gaps → rank by clinical priority |
96
+ | `cli_hard` | 🔴 Hard | Detect → rank → plan dependency-safe recovery |
97
 
98
+ **Coverage:** Surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code
99
 
100
  ---
101
 
102
+ ## ⚡ Key Features
103
 
104
+ | Feature | Description |
105
+ |---------|-------------|
106
+ | 🎯 **Partial-Credit Scoring** | F1, NDCG, weighted multi-component grading, not binary pass/fail |
107
+ | 🔄 **Multi-Turn Episodes** | Agents iterate through identify → act → revise workflows |
108
+ | 🛡️ **3-Stage Validation** | Schema → Domain → Consistency checks with helpful error hints |
109
+ | 📊 **Score Breakdown** | Per-component feedback in every step so agents learn *what* to improve |
110
+ | 🏎️ **Mastery Detection** | High-performing agents finish early; efficiency is rewarded |
111
+ | 🌐 **Universal LLM Support** | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
112
+ | 🐳 **Docker-Ready** | One-command deploy to Hugging Face Spaces |
113
+ | 📈 **GRPO-Compatible** | Smooth reward gradients designed for policy optimization training |
114
 
115
  ---
116
 
117
+ ## 📡 API Reference
118
 
119
+ | Endpoint | Purpose | Details |
120
+ |--------|------|-------------|
121
+ | `GET /` | Health check | Returns status and available tasks |
122
+ | `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
123
+ | `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
124
+ | `GET /state` | Query state | `?episode_id=xxx` → current episode info |
125
+ | `GET /debug` | Debug panel | Interactive HTML benchmark runner |
126
+ | `GET /web` | Gradio UI | Full task browser with run history |
127
 
128
+ ### Quick Example
129
 
130
+ ```python
131
+ import requests
132
 
133
+ # 1. Start an episode
134
+ resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
135
+ data = resp.json()
136
+ episode_id = data["episode_id"]
137
+ observation = data["observation"]
138
 
139
+ print(observation["task_description"])
140
+ # → "Identify the SQL injection vulnerability in this code snippet."
141
 
142
+ # 2. Send an action
143
+ action = {
144
+     "episode_id": episode_id,
145
+     "action_type": "identify_vulnerability",
146
+     "vuln_type": "sql_injection",
147
+     "cvss_score": 9.1,
148
+     "severity": "critical",
149
+     "affected_line": 3
150
  }
151
+ result = requests.post("http://localhost:7860/step", json=action).json()
152
 
153
+ print(f"Reward: {result['reward']}, Done: {result['done']}")
154
+ # → e.g. Reward: 0.85, Done: True
155
  ```
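The single-step example above extends naturally to a multi-turn episode. The sketch below is illustrative only: `run_episode`, `http_post`, and `choose_action` are hypothetical names, `max_steps=8` is an assumed cap, and the transport uses only the standard library so it runs without extra installs. The `/reset` and `/step` payload shapes follow the API table.

```python
import json
import urllib.request
from typing import Any, Callable

Post = Callable[[str, dict[str, Any]], dict[str, Any]]


def http_post(url: str, payload: dict[str, Any]) -> dict[str, Any]:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(base_url: str, task_id: str,
                choose_action: Callable[[dict[str, Any]], dict[str, Any]],
                post: Post = http_post, max_steps: int = 8) -> list[float]:
    """Reset, then step until the server reports done or max_steps is reached."""
    start = post(f"{base_url}/reset", {"task_id": task_id})
    episode_id, observation = start["episode_id"], start["observation"]
    rewards: list[float] = []
    for _ in range(max_steps):
        action = choose_action(observation)  # the agent's policy goes here
        result = post(f"{base_url}/step", {"episode_id": episode_id, **action})
        rewards.append(result["reward"])
        observation = result["observation"]
        if result["done"]:
            break
    return rewards
```

Passing the transport in as `post` keeps the loop testable without a running server: a stub function that returns canned `/reset` and `/step` responses is enough to exercise it.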
156
 
157
  ---
158
 
159
+ ## 🚀 Getting Started
160
 
161
+ ### Run Locally
162
 
163
+ ```bash
164
+ # Install dependencies
165
+ pip install fastapi uvicorn openai requests packaging gradio python-dotenv
166
 
167
+ # Start the environment
168
+ uvicorn server.app:app --host 0.0.0.0 --port 7860
169
+ ```
170
 
171
+ ### Run with Docker
172
 
173
+ ```bash
174
+ docker build -t entropyenv .
175
+ docker run -p 7860:7860 entropyenv
176
+ ```
177
 
178
+ ### Run the Baseline Agent
179
 
180
  ```bash
 
 
 
 
 
181
  export API_BASE_URL="https://router.huggingface.co/v1"
182
  export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
183
  export HF_TOKEN="your_token_here"
184
  export ENV_URL="http://localhost:7860"
185
 
186
+ python inference.py
187
  ```
188
 
189
  ### Deploy to Hugging Face Spaces
190
 
191
  ```bash
192
  huggingface-cli login
193
+ openenv push --repo-id <username>/EntropyEnv
194
  ```
195
 
196
  ---
197
 
198
+ ## 🏛️ Project Structure
199
 
200
  ```
201
+ entropyenv/
202
+ ├── inference.py # Baseline agent with smart prompt engineering
203
+ ├── openenv.yaml # OpenEnv manifest (9 tasks)
204
+ ├── pyproject.toml # Package configuration
205
+ ├── Dockerfile # Multi-stage Docker build
206
+ ├── server/
207
+ │   ├── app.py # FastAPI server with rate limiting & session management
208
+ │   ├── router.py # Task dispatcher with mastery detection
209
+ │   ├── session.py # Episode state management
210
+ │   ├── web_ui.py # Gradio UI with performance dashboard
211
+ │   ├── demo_agent.py # Rule-based demo agent
212
+ │   ├── benchmark_store.py # Persistent results storage
213
+ │   ├── debug_panel.html # Interactive debug interface
214
+ │   ├── validation/
215
+ │   │   └── validator.py # 3-stage validation with type-casting
216
+ │   ├── graders/
217
+ │   │   ├── base_grader.py # Universal reward pipeline
218
+ │   │   ├── security_grader.py # Security domain grader
219
+ │   │   ├── dependency_grader.py # Dependency domain grader
220
+ │   │   └── clinical_grader.py # Clinical domain grader
221
+ │   └── datasets/
222
+ │       ├── security_cases.py # 13 ground-truth security cases
223
+ │       ├── dependency_cases.py # 13 ground-truth dependency cases
224
+ │       └── clinical_cases.py # 13 ground-truth clinical cases
225
+ └── results/
226
+     └── run_history.json # Benchmark history (auto-created)
227
  ```
228
 
229
+ ---
230
 
231
+ ## 📈 Baseline Performance
232
 
233
+ Tested across multiple model families to ensure universal compatibility:
234
+
235
+ | Model | Family | Average Score |
236
+ |-------|--------|---------------|
237
+ | Llama 3.3 70B | Meta | **0.87** |
238
+ | Qwen3-32B | Alibaba | **0.89** |
239
+ | DeepSeek V3.2 | DeepSeek | **0.86** |
240
 
241
+ The environment provides smooth reward gradients suitable for GRPO-based training of models as small as 8B parameters.
242
+
243
+ ---
244
 
245
+ ## 📝 Inference Log Format
246
 
247
+ The baseline `inference.py` emits structured logs matching the OpenEnv spec:
248
 
249
+ ```
250
+ [START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
251
+ [STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
252
+ [STEP] step=2 action=propose_fix reward=0.92 done=true error=null
253
+ [END] success=true steps=2 score=0.89 rewards=0.85,0.92
254
+ ```
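A formatter that reproduces the log lines shown above might look like this. It is a sketch inferred from the four sample lines only: the function names are hypothetical, and the two-decimal formatting and lowercase booleans are assumptions read off the example rather than taken from the OpenEnv spec text.

```python
from typing import Optional


def format_start(task: str, env: str, model: str) -> str:
    """Render the [START] line for one episode."""
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    """Render one [STEP] line; booleans are lowercased and a missing error is 'null'."""
    error_text = "null" if error is None else error
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error_text}")


def format_end(success: bool, steps: int, score: float, rewards: list[float]) -> str:
    """Render the [END] line with comma-joined per-step rewards."""
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (f"[END] success={str(success).lower()} steps={steps} "
            f"score={score:.2f} rewards={joined}")
```

Calling `format_step(1, "identify_vulnerability", 0.85, False)` yields the first `[STEP]` line of the sample, and `format_end(True, 2, 0.89, [0.85, 0.92])` yields its `[END]` line.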
255
 
256
  ---
257
 
258
+ ## 🤝 Built With
259
 
260
+ - **[FastAPI](https://fastapi.tiangolo.com/)**: High-performance async API framework
261
+ - **[Gradio](https://gradio.app/)**: Interactive web UI for testing and visualization
262
+ - **[PyTorch](https://pytorch.org/)**: Domain expertise for migration tasks
263
+ - **[OpenEnv](https://huggingface.co/docs/openenv)**: Standardized RL environment specification
264
 
265
  ---
266
 
267
+ <p align="center">
268
+ <b>Built with ❤️ for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026</b>
269
+ </p>