Cooked4riyal/EntropyEnv

immortalindeed committed on Apr 9

Commit cff7056 (1 parent: cfda61e)

docs: clean up README for public hackathon submission (hide internal scoring formulas)

Files changed (1)

README.md +147 -268


@@ -7,41 +7,44 @@ sdk: docker
 app_port: 7860
 ---
-# 🛠️ EntropyEnv: Multi-Agent Dev Tools Environment
 > A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
 > Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
 ---
 ## 💡 Why This Environment?
-Most existing RL benchmarks test agents on **static, single-turn tasks** — classify this image, answer this question. But real developer workflows are **multi-turn, iterative, and require revision**:
 - A security reviewer doesn't just find a bug — they **identify → propose a fix → revise after feedback**
 - A DevOps engineer doesn't just flag outdated packages — they **resolve version conflicts across an entire dependency graph**
 - A clinical coordinator doesn't just spot missing steps — they **prioritize by urgency and plan a dependency-safe recovery**
-**No existing RL environment tests agents on this full identify → act → revise cycle.** This environment fills that gap by providing 9 tasks across 3 real-world domains with progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
-**Who would use this?** Teams training AI coding assistants (code review bots), dependency management agents (Dependabot-like systems), and clinical decision support systems.
 ---
 ## 🎯 What Is This?
-![Gradio UI Run History](docs/screenshot.png)
-This is a **training gym for AI agents** — not the agent itself.
-Think of it like a driving test course: you build the course, and different AI "drivers" take the test.
-An AI agent connects to this environment via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** (0.0 – 1.0) based on how good the answer is.
 ```
                     POST /reset
-AI Agent  ────────────────────────►  This Environment
                                      │
-                                     ├── Picks a task case
                                      ├── Returns: observation (the problem)
           ◄────────────────────────  │
                                      │
@@ -60,331 +63,207 @@ AI Agent  ───────────────────────
 ### 🔒 Domain 1: MCP Security Auditing
-Agents must identify vulnerabilities in code snippets, propose fixes, and iteratively revise based on reviewer feedback.
-| Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
-|------|-----------|---------|-----------|-----------|---------|
-| `sec_easy` | Easy | `single` | 4 | 0.80 | `identify_vulnerability` |
-| `sec_medium` | Medium | `multi` | 6 | 0.75 | `identify` → `propose_fix` → `revise_fix` |
-| `sec_hard` | Hard | `adversarial` | 8 | 0.70 | `identify` → `propose_fix` → `revise_fix` (reviewer) |
-**Dataset:** 13 ground-truth cases covering SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE.
 ### 📦 Domain 2: PyTorch Migration Time-Machine
-Agents must detect deprecated APIs, resolve version conflicts, and fix `torch.compile` graph-break patterns.
-| Task | Difficulty | Subtype | Max Steps | Threshold | Actions |
-|------|-----------|---------|-----------|-----------|---------|
-| `dep_easy` | Easy | `flag` | 4 | 0.80 | `flag_outdated` |
-| `dep_medium` | Medium | `resolve` | 6 | 0.75 | `resolve_conflict` |
-| `dep_hard` | Hard | `migrate` | 8 | 0.70 | `migrate_api` / `validate_tree` |
-**Dataset:** 13 ground-truth cases covering Variable, cuda(), DataParallel, ONNX export, torch.compile graph-breaks.
 ### 🏥 Domain 3: Clinical Workflow Chaos Simulator
-Agents must detect missing steps in hospital workflows, rank them by priority, and plan dependency-ordered recovery sequences.
-| Task | Difficulty | Max Steps | Threshold | Actions |
-|------|-----------|-----------|-----------|---------|
-| `cli_easy` | Easy | 4 | 0.80 | `detect_gap` |
-| `cli_medium` | Medium | 6 | 0.75 | `detect_gap` → `rank_issues` |
-| `cli_hard` | Hard | 6 | 0.70 | `detect_gap` → `rank_issues` → `order_steps` |
-**Dataset:** 13 ground-truth cases covering surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code.
 ---
-## 📊 Observation & Action Spaces
-### Observation Space
-Every observation includes these core fields:
-| Field | Type | Description |
-|-------|------|-------------|
-| `task_type` | `str` | Domain: `security`, `dependency`, or `clinical` |
-| `task_id` | `str` | Task identifier (e.g., `sec_easy`) |
-| `task_subtype` | `str` | Variant: `single`, `multi`, `flag`, `resolve`, `migrate` |
-| `task_description` | `str` | Human-readable problem description |
-| `available_actions` | `list[dict]` | Valid actions with parameter specs |
-| `turn` | `int` | Current step number |
-| `done` | `bool` | Whether episode has ended |
-Domain-specific fields are added (e.g., `code_snippet` for security, `compatibility_matrix` for dependency, `events` and `dependency_graph` for clinical).
-### Action Space
-Actions are JSON objects with `action_type` and domain-specific parameters:
-```json
-{"action_type": "identify_vulnerability", "vuln_type": "sql_injection", "cvss_score": 8.5, "severity": "critical", "affected_line": 3}
-{"action_type": "propose_fix", "fix_code": "db.execute(query, (param,))", "explanation": "Use parameterized queries"}
-{"action_type": "flag_outdated", "packages": {"torch": "1.9.0"}, "deprecated_api": "torch.autograd.Variable", "replacement": "plain tensor"}
-{"action_type": "detect_gap", "missing_steps": ["pre_op_consent"], "risk_level": "critical"}
-```
 ---
-## 📊 Scoring System
-### Two-Layer Grading Architecture
-**Layer 1: `base_grader.py`** — Universal reward pipeline applied to ALL domains:
-```
-reward = safe_score(correctness + repetition_penalty + harmful_penalty + efficiency_bonus)
-```
-| Component | Formula | Range |
-|-----------|---------|-------|
-| `compute_correctness()` | Domain-specific (see below) | 0.0 – 1.0 |
-| `repetition_penalty` | −0.15 × count(same action in last 3 turns) | −0.45 – 0.0 |
-| `harmful_output_penalty` | −0.30 if forbidden pattern detected | −0.30 – 0.0 |
-| `efficiency_bonus` | +0.10 if `correctness >= 0.8` and early finish | 0.0 – 0.10 |
-| `safe_score()` | `clamp(score, 0.0, 1.0)` | 0.0 – 1.0 |
-**Layer 2: Domain-specific graders:**
-#### Security Grader
-| Action | Component | Weight |
-|--------|-----------|--------|
-| `identify_vulnerability` | vuln_type match | ×0.45 |
-| `identify_vulnerability` | CVSS in range (partial: ±3.0) | ×0.30 |
-| `identify_vulnerability` | severity match (adjacent: ×0.40) | ×0.25 |
-| `propose_fix` | token coverage + identifier preserved (floor: 0.25) | up to 1.15 |
-| `revise_fix` | feedback keyword coverage − regression (floor: 0.20) | 0.0 – 1.0 |
-#### Dependency Grader
-| Action | Formula |
-|--------|---------|
-| `flag_outdated` | F1 × 0.55 + deprecated_api_match × 0.45 |
-| `resolve_conflict` | valid_pkgs / conflict_count + tree_bonus(0.15) − downgrade(0.10) |
-| `migrate_api` | order_score × 0.30 + completeness × 0.40 + fix_quality × 0.30 |
-#### Clinical Grader
-| Action | Formula |
-|--------|---------|
-| `detect_gap` | F1(predicted, expected) × 0.65 + risk_match × 0.35 |
-| `rank_issues` | completeness × 0.40 + NDCG@k × 0.60 |
-| `order_steps` | order_violations × 0.40 + completeness × 0.40 + efficiency × 0.20 |
-### GRPO Training Signal Quality
-This environment is specifically designed for **Group Relative Policy Optimization**:
-- **Smooth reward ramp** — Scores transition smoothly from 0.0 → 1.0, never binary
-- **Partial credit everywhere** — F1 scoring, NDCG ranking, adjacent-severity credit
-- **Progressive penalty learning** — Schema penalty (−0.20), repetition (−0.15), harmful (−0.30)
-- **Efficiency bonus** — Agents learn to solve faster by finishing early
-- **Floor scores** — Valid workflow attempts always get minimum credit (0.20–0.25)
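The Layer 1 pipeline described in this removed section can be sketched roughly as follows. This is an illustration built only from the component table above, not the contents of `base_grader.py`: apart from `safe_score`, the function names, the argument names, and the choice to pass "harmful" and "early finish" in as booleans are all assumptions.

```python
def safe_score(score: float) -> float:
    """Clamp a raw score into [0.0, 1.0], as the component table describes."""
    return max(0.0, min(1.0, score))


def repetition_penalty(action_type: str, history: list) -> float:
    """-0.15 for each occurrence of the same action in the last 3 turns."""
    return -0.15 * history[-3:].count(action_type)


def layer1_reward(correctness: float, action_type: str, history: list,
                  harmful: bool = False, finished_early: bool = False) -> float:
    """Combine correctness, penalties, and bonus, then clamp (hypothetical helper)."""
    rep = repetition_penalty(action_type, history)
    harm = -0.30 if harmful else 0.0
    bonus = 0.10 if correctness >= 0.8 and finished_early else 0.0
    return safe_score(correctness + rep + harm + bonus)
```

Because the clamp is applied last, stacked penalties can never push a reward below 0.0, and the efficiency bonus can never push it above 1.0.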
----
-## 🔐 Validation (3 Stages)
-Every action goes through 3-stage validation before reaching the grader:
-1. **Schema** — Required fields present? Correct types? (Auto-casts `"8.5"` → `8.5`)
-2. **Domain** — Is `vuln_type` in the valid set? Is `cvss_score` in [0, 10]?
-3. **Consistency** — Is `revise_fix` called after `propose_fix`? No identical repeats?
-If validation fails, the agent gets a **rich feedback observation** (not just 0.0):
-```json
-{
-  "validation_failed": true,
-  "error_type": "domain_error",
-  "message": "cvss_score 12.5 out of range",
-  "hint": "cvss_score must be a float between 0.0 and 10.0",
-  "available_actions": ["identify_vulnerability", "propose_fix", "revise_fix"]
 }
-```
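A minimal sketch of the three validation stages, assuming a single `validate` function that returns either a normalized action or a feedback dict shaped like the JSON above. Only `domain_error` appears in the original text; the `schema_error` and `consistency_error` labels, the `VALID_VULN_TYPES` subset, and the helper names are hypothetical and may differ from `validator.py`.

```python
VALID_VULN_TYPES = {"sql_injection", "xss", "idor", "ssrf", "xxe"}  # assumed subset


def _fail(error_type: str, message: str, hint: str) -> dict:
    """Build a feedback observation for a rejected action."""
    return {"validation_failed": True, "error_type": error_type,
            "message": message, "hint": hint}


def validate(action: dict, history: list) -> tuple:
    """Return (normalized_action, None) on success or (None, feedback) on failure."""
    action = dict(action)  # work on a copy so the caller's dict is untouched
    kind = action.get("action_type")

    # Stage 1: schema (required fields, auto-casting numeric strings)
    if not isinstance(kind, str):
        return None, _fail("schema_error", "action_type is missing",
                           "action_type must be a string")
    if kind == "identify_vulnerability":
        try:
            action["cvss_score"] = float(action["cvss_score"])
        except (KeyError, TypeError, ValueError):
            return None, _fail("schema_error", "cvss_score is missing or not numeric",
                               "cvss_score must be a float between 0.0 and 10.0")

        # Stage 2: domain (values must come from the allowed sets and ranges)
        if action.get("vuln_type") not in VALID_VULN_TYPES:
            return None, _fail("domain_error",
                               f"unknown vuln_type {action.get('vuln_type')!r}",
                               "choose one of: " + ", ".join(sorted(VALID_VULN_TYPES)))
        if not 0.0 <= action["cvss_score"] <= 10.0:
            return None, _fail("domain_error",
                               f"cvss_score {action['cvss_score']} out of range",
                               "cvss_score must be a float between 0.0 and 10.0")

    # Stage 3: consistency (ordering and repeat checks against earlier actions)
    earlier = [a.get("action_type") for a in history]
    if kind == "revise_fix" and "propose_fix" not in earlier:
        return None, _fail("consistency_error", "revise_fix sent before propose_fix",
                           "call propose_fix first")
    if history and history[-1] == action:
        return None, _fail("consistency_error", "identical to the previous action",
                           "change at least one field")
    return action, None
```

Returning the feedback dict instead of raising keeps the episode alive, so an agent can read the hint and retry on its next step.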
----
-## 🏛️ Architecture
-```
-project-root/
-├── inference.py                # Baseline agent (OpenAI-compatible, spec-compliant logs)
-├── openenv.yaml                # OpenEnv manifest (9 tasks declared)
-├── pyproject.toml              # Python package config with openenv-core dependency
-├── Dockerfile                  # Docker build for HF Spaces (port 7860)
-├── server/
-│   ├── app.py                  # FastAPI endpoints: /, /reset, /step, /state, /debug
-│   ├── router.py               # Central dispatcher: observations, done conditions, score_details
-│   ├── session.py              # In-memory session state management
-│   ├── benchmark_store.py      # Persistent JSON results store (survives restarts)
-│   ├── demo_agent.py           # Rule-based demo agent for Gradio UI
-│   ├── web_ui.py               # Gradio UI with task runner and history
-│   ├── debug_panel.html        # Interactive HTML debug panel
-│   ├── validation/
-│   │   └── validator.py        # 3-stage validation: Schema → Domain → Consistency
-│   ├── graders/
-│   │   ├── base_grader.py      # safe_score, grade_dynamic, penalties, bonuses
-│   │   ├── security_grader.py  # Vuln detection, fix quality, feedback coverage
-│   │   ├── dependency_grader.py # F1 scoring, version checking, graph ordering
-│   │   └── clinical_grader.py  # F1, NDCG ranking, dependency-violation counting
-│   └── datasets/
-│       ├── security_cases.py   # 13 cases: SQL injection, XSS, IDOR, SSRF, XXE, etc.
-│       ├── dependency_cases.py # 13 cases: Variable, cuda(), DataParallel, graph-breaks
-│       └── clinical_cases.py   # 13 cases: surgery prep, ER triage, chemo, cardiac, transplant
-└── results/
-    └── run_history.json        # Persistent benchmark results (auto-created)
 ```
 ---
-## 📡 API Endpoints
-| Method | Path | Description |
-|--------|------|-------------|
-| `GET /` | Health check | Returns status, task list, spec version |
-| `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
-| `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
-| `GET /state` | Query state | `?episode_id=xxx` → `{step_count, done, reward_acc}` |
-| `GET /debug` | Debug panel | Interactive HTML benchmark runner |
-| `GET /web` | Gradio UI | Full task browser with run history |
-### Step Response Format
-```json
-{
-  "episode_id": "uuid-string",
-  "step_count": 2,
-  "reward": 0.75,
-  "done": false,
-  "observation": {
-    "task_type": "security",
-    "task_id": "sec_easy",
-    "task_subtype": "single",
-    "task_description": "Identify the SQL injection vulnerability...",
-    "turn": 1,
-    "done": false,
-    "available_actions": [...]
-  },
-  "score_details": {
-    "vuln_type_match": 1.0,
-    "cvss_in_range": 1.0,
-    "severity_match": 0.0
-  }
-}
-```
----
-## 🚀 Setup & Running
-### Prerequisites
-- Python 3.10+
-- `pip install fastapi uvicorn openai requests packaging gradio python-dotenv`
-### Running Locally
 ```bash
-# 1. Start the environment server
-cd multi-agent-dev-tools-env
-uvicorn server.app:app --host 0.0.0.0 --port 7860
-# 2. Run baseline inference (in another terminal)
 export API_BASE_URL="https://router.huggingface.co/v1"
 export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
 export HF_TOKEN="your_token_here"
 export ENV_URL="http://localhost:7860"
-python inference.py
-```
-### Docker
-```bash
-docker build -t multi-agent-dev-tools-env .
-docker run -p 7860:7860 multi-agent-dev-tools-env
 ```
 ### Deploy to Hugging Face Spaces
 ```bash
 huggingface-cli login
-openenv push --repo-id <username>/multi-agent-dev-tools-env
 ```
 ---
-## 📝 Mandatory Log Format
-The `inference.py` emits structured stdout logs matching the spec exactly:
 ```
-[START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
-[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
-[STEP] step=2 action=propose_fix reward=1.00 done=true error=null
-[END] success=true steps=2 score=1.00 rewards=0.85,1.00
 ```
-### Environment Variables (Required)
-| Variable | Description | Example |
-|----------|-------------|---------|
-| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
-| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
-| `HF_TOKEN` | API key / HF token | `hf_xxxxx` or `gsk_xxxxx` |
-| `ENV_URL` | Environment URL | `http://localhost:7860` |
----
-## 📈 Baseline Scores
-Tested with multiple model families for universal compatibility:
-| Model | Family | Parameters | Average Score |
-|-------|--------|------------|---------------|
-| Llama 3.3 70B | Meta | 70B | **0.87** |
-| Qwen3-32B | Alibaba | 32B | **0.89** |
-| DeepSeek V3.2 | DeepSeek | MoE | **0.86** |
-The environment provides smooth reward gradients that enable GRPO training of smaller models (8B+).
 ---
-## 🔧 Key Design Decisions
-1. **Data-driven done conditions** — `completion_threshold` and `required_sequence` stored per case
-2. **Universal model compatibility** — Strips `<think>`, `<reasoning>`, `<antThinking>` etc.
-3. **Type-casting validator** — Auto-converts `"8.5"` → `8.5` before rejecting
-4. **Floor scores** — Valid workflow attempts always get minimum credit
-5. **Deterministic case selection** — `hash(episode_id) % len(cases)` for reproducibility
-6. **Compatibility matrix separation** — Prevents context truncation for large observations
-7. **Patch-level version fuzzy** — `2.1.1` matches `2.1.0` by major.minor
-8. **Hallucination filter** — `_score_rank` filters step IDs not in `available_steps`
-9. **Persistent results** — `benchmark_store.py` writes to disk, survives restarts
-10. **Robust dependency fallback** — Works without `packaging` module via manual version parsing
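One caveat on item 5: Python's built-in `hash()` salts string hashes per process unless `PYTHONHASHSEED` is fixed, so `hash(episode_id) % len(cases)` only reproduces across restarts under that setting. Whether the project depends on cross-restart reproducibility is not shown here. A minimal sketch of a process-independent alternative, assuming string episode IDs; `pick_case` is a hypothetical name, not a function from the repository:

```python
import hashlib


def pick_case(episode_id: str, cases: list):
    """Map an episode ID to a case using a digest that is stable across runs."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the case count.
    index = int.from_bytes(digest[:8], "big") % len(cases)
    return cases[index]
```

With this approach the same `episode_id` selects the same case on any machine, at the cost of one SHA-256 call per reset.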
 ---
-## ☑️ Compliance Checklist
-### Phase 1: Automated Validation (Pass/Fail)
-- [x] HF Space deploys and responds to `GET /`
-- [x] `openenv.yaml` present with all 9 task IDs
-- [x] `POST /reset` returns `episode_id` + `observation` for all 9 tasks
-- [x] `POST /step` returns `reward` (float, 0.0–1.0) + `done` (bool) + `observation`
-- [x] `GET /state` returns episode state
-- [x] All endpoints return HTTP 200 (never 500)
-- [x] `Dockerfile` at project root, builds cleanly
-- [x] `inference.py` at project root, runs under 20 min
-- [x] `openenv validate` passes
-### Phase 2: Agentic Evaluation (Scored)
-- [x] Observations include `task_type`, `task_subtype`, `task_description`, `available_actions`
-- [x] Partial credit graders (F1, NDCG, weighted sub-scores) — not binary
-- [x] Score variance across 9 tasks (varied difficulty = varied scores)
-- [x] `score_details` in step response for grading transparency
-- [x] `safe_score()` clamps all rewards to [0.0, 1.0]
-### Phase 3: Human Review
-- [x] 3 real-world domains (security, dependency, clinical)
-- [x] Multi-turn iterative workflows (identify → fix → revise)
-- [x] Rich validation hints for agent learning
-- [x] Debug panel with benchmark runner UI
-- [x] GRPO-compatible reward shaping

 app_port: 7860
 ---
+# 🌀 EntropyEnv — Multi-Agent Dev Tools Environment
 > A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
 > Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
+[![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-v1-blue)](https://huggingface.co/docs/openenv)
+[![Tasks](https://img.shields.io/badge/Tasks-9-green)](https://huggingface.co/spaces/immortalindeed/EntropyEnv)
+[![Domains](https://img.shields.io/badge/Domains-3-purple)]()
+[![Cases](https://img.shields.io/badge/Ground--Truth%20Cases-39-orange)]()
 ---
 ## 💡 Why This Environment?
+Most RL benchmarks test agents on **static, single-turn tasks** — classify this image, answer this question. But real developer workflows are **multi-turn, iterative, and require revision**:
 - A security reviewer doesn't just find a bug — they **identify → propose a fix → revise after feedback**
 - A DevOps engineer doesn't just flag outdated packages — they **resolve version conflicts across an entire dependency graph**
 - A clinical coordinator doesn't just spot missing steps — they **prioritize by urgency and plan a dependency-safe recovery**
+**No existing RL environment tests agents on this full identify → act → revise cycle.** EntropyEnv fills that gap with 9 tasks across 3 real-world domains, progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
 ---
 ## 🎯 What Is This?
+![EntropyEnv Gradio UI](docs/screenshot.png)
+EntropyEnv is a **training gym for AI agents** — not the agent itself.
+Think of it like a driving test course: we build the course, and different AI "drivers" take the test.
+An AI agent connects via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** based on how good the answer is.
 ```
                     POST /reset
+AI Agent  ────────────────────────►  EntropyEnv
                                      │
+                                     ├── Picks a task case from the dataset
                                      ├── Returns: observation (the problem)
           ◄────────────────────────  │
                                      │
 ### 🔒 Domain 1: MCP Security Auditing
+Agents identify vulnerabilities in code snippets, propose secure fixes, and iteratively revise based on adversarial reviewer feedback.
+| Task | Difficulty | What the Agent Does |
+|------|-----------|---------------------|
+| `sec_easy` | 🟢 Easy | Classify a single vulnerability (type, CVSS, severity) |
+| `sec_medium` | 🟡 Medium | Identify → propose a code fix |
+| `sec_hard` | 🔴 Hard | Identify → fix → revise with adversarial reviewer feedback |
+**Coverage:** SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE
 ### 📦 Domain 2: PyTorch Migration Time-Machine
+Agents detect deprecated APIs, resolve version conflicts using compatibility matrices, and fix `torch.compile` graph-break patterns in dependency order.
+| Task | Difficulty | What the Agent Does |
+|------|-----------|---------------------|
+| `dep_easy` | 🟢 Easy | Flag outdated packages and deprecated API usage |
+| `dep_medium` | 🟡 Medium | Resolve version conflicts across package constraints |
+| `dep_hard` | 🔴 Hard | Fix torch.compile graph-breaks in correct dependency order |
+**Coverage:** Variable, cuda(), DataParallel, ONNX export, torch.compile, vmap, torch.export
 ### 🏥 Domain 3: Clinical Workflow Chaos Simulator
+Agents detect missing steps in hospital workflows, rank them by clinical priority, and plan dependency-ordered recovery sequences.
+| Task | Difficulty | What the Agent Does |
+|------|-----------|---------------------|
+| `cli_easy` | 🟢 Easy | Detect missing workflow steps and assess risk |
+| `cli_medium` | 🟡 Medium | Detect gaps → rank by clinical priority |
+| `cli_hard` | 🔴 Hard | Detect → rank → plan dependency-safe recovery |
+**Coverage:** Surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code
 ---
+## ⚡ Key Features
+| Feature | Description |
+|---------|-------------|
+| 🎯 **Partial-Credit Scoring** | F1, NDCG, weighted multi-component grading — not binary pass/fail |
+| 🔄 **Multi-Turn Episodes** | Agents iterate through identify → act → revise workflows |
+| 🛡️ **3-Stage Validation** | Schema → Domain → Consistency checks with helpful error hints |
+| 📊 **Score Breakdown** | Per-component feedback in every step so agents learn *what* to improve |
+| 🏎️ **Mastery Detection** | High-performing agents finish early — efficiency is rewarded |
+| 🌐 **Universal LLM Support** | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
+| 🐳 **Docker-Ready** | One-command deploy to Hugging Face Spaces |
+| 📈 **GRPO-Compatible** | Smooth reward gradients designed for policy optimization training |
 ---
+## 📡 API Reference
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET /` | Health check | Returns status and available tasks |
+| `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
+| `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
+| `GET /state` | Query state | `?episode_id=xxx` → current episode info |
+| `GET /debug` | Debug panel | Interactive HTML benchmark runner |
+| `GET /web` | Gradio UI | Full task browser with run history |
+### Quick Example
+```python
+import requests
+# 1. Start an episode
+resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
+data = resp.json()
+episode_id = data["episode_id"]
+observation = data["observation"]
+print(observation["task_description"])
+# → "Identify the SQL injection vulnerability in this code snippet."
+# 2. Send an action
+action = {
+    "episode_id": episode_id,
+    "action_type": "identify_vulnerability",
+    "vuln_type": "sql_injection",
+    "cvss_score": 9.1,
+    "severity": "critical",
+    "affected_line": 3
 }
+result = requests.post("http://localhost:7860/step", json=action).json()
+print(f"Reward: {result['reward']}, Done: {result['done']}")
+# → Reward: 0.85, Done: True
 ```
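The quick example sends a single action; multi-turn tasks such as `sec_medium` need a loop that keeps calling `/step` until `done` is true. A minimal sketch of such a loop follows, using only the request and response fields listed in the API table. The names `run_episode`, `choose_action`, and `http_post`, and the `max_steps=8` default, are assumptions for illustration; the transport is passed in as a callable so the loop can be exercised without a running server.

```python
def run_episode(post, base_url: str, task_id: str, choose_action, max_steps: int = 8) -> list:
    """Reset an episode, then step until the environment reports done.

    post(url, payload) must return the decoded JSON response as a dict.
    choose_action(observation) is a hypothetical policy returning an action dict.
    """
    start = post(f"{base_url}/reset", {"task_id": task_id})
    episode_id, observation = start["episode_id"], start["observation"]
    rewards = []
    for _ in range(max_steps):
        # Attach the episode ID to whatever the policy decided to do.
        action = dict(choose_action(observation), episode_id=episode_id)
        result = post(f"{base_url}/step", action)
        rewards.append(result["reward"])
        observation = result["observation"]
        if result["done"]:
            break
    return rewards


def http_post(url: str, payload: dict) -> dict:
    """Real transport built on requests (imported lazily so the sketch loads without it)."""
    import requests
    return requests.post(url, json=payload, timeout=30).json()
```

Against a local server this could be called as `run_episode(http_post, "http://localhost:7860", "sec_medium", my_policy)`, where `my_policy` is whatever function maps an observation to the next action.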
 ---
+## 🚀 Getting Started
+### Run Locally
+```bash
+# Install dependencies
+pip install fastapi uvicorn openai requests packaging gradio python-dotenv
+# Start the environment
+uvicorn server.app:app --host 0.0.0.0 --port 7860
+```
+### Run with Docker
+```bash
+docker build -t entropyenv .
+docker run -p 7860:7860 entropyenv
+```
+### Run the Baseline Agent
 ```bash
 export API_BASE_URL="https://router.huggingface.co/v1"
 export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
 export HF_TOKEN="your_token_here"
 export ENV_URL="http://localhost:7860"
+python inference.py
 ```
 ### Deploy to Hugging Face Spaces
 ```bash
 huggingface-cli login
+openenv push --repo-id <username>/EntropyEnv
 ```
 ---
+## 🏛️ Project Structure
 ```
+entropyenv/
+├── inference.py                # Baseline agent with smart prompt engineering
+├── openenv.yaml                # OpenEnv manifest (9 tasks)
+├── pyproject.toml              # Package configuration
+├── Dockerfile                  # Multi-stage Docker build
+├── server/
+│   ├── app.py                  # FastAPI server with rate limiting & session management
+│   ├── router.py               # Task dispatcher with mastery detection
+│   ├── session.py              # Episode state management
+│   ├── web_ui.py               # Gradio UI with performance dashboard
+│   ├── demo_agent.py           # Rule-based demo agent
+│   ├── benchmark_store.py      # Persistent results storage
+│   ├── debug_panel.html        # Interactive debug interface
+│   ├── validation/
+│   │   └── validator.py        # 3-stage validation with type-casting
+│   ├── graders/
+│   │   ├── base_grader.py      # Universal reward pipeline
+│   │   ├── security_grader.py  # Security domain grader
+│   │   ├── dependency_grader.py # Dependency domain grader
+│   │   └── clinical_grader.py  # Clinical domain grader
+│   └── datasets/
+│       ├── security_cases.py   # 13 ground-truth security cases
+│       ├── dependency_cases.py # 13 ground-truth dependency cases
+│       └── clinical_cases.py   # 13 ground-truth clinical cases
+└── results/
+    └── run_history.json        # Benchmark history (auto-created)
 ```
+---
+## 📈 Baseline Performance
+Tested across multiple model families to ensure universal compatibility:
+| Model | Family | Average Score |
+|-------|--------|---------------|
+| Llama 3.3 70B | Meta | **0.87** |
+| Qwen3-32B | Alibaba | **0.89** |
+| DeepSeek V3.2 | DeepSeek | **0.86** |
+The environment provides smooth reward gradients suitable for GRPO-based training of models as small as 8B parameters.
+---
+## 📝 Inference Log Format
+The baseline `inference.py` emits structured logs matching the OpenEnv spec:
+```
+[START] task=sec_easy env=multi-agent-dev-tools-env model=Qwen/Qwen2.5-72B-Instruct
+[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
+[STEP] step=2 action=propose_fix reward=0.92 done=true error=null
+[END] success=true steps=2 score=0.89 rewards=0.85,0.92
+```
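Since each log line is a bracketed tag followed by space-separated `key=value` pairs, it can be read back with a few lines of Python. The sketch below assumes values never contain spaces, which holds for the sample lines above; `parse_log_line` and `parse_rewards` are hypothetical helpers, not part of `inference.py`.

```python
import re

# Matches the three tags shown in the sample log and captures the rest of the line.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str):
    """Return (tag, fields) for a structured log line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    # Split each token on the first "=" only; tokens without "=" are skipped.
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return tag, fields


def parse_rewards(fields: dict) -> list:
    """Turn the comma-separated rewards value of an [END] line into floats."""
    return [float(value) for value in fields["rewards"].split(",")]
```

All values are kept as strings so that callers decide how to interpret `done=false` or `error=null`.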
 ---
+## 🤝 Built With
+- **[FastAPI](https://fastapi.tiangolo.com/)** — High-performance async API framework
+- **[Gradio](https://gradio.app/)** — Interactive web UI for testing and visualization
+- **[PyTorch](https://pytorch.org/)** — Domain expertise for migration tasks
+- **[OpenEnv](https://huggingface.co/docs/openenv)** — Standardized RL environment specification
 ---
+<p align="center">
+  <b>Built with ❤️ for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026</b>
+</p>