---
title: EntropyEnv
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
---

# EntropyEnv: Multi-Agent Dev Tools Environment

> A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
> Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.

[OpenEnv documentation](https://huggingface.co/docs/openenv) · [Hugging Face Space](https://huggingface.co/spaces/immortalindeed/EntropyEnv)

---

## Why This Environment?

Most RL benchmarks test agents on **static, single-turn tasks**: classify this image, answer this question. Real developer workflows, by contrast, are **multi-turn and iterative, and they require revision**:

- A security reviewer doesn't just find a bug; they **identify → propose a fix → revise after feedback**.
- A DevOps engineer doesn't just flag outdated packages; they **resolve version conflicts across an entire dependency graph**.
- A clinical coordinator doesn't just spot missing steps; they **prioritize by urgency and plan a dependency-safe recovery**.

**To our knowledge, no existing RL environment tests agents on this full identify → act → revise cycle.** EntropyEnv fills that gap with 9 tasks across 3 real-world domains, progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.

---

## What Is This?

EntropyEnv is a **training gym for AI agents**, not the agent itself. Think of it as a driving test course: we build the course, and different AI "drivers" take the test.

An AI agent connects via the API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** based on the quality of that answer.

```
                     POST /reset
  AI Agent  ───────────────────────►  EntropyEnv
                                          │
                                          ├── Picks a task case from the dataset
                                          └── Returns: observation (the problem)
            ◄───────────────────────      │
                                          │
                     POST /step           │
            ───────────────────────►      │
                                          ├── Validates the action (3 stages)
                                          ├── Grades it (domain-specific grader)
            ◄───────────────────────      └── Returns: reward + done + next observation
                                          │
            (repeat until done)           │
```
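
The reset/step cycle in the diagram can be sketched as a small client loop. This is a minimal sketch, not the project's own client: `run_episode`, `make_fake_env`, the `post` and `policy` callables, and the `max_steps` cap are all hypothetical names introduced here, and the fake environment only mimics the response fields shown in this README (`episode_id`, `observation`, `reward`, `done`). Passing the transport in as a callable keeps the loop testable without a running server; in practice `post` could wrap `requests.post` against the server URL.

```python
from typing import Any, Callable, Dict, List

# `post(path, body)` sends a JSON body and returns the decoded JSON reply.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]
# `policy(observation)` returns the action payload (without episode_id).
PolicyFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_episode(post: PostFn, policy: PolicyFn, task_id: str,
                max_steps: int = 10) -> List[float]:
    """Drive one episode and return the per-step rewards."""
    started = post("/reset", {"task_id": task_id})
    episode_id = started["episode_id"]
    observation = started["observation"]
    rewards: List[float] = []

    for _ in range(max_steps):  # max_steps is an assumed safety cap
        action = dict(policy(observation), episode_id=episode_id)
        result = post("/step", action)
        rewards.append(float(result["reward"]))
        if result["done"]:
            break
        observation = result["observation"]
    return rewards


def make_fake_env(total_steps: int = 2) -> PostFn:
    """In-memory stand-in for the server, used only to exercise the loop."""
    state = {"step": 0}

    def post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        if path == "/reset":
            state["step"] = 0
            return {"episode_id": "demo", "observation": {"turn": 0}}
        state["step"] += 1
        return {"reward": 0.5,
                "done": state["step"] >= total_steps,
                "observation": {"turn": state["step"]}}

    return post


print(run_episode(make_fake_env(), lambda obs: {"action_type": "noop"}, "sec_hard"))
# [0.5, 0.5]
```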

---

## Three Domains, Nine Tasks

### Domain 1: MCP Security Auditing

Agents identify vulnerabilities in code snippets, propose secure fixes, and iteratively revise them based on adversarial reviewer feedback.

| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `sec_easy` | 🟢 Easy | Classify a single vulnerability (type, CVSS, severity) |
| `sec_medium` | 🟡 Medium | Identify → propose a code fix |
| `sec_hard` | 🔴 Hard | Identify → fix → revise with adversarial reviewer feedback |

**Coverage:** SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE

### Domain 2: PyTorch Migration Time-Machine

Agents detect deprecated APIs, resolve version conflicts using compatibility matrices, and fix `torch.compile` graph-break patterns in dependency order.

| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `dep_easy` | 🟢 Easy | Flag outdated packages and deprecated API usage |
| `dep_medium` | 🟡 Medium | Resolve version conflicts across package constraints |
| `dep_hard` | 🔴 Hard | Fix `torch.compile` graph breaks in correct dependency order |

**Coverage:** `Variable`, `cuda()`, `DataParallel`, ONNX export, `torch.compile`, `vmap`, `torch.export`
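
Applying fixes "in correct dependency order" amounts to a topological sort over the fixes. The sketch below uses the standard library's `graphlib` to show the idea; it is an illustration, not the grader's logic. `fix_order` and the three fix names are hypothetical, and the dependency map is invented for the example.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def fix_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return fixes ordered so that every prerequisite comes first.

    `depends_on` maps each fix to the fixes that must land before it.
    Raises ValueError when the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


# Hypothetical graph-break fixes; the names are illustrative only.
example = {
    "remove_data_dependent_branch": set(),
    "replace_python_side_effect": {"remove_data_dependent_branch"},
    "enable_fullgraph_compile": {"replace_python_side_effect"},
}
print(fix_order(example))
```

An agent's proposed order could then be compared against any valid topological order, which is one plausible way to award partial credit for mostly-correct sequences.
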
### Domain 3: Clinical Workflow Chaos Simulator

Agents detect missing steps in hospital workflows, rank them by clinical priority, and plan dependency-ordered recovery sequences.

| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `cli_easy` | 🟢 Easy | Detect missing workflow steps and assess risk |
| `cli_medium` | 🟡 Medium | Detect gaps → rank by clinical priority |
| `cli_hard` | 🔴 Hard | Detect → rank → plan dependency-safe recovery |

**Coverage:** Surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code
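
Ranking tasks such as `cli_medium` suit rank-aware partial credit, and this README lists NDCG among its scoring components. The sketch below shows how NDCG rewards a ranking that puts the most important steps first. It is a generic textbook formulation, not the project's grader: the function names, step names, and relevance grades are assumptions made for illustration. The ideal score is computed over all relevant steps, so omitting a step lowers the result.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))


def ndcg(ranking: List[str], relevance: Dict[str, float]) -> float:
    """Score a predicted ranking against graded relevance, in [0, 1]."""
    seen = set()
    gains = []
    for item in ranking:
        # A repeated or unknown step earns no gain.
        gains.append(0.0 if item in seen else relevance.get(item, 0.0))
        seen.add(item)
    ideal = dcg(sorted(relevance.values(), reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical missing steps with invented priority grades.
priority = {"verify_consent": 3.0, "check_allergies": 2.0, "confirm_fasting": 1.0}
print(ndcg(["verify_consent", "check_allergies", "confirm_fasting"], priority))
print(ndcg(["confirm_fasting", "check_allergies", "verify_consent"], priority))
```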

---

## Key Features

| Feature | Description |
|---------|-------------|
| **Partial-Credit Scoring** | F1, NDCG, and weighted multi-component grading rather than binary pass/fail |
| **Multi-Turn Episodes** | Agents iterate through identify → act → revise workflows |
| **3-Stage Validation** | Schema → Domain → Consistency checks with helpful error hints |
| **Score Breakdown** | Per-component feedback at every step so agents learn *what* to improve |
| **Fatal Error Handling** | Automatic 401/402/403 detection stops wasted API calls immediately |
| **Universal LLM Support** | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
| **Docker-Ready** | One-command deploy to Hugging Face Spaces |
| **GRPO-Compatible** | Smooth reward gradients designed for policy optimization training |
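
One way to picture the Schema → Domain → Consistency stages is a validator that stops at the first failing stage and returns hints. This is a sketch under assumptions, not the contents of `validator.py`: the field names are taken from the quick example in this README, while the specific rules (required fields, allowed severities, and checking severity against the CVSS v3.x rating bands) are assumptions chosen for illustration.

```python
from typing import Any, Dict, List

# Field names follow the README's quick example; the rules are assumed.
REQUIRED_FIELDS = {
    "action_type": str,
    "vuln_type": str,
    "cvss_score": (int, float),
    "severity": str,
}
SEVERITIES = ("none", "low", "medium", "high", "critical")


def cvss_band(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if score == 0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def validate(action: Dict[str, Any]) -> List[str]:
    """Return error hints; an empty list means every stage passed."""
    # Stage 1 (schema): required fields exist and have the expected type.
    errors: List[str] = []
    for name, kind in REQUIRED_FIELDS.items():
        value = action.get(name)
        if isinstance(value, bool) or not isinstance(value, kind):
            errors.append(f"schema: '{name}' is missing or has the wrong type")
    if errors:
        return errors  # later stages assume well-typed fields

    # Stage 2 (domain): values fall inside the allowed sets and ranges.
    if action["severity"] not in SEVERITIES:
        errors.append(f"domain: severity must be one of {', '.join(SEVERITIES)}")
    if not 0.0 <= action["cvss_score"] <= 10.0:
        errors.append("domain: cvss_score must be between 0.0 and 10.0")
    if errors:
        return errors

    # Stage 3 (consistency): fields agree with each other.
    expected = cvss_band(action["cvss_score"])
    if action["severity"] != expected:
        errors.append(
            f"consistency: cvss_score {action['cvss_score']} implies severity '{expected}'"
        )
    return errors
```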

---

## API Reference

| Endpoint | Purpose | Details |
|----------|---------|---------|
| `GET /` | Health check | Returns status and available tasks |
| `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
| `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
| `GET /state` | Query state | `?episode_id=xxx` → current episode info |
| `GET /debug` | Debug panel | Interactive HTML benchmark runner |
| `GET /web` | Gradio UI | Full task browser with run history |

### Quick Example

```python
import requests

# 1. Start an episode
resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
data = resp.json()
episode_id = data["episode_id"]
observation = data["observation"]
print(observation["task_description"])
# → "Identify the SQL injection vulnerability in this code snippet."

# 2. Send an action
action = {
    "episode_id": episode_id,
    "action_type": "identify_vulnerability",
    "vuln_type": "sql_injection",
    "cvss_score": 9.1,
    "severity": "critical",
    "affected_line": 3
}
result = requests.post("http://localhost:7860/step", json=action).json()
print(f"Reward: {result['reward']}, Done: {result['done']}")
# → Reward: 0.85, Done: True
```

---

## Getting Started

### Run Locally

```bash
# Install dependencies
pip install fastapi uvicorn openai requests packaging gradio python-dotenv

# Start the environment
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

### Run with Docker

```bash
docker build -t entropyenv .
docker run -p 7860:7860 entropyenv
```

### Run the Baseline Agent

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_URL="http://localhost:7860"
python inference.py
```
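
The feature list says that 401/402/403 responses stop further API calls. A retry wrapper with that behavior might look like the sketch below. It is not taken from `inference.py`: `call_with_retries`, `FatalAPIError`, and the `send` callable are hypothetical, and treating every other non-2xx status as retryable is an assumption. The `send` callable returns a `(status_code, body)` pair so the logic can be exercised without a network connection.

```python
import time
from typing import Callable, Optional, Tuple

# 401 Unauthorized, 402 Payment Required, 403 Forbidden: retrying cannot help.
FATAL_STATUS = {401, 402, 403}


class FatalAPIError(RuntimeError):
    """Raised when the provider rejects the call for a non-transient reason."""


def call_with_retries(send: Callable[[], Tuple[int, str]],
                      attempts: int = 3, backoff: float = 0.0) -> str:
    """Call `send` until it returns a 2xx status or attempts run out.

    Fatal statuses abort immediately; other failures are retried with
    exponential backoff (disabled by default for the demo).
    """
    last_status: Optional[int] = None
    for attempt in range(attempts):
        status, body = send()
        if 200 <= status < 300:
            return body
        if status in FATAL_STATUS:
            raise FatalAPIError(f"HTTP {status}: stopping, retries cannot succeed")
        last_status = status
        if attempt < attempts - 1:
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts (last status {last_status})")


# Demo with canned responses: one transient failure, then success.
responses = iter([(500, "server error"), (200, "ok")])
print(call_with_retries(lambda: next(responses)))  # ok
```

Failing fast on these codes matters for benchmark runs: an invalid token or exhausted credits would otherwise fail every remaining task in the same way.
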
### Deploy to Hugging Face Spaces

```bash
huggingface-cli login
openenv push --repo-id <username>/EntropyEnv
```

---

## Project Structure

```
entropyenv/
├── inference.py                  # Baseline agent with smart prompt engineering
├── openenv.yaml                  # OpenEnv manifest (9 tasks)
├── pyproject.toml                # Package configuration
├── Dockerfile                    # Multi-stage Docker build
├── server/
│   ├── app.py                    # FastAPI server with session management
│   ├── router.py                 # Task dispatcher with Counter-based sequence checking
│   ├── session.py                # Episode state management
│   ├── web_ui.py                 # Gradio UI with performance dashboard
│   ├── demo_agent.py             # Rule-based demo agent
│   ├── benchmark_store.py        # Persistent results storage
│   ├── debug_panel.html          # Interactive debug interface
│   ├── validation/
│   │   └── validator.py          # 3-stage validation with type-casting
│   ├── graders/
│   │   ├── base_grader.py        # Universal reward pipeline
│   │   ├── security_grader.py    # Security domain grader
│   │   ├── dependency_grader.py  # Dependency domain grader
│   │   └── clinical_grader.py    # Clinical domain grader
│   └── datasets/
│       ├── security_cases.py     # 13 ground-truth security cases
│       ├── dependency_cases.py   # 13 ground-truth dependency cases
│       └── clinical_cases.py     # 13 ground-truth clinical cases
└── results/
    └── run_history.json          # Benchmark history (auto-created)
```

---

## Baseline Performance

> **Note:** Scores below are from the latest grading revision (v3: weighted 0.60 × max + 0.40 × mean scoring, `difficulty_multiplier` removed, `dep_hard` done-condition fixed). Re-benchmarking across 14+ models is in progress.

| Model | Provider | sec_easy | sec_med | sec_hard | dep_easy | dep_med | dep_hard | cli_easy | cli_med | cli_hard | **Avg** |
|-------|----------|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:-------:|
| *(Run `python unnecessary/run_14_models.py` to auto-populate this table)* | | | | | | | | | | | |

**Scoring formula:** `score = 0.60 × max(step_rewards) + 0.40 × mean(step_rewards)`, clamped to `[0.01, 0.99]`
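
The scoring formula translates directly into a few lines of Python. This sketch follows the formula as stated; the function name `episode_score` and the choice to return the floor for an episode with no steps are assumptions. The example compares an uneven episode with a consistent one to show how the mean term pulls the score down when only one step is strong.

```python
from statistics import fmean
from typing import Sequence


def episode_score(step_rewards: Sequence[float],
                  low: float = 0.01, high: float = 0.99) -> float:
    """Blend the best and the average step reward, then clamp."""
    if not step_rewards:
        return low  # assumption: an episode without steps gets the floor
    raw = 0.60 * max(step_rewards) + 0.40 * fmean(step_rewards)
    return min(high, max(low, raw))


print(round(episode_score([0.40, 0.90]), 2))  # 0.8  (one strong step, one weak)
print(round(episode_score([0.90, 0.90]), 2))  # 0.9  (consistently strong)
print(episode_score([1.0, 1.0]))              # 0.99 (clamped at the ceiling)
```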

**Design principles:**

- **No artificial difficulty caps**: scores reflect actual grader correctness
- **Weighted blend**: rewards consistently good episodes over a single lucky step
- **Spec-compliant**: `[END]` lines follow the mandatory three-line log format
- **14+ model families tested** for universal compatibility

---

## Inference Log Format

The baseline `inference.py` emits structured logs matching the OpenEnv spec:

```
[START] task=sec_medium env=EntropyEnv model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
[STEP] step=2 action=propose_fix reward=0.92 done=true error=null
[END] success=true steps=2 score=0.91 rewards=0.85,0.92
```
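
Because each log line is a tag followed by `key=value` pairs, the output is easy to post-process. The parser below is a minimal sketch rather than part of the project: `parse_log_line` is a hypothetical helper, and it assumes that values never contain spaces, which holds for the sample lines above but may not hold for free-form error messages.

```python
import re
from typing import Dict, Optional, Tuple

LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Split one log line into its tag and key=value fields.

    Returns None for lines that do not start with a known tag.
    All values are returned as strings; callers convert as needed.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    fields = dict(pair.split("=", 1) for pair in rest.split() if "=" in pair)
    return tag, fields


tag, fields = parse_log_line(
    "[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null"
)
print(tag, float(fields["reward"]), fields["done"] == "true")
# STEP 0.85 False
```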

---

## Built With

- **[FastAPI](https://fastapi.tiangolo.com/)**: High-performance async API framework
- **[Gradio](https://gradio.app/)**: Interactive web UI for testing and visualization
- **[PyTorch](https://pytorch.org/)**: Domain expertise for migration tasks
- **[OpenEnv](https://huggingface.co/docs/openenv)**: Standardized RL environment specification

---

<p align="center">
<b>Built with ❤️ for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026</b>
</p>