---
title: EntropyEnv
emoji: 🌀
colorFrom: purple
colorTo: blue
sdk: docker
app_port: 7860
---
# 🌀 EntropyEnv: Multi-Agent Dev Tools Environment
> A multi-domain RL environment for training and evaluating AI agents on **real-world developer and clinical tasks**.
> Built for the **Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026**.
[![OpenEnv Compatible](https://img.shields.io/badge/OpenEnv-v1-blue)](https://huggingface.co/docs/openenv)
[![Tasks](https://img.shields.io/badge/Tasks-9-green)](https://huggingface.co/spaces/immortalindeed/EntropyEnv)
[![Domains](https://img.shields.io/badge/Domains-3-purple)]()
[![Cases](https://img.shields.io/badge/Ground--Truth%20Cases-39-orange)]()
---
## 💡 Why This Environment?
Most RL benchmarks test agents on **static, single-turn tasks**: classify this image, answer this question. But real developer workflows are **multi-turn and iterative, and they require revision**:
- A security reviewer doesn't just find a bug; they **identify → propose a fix → revise after feedback**
- A DevOps engineer doesn't just flag outdated packages; they **resolve version conflicts across an entire dependency graph**
- A clinical coordinator doesn't just spot missing steps; they **prioritize by urgency and plan a dependency-safe recovery**
**To our knowledge, no existing RL environment tests agents on this full identify → act → revise cycle.** EntropyEnv fills that gap with 9 tasks across 3 real-world domains, progressive difficulty, rich partial-credit scoring, and iterative multi-turn episodes.
---
## 🎯 What Is This?
EntropyEnv is a **training gym for AI agents**, not the agent itself.
Think of it like a driving test course: we build the course, and different AI "drivers" take the test.
An AI agent connects via API, receives a **task** (e.g., "find the vulnerability in this code"), sends back an **action** (its answer), and gets a **reward score** based on how good the answer is.
```
                     POST /reset
  AI Agent  ────────────────────────►  EntropyEnv
                                           │
                                           ├── Picks a task case from the dataset
                                           ├── Returns: observation (the problem)
            ◄────────────────────────      │
                                           │
                     POST /step            │
            ────────────────────────►      │
                                           ├── Validates the action (3 stages)
                                           ├── Grades it (domain-specific grader)
            ◄────────────────────────      ├── Returns: reward + done + next observation
                                           │
            (repeat until done)            │
```
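
The reset/step cycle in the diagram can be sketched as a small client loop. This is a minimal illustration, not the baseline agent: `run_episode`, `http_post_json`, the `policy` callable, and the `max_steps` cap are hypothetical names, and the response fields (`episode_id`, `observation`, `reward`, `done`) are taken from the API Reference table.

```python
import json
import urllib.request
from typing import Callable

# post_json(path, payload) -> decoded JSON response (hypothetical helper type)
PostJson = Callable[[str, dict], dict]


def http_post_json(base_url: str) -> PostJson:
    """Build a POST helper bound to one EntropyEnv server (stdlib only)."""
    def post(path: str, payload: dict) -> dict:
        req = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return post


def run_episode(post_json: PostJson, task_id: str,
                policy: Callable[[dict], dict], max_steps: int = 10) -> list:
    """Reset once, then step until the server reports done. Returns the rewards."""
    data = post_json("/reset", {"task_id": task_id})
    episode_id, observation = data["episode_id"], data["observation"]
    rewards = []
    for _ in range(max_steps):  # max_steps is a client-side safety cap (assumption)
        action = dict(policy(observation), episode_id=episode_id)
        result = post_json("/step", action)
        rewards.append(result["reward"])
        if result["done"]:
            break
        observation = result["observation"]
    return rewards
```

Against a local server this would be called as `run_episode(http_post_json("http://localhost:7860"), "sec_easy", my_policy)`, where `my_policy` maps an observation to an action dict. Passing the transport in as a callable keeps the loop testable without a running environment.
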
---
## 🏗️ Three Domains, Nine Tasks
### 🔒 Domain 1: MCP Security Auditing
Agents identify vulnerabilities in code snippets, propose secure fixes, and iteratively revise based on adversarial reviewer feedback.
| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `sec_easy` | 🟢 Easy | Classify a single vulnerability (type, CVSS, severity) |
| `sec_medium` | 🟡 Medium | Identify → propose a code fix |
| `sec_hard` | 🔴 Hard | Identify → fix → revise with adversarial reviewer feedback |
**Coverage:** SQL injection, XSS, IDOR, hardcoded secrets, missing auth, JWT misuse, path traversal, SSRF, XXE
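
To make the `sec_easy` action concrete, here is a hedged sketch of how a Schema → Domain → Consistency check could look for an `identify_vulnerability` action. It is not the code in `validator.py`: the function names and the `VULN_TYPES` spellings (other than `sql_injection`, which appears in the Quick Example) are assumptions. The consistency stage uses the standard CVSS v3.x qualitative bands.

```python
# Hypothetical vocabularies; the real validator's accepted values may differ.
VULN_TYPES = {"sql_injection", "xss", "idor", "hardcoded_secret", "missing_auth",
              "jwt_misuse", "path_traversal", "ssrf", "xxe"}
SEVERITIES = ("low", "medium", "high", "critical")


def cvss_band(score: float) -> str:
    """CVSS v3.x qualitative rating for a base score in [0.0, 10.0]."""
    if score == 0.0:
        return "none"
    if score < 4.0:
        return "low"
    if score < 7.0:
        return "medium"
    if score < 9.0:
        return "high"
    return "critical"


def validate_identify_action(action: dict) -> list[str]:
    """Return error hints; an empty list means the action passed all 3 stages."""
    # Stage 1 (schema): required fields exist and have the right type.
    required = {"vuln_type": str, "cvss_score": (int, float), "severity": str}
    errors = [f"missing or mistyped field: {name}"
              for name, typ in required.items()
              if not isinstance(action.get(name), typ)]
    if errors:
        return errors
    # Stage 2 (domain): values come from the known vocabulary and range.
    if action["vuln_type"] not in VULN_TYPES:
        errors.append(f"unknown vuln_type: {action['vuln_type']}")
    if not 0.0 <= action["cvss_score"] <= 10.0:
        errors.append("cvss_score must be between 0.0 and 10.0")
    if action["severity"] not in SEVERITIES:
        errors.append(f"severity must be one of {SEVERITIES}")
    if errors:
        return errors
    # Stage 3 (consistency): severity label agrees with the CVSS band.
    expected = cvss_band(action["cvss_score"])
    if expected != action["severity"]:
        errors.append(f"cvss_score {action['cvss_score']} implies severity '{expected}'")
    return errors
```

Running each stage only after the previous one passes keeps the hints short: an agent that sends a malformed payload sees schema errors first instead of a mix of all three kinds.
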
### 📦 Domain 2: PyTorch Migration Time-Machine
Agents detect deprecated APIs, resolve version conflicts using compatibility matrices, and fix `torch.compile` graph-break patterns in dependency order.
| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `dep_easy` | 🟢 Easy | Flag outdated packages and deprecated API usage |
| `dep_medium` | 🟡 Medium | Resolve version conflicts across package constraints |
| `dep_hard` | 🔴 Hard | Fix `torch.compile` graph-breaks in correct dependency order |
**Coverage:** Variable, cuda(), DataParallel, ONNX export, torch.compile, vmap, torch.export
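
As an illustration of what "flagging deprecated API usage" can mean in practice, the sketch below scans source code with the standard `ast` module. The watch list is an assumption built from three names in the Coverage line, the hints are illustrative, and `find_flagged_calls` is a hypothetical helper, not part of the environment or its grader.

```python
import ast

# Hypothetical watch list drawn from the Coverage line; hints are illustrative.
WATCHED_CALLS = {
    "Variable": "torch.autograd.Variable is deprecated; tensors track gradients directly",
    "cuda": "consider the device-agnostic .to(device) form",
    "DataParallel": "PyTorch docs recommend DistributedDataParallel instead",
}


def find_flagged_calls(source: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, call_name) pairs for watched calls in source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Attribute calls (x.cuda()) expose .attr; bare names (Variable(...)) expose .id.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in WATCHED_CALLS:
            findings.append((node.lineno, name))
    return sorted(findings)
```

Because the snippet is only parsed and never imported or executed, this runs without PyTorch installed. It matches on call names alone, so an unrelated method that happens to be called `cuda` would also be flagged.
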
### 🏥 Domain 3: Clinical Workflow Chaos Simulator
Agents detect missing steps in hospital workflows, rank them by clinical priority, and plan dependency-ordered recovery sequences.
| Task | Difficulty | What the Agent Does |
|------|-----------|---------------------|
| `cli_easy` | 🟢 Easy | Detect missing workflow steps and assess risk |
| `cli_medium` | 🟡 Medium | Detect gaps → rank by clinical priority |
| `cli_hard` | 🔴 Hard | Detect → rank → plan dependency-safe recovery |
**Coverage:** Surgery prep, ER triage, chemotherapy, cardiac emergency, blood transfusion, organ transplant, stroke code
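
A dependency-safe recovery plan that also respects clinical priority can be sketched as a topological sort with a priority queue (Kahn's algorithm with a heap). The function name, the step names in the usage note, and the "lower number means more urgent" convention are assumptions for illustration. Every prerequisite is assumed to be a key of `priority`.

```python
import heapq


def plan_recovery(priority: dict[str, int], requires: dict[str, set[str]]) -> list[str]:
    """Order steps so prerequisites come first; among ready steps, most urgent first.

    priority: step -> urgency rank (lower number = more urgent).
    requires: step -> set of steps that must be completed before it.
    """
    remaining = {step: set(requires.get(step, ())) for step in priority}
    dependents = {step: [] for step in priority}
    for step, prereqs in remaining.items():
        for prereq in prereqs:
            dependents[prereq].append(step)

    # Start with every step that has no unmet prerequisite.
    ready = [(priority[s], s) for s, prereqs in remaining.items() if not prereqs]
    heapq.heapify(ready)
    order = []
    while ready:
        _, step = heapq.heappop(ready)
        order.append(step)
        for nxt in dependents[step]:
            remaining[nxt].discard(step)
            if not remaining[nxt]:
                heapq.heappush(ready, (priority[nxt], nxt))
    if len(order) != len(priority):
        raise ValueError("dependency cycle among steps")
    return order
```

For example, with hypothetical transfusion steps `{"verify_identity": 1, "crossmatch": 2, "consent": 3, "start_transfusion": 1}`, where `crossmatch` requires `verify_identity` and `start_transfusion` requires both `crossmatch` and `consent`, the plan places `start_transfusion` last despite its high urgency, because its prerequisites must finish first.
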
---
## ⚡ Key Features
| Feature | Description |
|---------|-------------|
| 🎯 **Partial-Credit Scoring** | F1, NDCG, and weighted multi-component grading instead of binary pass/fail |
| 🔄 **Multi-Turn Episodes** | Agents iterate through identify → act → revise workflows |
| 🛡️ **3-Stage Validation** | Schema → Domain → Consistency checks with helpful error hints |
| 📊 **Score Breakdown** | Per-component feedback in every step so agents learn *what* to improve |
| 🏎️ **Fatal Error Handling** | Automatic 401/402/403 detection stops wasted API calls immediately |
| 🌐 **Universal LLM Support** | Works with any OpenAI-compatible model (Qwen, Llama, DeepSeek, Gemini, etc.) |
| 🐳 **Docker-Ready** | One-command deploy to Hugging Face Spaces |
| 📈 **GRPO-Compatible** | Smooth reward gradients designed for policy optimization training |
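
The partial-credit metrics named in the table can be sketched in a few lines. These are textbook definitions of set-based F1 and NDCG, shown only to illustrate why a partly correct answer earns a fractional reward. The function names and input shapes are assumptions, and the actual graders may weight or combine components differently.

```python
import math


def f1_score(predicted: set, expected: set) -> float:
    """Harmonic mean of precision and recall over two sets of labels."""
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked: list, relevance: dict) -> float:
    """Normalized discounted cumulative gain of a ranking.

    ranked: items in the agent's order (assumed to contain no duplicates).
    relevance: item -> ground-truth gain; unknown items count as 0.
    """
    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([relevance.get(item, 0.0) for item in ranked])
    ideal = dcg(sorted(relevance.values(), reverse=True))
    return actual / ideal if ideal > 0 else 0.0
```

An agent that finds one of two expected items and adds one wrong item gets an F1 of 0.5 instead of 0. Likewise, a ranking with the two most important items swapped still scores well above 0 under NDCG.
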
---
## 📡 API Reference
| Endpoint | Purpose | Details |
|----------|---------|---------|
| `GET /` | Health check | Returns status and available tasks |
| `POST /reset` | Start episode | `{"task_id": "sec_easy"}` → `{episode_id, observation}` |
| `POST /step` | Submit action | `{episode_id, action_type, ...}` → `{reward, done, observation}` |
| `GET /state` | Query state | `?episode_id=xxx` → current episode info |
| `GET /debug` | Debug panel | Interactive HTML benchmark runner |
| `GET /web` | Gradio UI | Full task browser with run history |
### Quick Example
```python
import requests

# 1. Start an episode
resp = requests.post("http://localhost:7860/reset", json={"task_id": "sec_easy"})
data = resp.json()
episode_id = data["episode_id"]
observation = data["observation"]
print(observation["task_description"])
# → "Identify the SQL injection vulnerability in this code snippet."

# 2. Send an action
action = {
    "episode_id": episode_id,
    "action_type": "identify_vulnerability",
    "vuln_type": "sql_injection",
    "cvss_score": 9.1,
    "severity": "critical",
    "affected_line": 3,
}
result = requests.post("http://localhost:7860/step", json=action).json()
print(f"Reward: {result['reward']}, Done: {result['done']}")
# → Reward: 0.85, Done: True
```
---
## 🚀 Getting Started
### Run Locally
```bash
# Install dependencies
pip install fastapi uvicorn openai requests packaging gradio python-dotenv
# Start the environment
uvicorn server.app:app --host 0.0.0.0 --port 7860
```
### Run with Docker
```bash
docker build -t entropyenv .
docker run -p 7860:7860 entropyenv
```
### Run the Baseline Agent
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
export HF_TOKEN="your_token_here"
export ENV_URL="http://localhost:7860"
python inference.py
```
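
Because the baseline agent calls a remote LLM endpoint, an invalid token or exhausted credits would fail on every task. The fatal-error short-circuit listed under Key Features can be sketched as follows. This is an assumption about the general idea, not the logic in `inference.py`: `run_until_fatal` and the `call_model` callable are hypothetical.

```python
from typing import Callable, Iterable, Optional

# Status codes named in the feature table: 401 Unauthorized,
# 402 Payment Required, 403 Forbidden. Retrying will not fix any of them.
FATAL_STATUS_CODES = frozenset({401, 402, 403})


def run_until_fatal(
    task_ids: Iterable[str],
    call_model: Callable[[str], tuple[int, float]],
) -> tuple[dict[str, float], Optional[int]]:
    """Run tasks in order and stop at the first fatal HTTP status.

    call_model(task_id) -> (status_code, score) is a hypothetical callable
    standing in for one LLM-backed episode.
    Returns (scores_so_far, fatal_status_or_None).
    """
    scores = {}
    for task_id in task_ids:
        status, score = call_model(task_id)
        if status in FATAL_STATUS_CODES:
            # Stop here so the remaining tasks do not burn further API calls.
            return scores, status
        scores[task_id] = score
    return scores, None
```

Returning the partial scores together with the fatal status lets a caller report what was completed before the credentials or quota failed.
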
### Deploy to Hugging Face Spaces
```bash
huggingface-cli login
openenv push --repo-id <username>/EntropyEnv
```
---
## 🏛️ Project Structure
```
entropyenv/
├── inference.py              # Baseline agent with smart prompt engineering
├── openenv.yaml              # OpenEnv manifest (9 tasks)
├── pyproject.toml            # Package configuration
├── Dockerfile                # Multi-stage Docker build
├── server/
│   ├── app.py                # FastAPI server with session management
│   ├── router.py             # Task dispatcher with Counter-based sequence checking
│   ├── session.py            # Episode state management
│   ├── web_ui.py             # Gradio UI with performance dashboard
│   ├── demo_agent.py         # Rule-based demo agent
│   ├── benchmark_store.py    # Persistent results storage
│   ├── debug_panel.html      # Interactive debug interface
│   ├── validation/
│   │   └── validator.py      # 3-stage validation with type-casting
│   ├── graders/
│   │   ├── base_grader.py        # Universal reward pipeline
│   │   ├── security_grader.py    # Security domain grader
│   │   ├── dependency_grader.py  # Dependency domain grader
│   │   └── clinical_grader.py    # Clinical domain grader
│   └── datasets/
│       ├── security_cases.py     # 13 ground-truth security cases
│       ├── dependency_cases.py   # 13 ground-truth dependency cases
│       └── clinical_cases.py     # 13 ground-truth clinical cases
└── results/
    └── run_history.json      # Benchmark history (auto-created)
```
---
## 📈 Baseline Performance
> **Note:** Scores below are from the latest grading revision (v3: weighted 0.60×max + 0.40×mean scoring, `difficulty_multiplier` removed, `dep_hard` done-condition fixed). Re-benchmarking across 14+ models is in progress.
| Model | Provider | sec_easy | sec_med | sec_hard | dep_easy | dep_med | dep_hard | cli_easy | cli_med | cli_hard | **Avg** |
|-------|----------|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:--------:|:-------:|:--------:|:-------:|
| *(Run `python unnecessary/run_14_models.py` to auto-populate this table)* | | | | | | | | | | | |
**Scoring formula:** `score = 0.60 × max(step_rewards) + 0.40 × mean(step_rewards)`, clamped to `[0.01, 0.99]`
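
The formula above translates directly into code. A minimal sketch follows. The function name is hypothetical, and returning the lower bound for an episode with no steps is an assumption, since the formula does not define that case.

```python
def episode_score(step_rewards: list[float]) -> float:
    """Blend the best step with the episode average, then clamp to [0.01, 0.99]."""
    if not step_rewards:
        return 0.01  # assumption: an empty episode gets the lower bound
    mean_reward = sum(step_rewards) / len(step_rewards)
    blended = 0.60 * max(step_rewards) + 0.40 * mean_reward
    return min(0.99, max(0.01, blended))
```

For step rewards `[0.85, 0.92]` the blend is `0.60 × 0.92 + 0.40 × 0.885 = 0.906`. A single strong step followed by two zero-reward steps, `[0.9, 0.0, 0.0]`, scores `0.66`, while three consistent `0.9` steps score `0.9`, which is how the mean term penalizes one-off flukes.
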
**Design principles:**
- 🎯 **No artificial difficulty caps**: scores reflect actual grader correctness
- 📊 **Weighted blend**: rewards consistently good episodes over single-lucky-step flukes
- 🔬 **Spec-compliant**: `[END]` lines follow the mandatory 3-line log format rules
- 🧠 **14+ model families tested** for universal compatibility
---
## 📝 Inference Log Format
The baseline `inference.py` emits structured logs matching the OpenEnv spec:
```
[START] task=sec_easy env=EntropyEnv model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=identify_vulnerability reward=0.85 done=false error=null
[STEP] step=2 action=propose_fix reward=0.92 done=true error=null
[END] success=true steps=2 score=0.91 rewards=0.85,0.92
```
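
Each log line is a bracketed tag followed by space-separated `key=value` pairs, so it can be parsed with a short helper. The sketch below is a hypothetical consumer of this format, not part of the repository. It assumes values never contain whitespace, which holds for the sample lines shown here.

```python
import re

# Matches "[START] ...", "[STEP] ...", or "[END] ..." and captures the rest.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split one log line into its tag and a dict of raw string fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an inference log line: {line!r}")
    kind, rest = match.groups()
    # Split on the first "=" only, so values may themselves contain "=".
    fields = dict(pair.split("=", 1) for pair in rest.split())
    return kind, fields
```

Values are left as strings on purpose. A caller can convert what it needs, for example `[float(r) for r in fields["rewards"].split(",")]` to recover the per-step rewards from an `[END]` line.
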
---
## 🤝 Built With
- **[FastAPI](https://fastapi.tiangolo.com/)**: High-performance async API framework
- **[Gradio](https://gradio.app/)**: Interactive web UI for testing and visualization
- **[PyTorch](https://pytorch.org/)**: Domain expertise for migration tasks
- **[OpenEnv](https://huggingface.co/docs/openenv)**: Standardized RL environment specification
---
<p align="center">
<b>Built with ❤️ for the Scaler × Meta × PyTorch × Hugging Face OpenEnv Hackathon 2026</b>
</p>