---
title: SecureReview
emoji: 🛡
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
pinned: true
license: mit
tags:
  - openenv
  - security
  - code-review
  - agent
  - evaluation
  - rl
short_description: The agent review benchmark for the age of AI.
---

# SecureReview

### *Security review, for the age of AI.*

**The first evaluation harness that holds AI agents to the bar of a senior engineer at code review.**

*Three domains. **76** hand-crafted scenarios. **430** production-grade vulnerabilities.*

*Built for the **Meta × Hugging Face OpenEnv Hackathon** · India 2026, by **~The Cook House**.*

[![OpenEnv](https://img.shields.io/badge/OpenEnv-v1.0-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](https://github.com/meta-pytorch/OpenEnv)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Live-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](https://huggingface.co/spaces/sam25kat/securereview)
[![Python](https://img.shields.io/badge/Python-3.10+-0a0a0a?style=for-the-badge&logo=python&logoColor=white&labelColor=0a0a0a)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](LICENSE)

[**Live Environment**](https://sam25kat-securereview.hf.space) · [**API Docs**](https://sam25kat-securereview.hf.space/docs) · [**Hugging Face Space**](https://huggingface.co/spaces/sam25kat/securereview)

---

## Thesis

> **AI now authors a generation of production code. Review is the bottleneck, not authorship.**
>
> An agent that cannot review code at the level of a senior engineer cannot be trusted to write it. SecureReview is the benchmark that holds agents to that bar.

Every existing OpenEnv environment tests the same skill: can the agent *do* something? Play a game, navigate a grid, call a tool, write an answer.

None of them test the skill that matters most in a world of AI-generated code: **can the agent read what's already there, and spot what will break production?**

This is the category SecureReview opens.
## The three domains

SecureReview is grounded in three categories of real-world incidents that have cost companies billions. Each maps cleanly to a concrete failure mode that human reviewers catch, and that AI-generated code regularly ships anyway.

| | Domain | Real-world precedent |
|---|--------|---------------------|
| **I** | Supply chain compromise | `SolarWinds` · `event-stream` · `ua-parser-js` |
| **II** | Cloud misconfiguration | `Capital One` · every public S3 bucket post-mortem |
| **III** | Unsafe database migrations | `GitHub outages` · `Slack incidents` · every AWS RCA |

An agent that scores well on SecureReview is an agent you could actually let touch production code.
## The benchmark
### I. Dependency & Supply Chain Security

Identify typosquatted packages, hallucinated imports that do not exist on PyPI, and pinned versions with active CVEs. Tests the baseline of supply-chain literacy every reviewer should have.

`requirements.txt` · `package.json`

**24 scenarios · 120 findings · 15 steps**

##### Easy

### II. Infrastructure-as-Code Misconfiguration Detection

Catch CIS-benchmark violations in Terraform and Kubernetes: public buckets, wildcard IAM, missing encryption, privileged containers, cross-account trust. Tests multi-file cloud security reasoning.

Terraform `.tf` · Kubernetes YAML

**24 scenarios · 155 findings · 25 steps**

##### Medium

### III. Database Migration Safety Analysis

Reason about SQL migrations against live production context: table sizes, write throughput, deployment strategy, downstream services. Tests the hardest form of review: **judgment**.

Schema · migrations · app code

**28 scenarios · 155 findings · 35 steps**

##### Hard

## Why it is different

| | Typical OpenEnv environment | SecureReview |
|---|---|---|
| **Task** | Game, toy, synthetic | Real production artifact |
| **Skill tested** | Acting in the world | Reading the world |
| **Ground truth** | Game rules | Senior-engineer judgment |
| **Reward** | Game score | Deterministic F1 over planted vulnerabilities |
| **Transfer** | To more games | To shipping code in production |
## Architecture

```
┌─────────────────┐         HTTP         ┌──────────────────────┐
│                 │  ◄────────────────►  │                      │
│   Your Agent    │     reset / step     │    FastAPI Server    │
│  (OpenAI SDK)   │        state         │    (Docker · HF)     │
│                 │                      │                      │
└─────────────────┘                      └──────────┬───────────┘
                                                    │
                                         ┌──────────┴───────────┐
                                         │                      │
                                         ▼                      ▼
                                ┌─────────────────┐    ┌──────────────────┐
                                │  Task Registry  │    │  Deterministic   │
                                │  76 scenarios   │    │    F1 Grader     │
                                │  430 findings   │    │ (task-specific)  │
                                └─────────────────┘    └──────────────────┘
```

Every scenario is a closed world. Every grader is deterministic. Every score is reproducible. No LLM-as-judge. No fuzzy matching that can be gamed.
## Action space

Four primitives. Enough to support partial-information reasoning without drowning the agent in tool choice.

```python
class Action:
    action_type: Literal[
        "report_finding",     # submit a security finding
        "request_context",    # load another file into the review context
        "request_file_list",  # discover available files
        "mark_complete",      # end the episode and trigger grading
    ]
    finding: Optional[Finding]  # required for report_finding
    filename: Optional[str]     # required for request_context
```

Every `Finding` is a typed record: `file`, `line`, `rule_id`, `severity`, `description`. The agent reports as many as its step budget allows.
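
To make the shape of an action concrete, here is a minimal sketch built on standard-library dataclasses. The project describes its real types as Pydantic models in `app/models.py`; the dataclasses, the `to_payload` helper, and its validation checks are illustrative assumptions rather than the project's own code. Only the field names and the `{"action": ...}` envelope are taken from this README.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Finding:
    """Typed record carrying the five fields a finding is documented to have."""
    file: str
    line: int
    rule_id: str
    severity: str
    description: str


@dataclass
class Action:
    action_type: str
    finding: Optional[Finding] = None
    filename: Optional[str] = None

    def to_payload(self) -> dict:
        """Hypothetical helper: check required fields, then build a /step body."""
        if self.action_type == "report_finding" and self.finding is None:
            raise ValueError("report_finding requires a finding")
        if self.action_type == "request_context" and self.filename is None:
            raise ValueError("request_context requires a filename")
        body: dict = {"action_type": self.action_type}
        if self.finding is not None:
            body["finding"] = asdict(self.finding)
        if self.filename is not None:
            body["filename"] = self.filename
        return {"action": body}
```

Checking the required fields on the client side is a design choice in this sketch: it surfaces a malformed action before a step of the budget is spent on it.
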
## Reward

```
score = F1(precision, recall) × 0.83
      + severity_bonus          (≤ 0.10)
      + efficiency_bonus        (≤ 0.05)
      + participation_bonus     (= 0.01)
      − false_positive_penalty  (≤ 0.20)
```

Clamped strictly to the open interval `(0.01, 0.99)`. Deterministic and reproducible.

#### Matching strategy

| Task | Primary match | Fallback |
|------|---------------|----------|
| `dependency_review` | Package name in description | Line number |
| `iac_review` | `(resource_id, rule_category)` | File + category |
| `migration_review` | `(operation, target_object)` | Line + rule_id |
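
The reward formula can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the grader in `app/graders/base.py`: this README does not say how the bonuses and the penalty are derived, so they are taken here as plain inputs capped at the documented limits, and the function names are hypothetical. The final clamp uses the closed bounds `0.01` and `0.99` as an approximation of the open interval.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall over matched findings."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def combine_score(
    f1: float,
    severity_bonus: float = 0.0,
    efficiency_bonus: float = 0.0,
    false_positive_penalty: float = 0.0,
) -> float:
    """Combine the documented terms; each input is capped at its stated limit."""
    raw = (
        f1 * 0.83
        + min(severity_bonus, 0.10)
        + min(efficiency_bonus, 0.05)
        + 0.01  # participation bonus
        - min(false_positive_penalty, 0.20)
    )
    # Assumption: clamp onto the bounds of the documented (0.01, 0.99) range.
    return min(max(raw, 0.01), 0.99)
```

With every bonus at its cap and no penalty, a perfect F1 gives 0.83 + 0.10 + 0.05 + 0.01 = 0.99, which is the top of the documented range.
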
## Quick start

#### Against the hosted environment

```python
import requests

ENV = "https://sam25kat-securereview.hf.space"

# Start an episode
r = requests.post(f"{ENV}/reset", json={"task_id": "dependency_review"})
observation = r.json()["observation"]

# Report a finding
action = {
    "action_type": "report_finding",
    "finding": {
        "file": "requirements.txt",
        "line": 2,
        "rule_id": "DEP-002",
        "severity": "critical",
        "description": "Typosquat: 'reqeusts' is a misspelling of 'requests'",
    },
}
requests.post(f"{ENV}/step", json={"action": action})

# End the episode and receive the final score
r = requests.post(f"{ENV}/step", json={"action": {"action_type": "mark_complete"}})
print(f"score = {r.json()['reward']}")
```

#### Run the baseline agent

```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="deepseek-ai/DeepSeek-V3-0324"
export HF_TOKEN="hf_..."
export ENV_URL="https://sam25kat-securereview.hf.space"

python inference.py
```

#### Run locally with Docker

```bash
docker build -t securereview .
docker run -p 7860:7860 securereview
```
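
The hosted-environment calls above follow a fixed order: reset, zero or more findings, then `mark_complete`. A minimal sketch of that sequence as a reusable function is shown below. The name `run_episode` and the injected `post` callable are assumptions introduced for illustration; only the endpoint paths, the request bodies, and the `reward` key come from the example above.

```python
from typing import Callable, Iterable

# Hypothetical transport: takes (path, json_body), returns the decoded JSON reply.
Post = Callable[[str, dict], dict]


def run_episode(task_id: str, findings: Iterable[dict], post: Post) -> float:
    """Reset, report each finding, then end the episode and return the score."""
    post("/reset", {"task_id": task_id})
    for finding in findings:
        action = {"action_type": "report_finding", "finding": finding}
        post("/step", {"action": action})
    final = post("/step", {"action": {"action_type": "mark_complete"}})
    return final["reward"]
```

Passing the transport in keeps the sequence testable without network access. Against a live server, `post` could be as small as `lambda path, body: requests.post(ENV + path, json=body).json()`.
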
## Interface

| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | Landing page |
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List available tasks |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action / observation / state JSON schemas |
| `GET` | `/state` | Current episode state |
| `GET` | `/docs` | OpenAPI interactive docs |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Execute an action |
| `POST` | `/mcp` | JSON-RPC 2.0 MCP endpoint |
## Baseline

Evaluated against the live Space with `deepseek-ai/DeepSeek-V3-0324` via the Hugging Face Inference Router.

| Task | Difficulty | Score |
|------|:----------:|:-----:|
| `dependency_review` | Easy | `0.45` |
| `iac_review` | Medium | `0.52` |
| `migration_review` | Hard | `0.05` |
| **Average** | | **`0.34`** |

Oracle reference (agent submitting ground-truth findings): **`0.98`**, which validates grader correctness.

The hard task is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.
## Training results

We trained models on the live environment using the **canonical industry-standard hybrid pipeline (SFT warmup → GRPO refinement)**, the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack. Same env, same evaluation harness, end-to-end against the live grader.

| Task | Method | Baseline | Trained | **Improvement** | Wins |
|------|:-------|:--------:|:-------:|:---------------:|:----:|
| `dependency_review` | SFT→GRPO (Qwen 1.5B, 24 scenarios, 3 epochs) | `0.083` | `0.385` | **+0.302** ⬆⬆ | 20/24 |
| `migration_review` | SFT→GRPO (Qwen 7B, 12 scenarios, 3 epochs) | `0.170` | `0.465` | **+0.295** ⬆⬆ | 10/12 |
| `iac_review` | SFT→GRPO (Qwen 1.5B, 13 scenarios, 3 epochs) | `0.177` | `0.303` | **+0.126** ⬆⬆ | 6/13 |

Average improvement across tasks: **~+0.24 mean reward**, with individual scenarios gaining as much as **+0.91**. Training took **under 30 seconds** per task on a single GPU (A10G / L40S / L4).

### Per-task before/after

**Dependency review**: `+0.302` mean lift across 24 scenarios:

![Dependency review: before vs after SFT](training_results/plots/dep/before_after.png)

**Migration review**: `+0.295` mean lift across 12 scenarios:

![Migration review: before vs after SFT](training_results/plots/migration/before_after.png)

**IaC review**: `+0.126` mean lift across 13 scenarios:

![IaC review: before vs after SFT](training_results/plots/iac/before_after.png)

The full story (per-scenario breakdowns, training loss curves, hyperparameter sweeps, scenario-curriculum design, and engineering tradeoffs) is in [training_results/RESULTS.md](training_results/RESULTS.md).

Reproducible training scripts are at [training_space/](training_space/) and the live trainer Spaces:

- [securereview-trainer](https://huggingface.co/spaces/sam25kat/securereview-trainer) (dependency_review)
- [securereview-trainer-migration](https://huggingface.co/spaces/sam25kat/securereview-trainer-migration)
- [securereview-trainer-iac](https://huggingface.co/spaces/sam25kat/securereview-trainer-iac)
## Blog & writeup

- **Mini-blog**: [BLOG.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/BLOG.md). Submission writeup with problem, env, training pipeline, and results. Lives as a separate MD file at the root of the HF Space, per hackathon submission guidance.
- **Mirror discussion**: [HF community thread](https://huggingface.co/spaces/sam25kat/securereview/discussions/1#69edb47ef08cea42f0f4df3a). Same content posted to the Space's Community tab for visibility.
- **Full results**: [training_results/RESULTS.md](training_results/RESULTS.md)
- **Complete scenario index** (all 76): [training_results/SCENARIOS.md](training_results/SCENARIOS.md). File inventory, severity distribution, categories, per-scenario before/after.
- **Plots**: [training_results/plots/](training_results/plots/). Committed PNGs for all three tasks (before/after + training loss).
- **Per-task summaries**: [dep](training_results/dep_sft_summary.md) · [migration](training_results/migration_sft_summary.md) · [iac](training_results/iac_sft_summary.md)
## Project structure

```
securereview/
├── app/
│   ├── main.py                  FastAPI endpoints
│   ├── landing.py               Premium HTML landing page
│   ├── environment.py           Episode state machine
│   ├── models.py                Pydantic types
│   ├── graders/
│   │   ├── base.py              F1 + severity + efficiency scoring
│   │   ├── dependency_grader.py
│   │   ├── iac_grader.py
│   │   └── migration_grader.py
│   └── tasks/
│       ├── task_registry.py     Scenario discovery
│       └── scenarios/           76 hand-crafted scenarios
│           ├── dependency/      24 scenarios
│           ├── iac/             24 scenarios
│           └── migration/       28 scenarios
│
├── server/
│   └── app.py                   OpenEnv multi-mode entry point
├── inference.py                 Baseline agent (OpenAI client)
├── openenv.yaml                 Environment manifest
├── pyproject.toml               Package definition
├── uv.lock                      Reproducible dependency lock
└── Dockerfile
```
## OpenEnv compliance

| Check | Status |
|-------|:------:|
| `openenv validate .` (local) | ✓ |
| `openenv validate --url` (runtime) | ✓ |
| Docker build | ✓ |
| Multi-mode deployment (`docker`, `uv_run`, `python_module`, `openenv_serve`) | ✓ |
| Hugging Face Space deploys | ✓ |
| `/health`, `/metadata`, `/schema`, `/mcp`, `/reset`, `/step`, `/state` | ✓ |
| Typed Pydantic action / observation / state | ✓ |
| Deterministic grader, strictly `(0, 1)` | ✓ |
| Baseline `inference.py` with `[START]/[STEP]/[END]` markers | ✓ |
## Team

**Team CookHouse**

Sai Jadhav · Sameer S Katte

Built for the [Meta PyTorch OpenEnv Hackathon](https://pytorch.org/event/openenv-ai-hackathon/), Round 1.
## License

MIT. See [LICENSE](LICENSE).

---
*An agent that cannot review code at the level of a senior engineer*
*cannot be trusted to write it.*

**SecureReview is the benchmark that holds it to that bar.**