securereview / README.md
sam25kat's picture
Repoint blog links to BLOG.md file (per hackathon guidance)
c0449da
---
title: SecureReview
emoji: πŸ›‘
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
pinned: true
license: mit
tags:
- openenv
- security
- code-review
- agent
- evaluation
- rl
short_description: The agent review benchmark for the age of AI.
---
<div align="center">
<br>
# SecureReview
### *Security review, for the age of AI.*
**The first evaluation harness that holds AI agents to the bar of a senior engineer at code review.**
*Three domains. **76** hand-crafted scenarios. **430** production-grade vulnerabilities.*
*Built for the **Meta Γ— Hugging Face OpenEnv Hackathon** Β· India 2026 β€” by **~The Cook House**.*
<br>
[![OpenEnv](https://img.shields.io/badge/OpenEnv-v1.0-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](https://github.com/meta-pytorch/OpenEnv)
[![Hugging Face](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Live-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](https://huggingface.co/spaces/sam25kat/securereview)
[![Python](https://img.shields.io/badge/Python-3.10+-0a0a0a?style=for-the-badge&logo=python&logoColor=white&labelColor=0a0a0a)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-0a0a0a?style=for-the-badge&labelColor=0a0a0a)](LICENSE)
<br>
[**Live Environment**](https://sam25kat-securereview.hf.space) Β· [**API Docs**](https://sam25kat-securereview.hf.space/docs) Β· [**Hugging Face Space**](https://huggingface.co/spaces/sam25kat/securereview)
<br>
</div>
---
## Thesis
> **AI now authors a generation of production code. Review is the bottleneck β€” not authorship.**
>
> An agent that cannot review code at the level of a senior engineer cannot be trusted to write it. SecureReview is the benchmark that holds agents to that bar.
Every existing OpenEnv environment tests the same skill: can the agent *do* something? Play a game, navigate a grid, call a tool, write an answer. None of them test the skill that matters most in a world of AI-generated code: **can the agent read what's already there, and spot what will break production?**
This is the category SecureReview opens.
<br>
## The three domains
SecureReview is grounded in three categories of real-world incidents that have cost companies billions. Each maps cleanly to a concrete failure mode that human reviewers catch β€” and that AI-generated code regularly ships anyway.
| | Domain | Real-world precedent |
|---|--------|---------------------|
| **I** | Supply chain compromise | `SolarWinds` Β· `event-stream` Β· `ua-parser-js` |
| **II** | Cloud misconfiguration | `Capital One` Β· every public S3 bucket post-mortem |
| **III** | Unsafe database migrations | `GitHub outages` Β· `Slack incidents` Β· every AWS RCA |
An agent that scores well on SecureReview is an agent you could actually let touch production code.
<br>
## The benchmark
<table>
<tr>
<td width="33%" valign="top">
### I. Dependency &amp; Supply Chain Security
Identify typosquatted packages, hallucinated imports that do not exist on PyPI, and pinned versions with active CVEs.
Tests the baseline of supply-chain literacy every reviewer should have.
`requirements.txt` Β· `package.json`
**24 scenarios Β· 120 findings Β· 15 steps**
##### Easy
</td>
<td width="33%" valign="top">
### II. Infrastructure-as-Code Misconfiguration Detection
Catch CIS-benchmark violations in Terraform and Kubernetes β€” public buckets, wildcard IAM, missing encryption, privileged containers, cross-account trust.
Tests multi-file cloud security reasoning.
Terraform `.tf` Β· Kubernetes YAML
**24 scenarios Β· 155 findings Β· 25 steps**
##### Medium
</td>
<td width="33%" valign="top">
### III. Database Migration Safety Analysis
Reason about SQL migrations against live production context β€” table sizes, write throughput, deployment strategy, downstream services.
Tests the hardest form of review: **judgment**.
Schema Β· migrations Β· app code
**28 scenarios Β· 155 findings Β· 35 steps**
##### Hard
</td>
</tr>
</table>
<br>
## Why it is different
| | Typical OpenEnv environment | SecureReview |
|---|---|---|
| **Task** | Game, toy, synthetic | Real production artifact |
| **Skill tested** | Acting in the world | Reading the world |
| **Ground truth** | Game rules | Senior-engineer judgment |
| **Reward** | Game score | Deterministic F1 over planted vulnerabilities |
| **Transfer** | To more games | To shipping code in production |
<br>
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” HTTP β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ ◄────────────────► β”‚ β”‚
β”‚ Your Agent β”‚ reset / step β”‚ FastAPI Server β”‚
β”‚ (OpenAI SDK) β”‚ state β”‚ (Docker Β· HF) β”‚
β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚
β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Task Registry β”‚ β”‚ Deterministic β”‚
β”‚ 76 scenarios β”‚ β”‚ F1 Grader β”‚
β”‚ 430 findings β”‚ β”‚ (task-specific) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
Every scenario is a closed world. Every grader is deterministic. Every score is reproducible. No LLM-as-judge. No fuzzy matching that can be gamed.
<br>
## Action space
Four primitives. Enough to support partial-information reasoning without drowning the agent in tool choice.
```python
class Action:
action_type: Literal[
"report_finding", # submit a security finding
"request_context", # load another file into the review context
"request_file_list", # discover available files
"mark_complete", # end the episode and trigger grading
]
finding: Optional[Finding] # required for report_finding
filename: Optional[str] # required for request_context
```
Every `Finding` is a typed record: `file`, `line`, `rule_id`, `severity`, `description`. The agent reports as many as its step budget allows.
<br>
## Reward
```
score = F1(precision, recall) Γ— 0.83
+ severity_bonus (≀ 0.10)
+ efficiency_bonus (≀ 0.05)
+ participation_bonus (= 0.01)
βˆ’ false_positive_penalty (≀ 0.20)
```
Clamped strictly to the open interval `(0.01, 0.99)`. Deterministic and reproducible.
#### Matching strategy
| Task | Primary match | Fallback |
|------|---------------|----------|
| `dependency_review` | Package name in description | Line number |
| `iac_review` | `(resource_id, rule_category)` | File + category |
| `migration_review` | `(operation, target_object)` | Line + rule_id |
<br>
## Quick start
#### Against the hosted environment
```python
import requests
ENV = "https://sam25kat-securereview.hf.space"
# Start an episode
r = requests.post(f"{ENV}/reset", json={"task_id": "dependency_review"})
observation = r.json()["observation"]
# Report a finding
action = {
"action_type": "report_finding",
"finding": {
"file": "requirements.txt",
"line": 2,
"rule_id": "DEP-002",
"severity": "critical",
"description": "Typosquat: 'reqeusts' is a misspelling of 'requests'",
},
}
requests.post(f"{ENV}/step", json={"action": action})
# End the episode and receive the final score
r = requests.post(f"{ENV}/step", json={"action": {"action_type": "mark_complete"}})
print(f"score = {r.json()['reward']}")
```
#### Run the baseline agent
```bash
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="deepseek-ai/DeepSeek-V3-0324"
export HF_TOKEN="hf_..."
export ENV_URL="https://sam25kat-securereview.hf.space"
python inference.py
```
#### Run locally with Docker
```bash
docker build -t securereview .
docker run -p 7860:7860 securereview
```
<br>
## Interface
| Method | Endpoint | Description |
|--------|----------|-------------|
| `GET` | `/` | Landing page |
| `GET` | `/health` | Health check |
| `GET` | `/tasks` | List available tasks |
| `GET` | `/metadata` | Environment metadata |
| `GET` | `/schema` | Action / observation / state JSON schemas |
| `GET` | `/state` | Current episode state |
| `GET` | `/docs` | OpenAPI interactive docs |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Execute an action |
| `POST` | `/mcp` | JSON-RPC 2.0 MCP endpoint |
<br>
## Baseline
Evaluated against the live Space with `deepseek-ai/DeepSeek-V3-0324` via the Hugging Face Inference Router.
| Task | Difficulty | Score |
|------|:----------:|:-----:|
| `dependency_review` | Easy | `0.45` |
| `iac_review` | Medium | `0.52` |
| `migration_review` | Hard | `0.05` |
| **Average** | | **`0.34`** |
Oracle reference (agent submitting ground-truth findings): **`0.98`** β€” validates grader correctness.
The hard task is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.
<br>
## Training results
We trained models on the live environment using the **canonical industry-standard hybrid pipeline β€” SFT warmup β†’ GRPO refinement** β€” the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack. Same env, same evaluation harness, end-to-end against the live grader.
| Task | Method | Baseline | Trained | **Improvement** | Wins |
|------|:-------|:--------:|:-------:|:---------------:|:----:|
| `dependency_review` | SFTβ†’GRPO (Qwen 1.5B, 24 scenarios, 3 epochs) | `0.083` | `0.385` | **+0.302** ⬆⬆ | 20/24 |
| `migration_review` | SFTβ†’GRPO (Qwen 7B, 12 scenarios, 3 epochs) | `0.170` | `0.465` | **+0.295** ⬆⬆ | 10/12 |
| `iac_review` | SFTβ†’GRPO (Qwen 1.5B, 13 scenarios, 3 epochs) | `0.177` | `0.303` | **+0.126** ⬆⬆ | 6/13 |
Average improvement across tasks: **~+0.24 mean reward**, with individual scenarios gaining as much as **+0.91**. Training took **under 30 seconds** per task on a single GPU (A10G / L40S / L4).
### Per-task before/after
**Dependency review** β€” `+0.302` mean lift across 24 scenarios:
![Dependency review β€” before vs after SFT](training_results/plots/dep/before_after.png)
**Migration review** β€” `+0.295` mean lift across 12 scenarios:
![Migration review β€” before vs after SFT](training_results/plots/migration/before_after.png)
**IaC review** β€” `+0.126` mean lift across 13 scenarios:
![IaC review β€” before vs after SFT](training_results/plots/iac/before_after.png)
The full story β€” per-scenario breakdowns, training loss curves, hyperparameter sweeps, scenario-curriculum design, and engineering tradeoffs β€” is in [training_results/RESULTS.md](training_results/RESULTS.md).
Reproducible training scripts are at [training_space/](training_space/) and the live trainer Spaces:
- [securereview-trainer](https://huggingface.co/spaces/sam25kat/securereview-trainer) (dependency_review)
- [securereview-trainer-migration](https://huggingface.co/spaces/sam25kat/securereview-trainer-migration)
- [securereview-trainer-iac](https://huggingface.co/spaces/sam25kat/securereview-trainer-iac)
<br>
## Blog & writeup
- **Mini-blog**: [BLOG.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/BLOG.md) β€” submission writeup with problem, env, training pipeline, and results. Lives as a separate MD file at the root of the HF Space, per hackathon submission guidance.
- **Mirror discussion**: [HF community thread](https://huggingface.co/spaces/sam25kat/securereview/discussions/1#69edb47ef08cea42f0f4df3a) β€” same content posted to the Space's Community tab for visibility.
- **Full results**: [training_results/RESULTS.md](training_results/RESULTS.md)
- **Complete scenario index** (all 76): [training_results/SCENARIOS.md](training_results/SCENARIOS.md) β€” file inventory, severity distribution, categories, per-scenario before/after.
- **Plots**: [training_results/plots/](training_results/plots/) β€” committed PNGs for all three tasks (before/after + training loss).
- **Per-task summaries**:
[dep](training_results/dep_sft_summary.md) Β·
[migration](training_results/migration_sft_summary.md) Β·
[iac](training_results/iac_sft_summary.md)
<br>
## Project structure
```
securereview/
β”œβ”€β”€ app/
β”‚ β”œβ”€β”€ main.py FastAPI endpoints
β”‚ β”œβ”€β”€ landing.py Premium HTML landing page
β”‚ β”œβ”€β”€ environment.py Episode state machine
β”‚ β”œβ”€β”€ models.py Pydantic types
β”‚ β”œβ”€β”€ graders/
β”‚ β”‚ β”œβ”€β”€ base.py F1 + severity + efficiency scoring
β”‚ β”‚ β”œβ”€β”€ dependency_grader.py
β”‚ β”‚ β”œβ”€β”€ iac_grader.py
β”‚ β”‚ └── migration_grader.py
β”‚ └── tasks/
β”‚ β”œβ”€β”€ task_registry.py Scenario discovery
β”‚ └── scenarios/ 76 hand-crafted scenarios
β”‚ β”œβ”€β”€ dependency/ 24 scenarios
β”‚ β”œβ”€β”€ iac/ 24 scenarios
β”‚ └── migration/ 28 scenarios
β”‚
β”œβ”€β”€ server/
β”‚ └── app.py OpenEnv multi-mode entry point
β”œβ”€β”€ inference.py Baseline agent (OpenAI client)
β”œβ”€β”€ openenv.yaml Environment manifest
β”œβ”€β”€ pyproject.toml Package definition
β”œβ”€β”€ uv.lock Reproducible dependency lock
└── Dockerfile
```
<br>
## OpenEnv compliance
| Check | Status |
|-------|:------:|
| `openenv validate .` (local) | βœ“ |
| `openenv validate --url` (runtime) | βœ“ |
| Docker build | βœ“ |
| Multi-mode deployment (`docker`, `uv_run`, `python_module`, `openenv_serve`) | βœ“ |
| Hugging Face Space deploys | βœ“ |
| `/health`, `/metadata`, `/schema`, `/mcp`, `/reset`, `/step`, `/state` | βœ“ |
| Typed Pydantic action / observation / state | βœ“ |
| Deterministic grader, strictly `(0, 1)` | βœ“ |
| Baseline `inference.py` with `[START]/[STEP]/[END]` markers | βœ“ |
<br>
## Team
**Team CookHouse**
Sai Jadhav Β· Sameer S Katte
Built for the [Meta PyTorch OpenEnv Hackathon](https://pytorch.org/event/openenv-ai-hackathon/), Round 1.
<br>
## License
MIT β€” see [LICENSE](LICENSE).
<br>
---
<div align="center">
*An agent that cannot review code at the level of a senior engineer*
*cannot be trusted to write it.*
**SecureReview is the benchmark that holds it to that bar.**
<br>
</div>