title: SecureReview
emoji: π‘
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
pinned: true
license: mit
tags:
- openenv
- security
- code-review
- agent
- evaluation
- rl
short_description: The agent review benchmark for the age of AI.
SecureReview
Security review, for the age of AI.
The first evaluation harness that holds AI agents to the bar of a senior engineer at code review. Three domains. 76 hand-crafted scenarios. 430 production-grade vulnerabilities.
Built for the Meta × Hugging Face OpenEnv Hackathon · India 2026, by ~The Cook House.
Live Environment · API Docs · Hugging Face Space
Thesis
AI now authors a generation of production code. Review is the bottleneck, not authorship.
An agent that cannot review code at the level of a senior engineer cannot be trusted to write it. SecureReview is the benchmark that holds agents to that bar.
Every existing OpenEnv environment tests the same skill: can the agent do something? Play a game, navigate a grid, call a tool, write an answer. None of them test the skill that matters most in a world of AI-generated code: can the agent read what's already there, and spot what will break production?
This is the category SecureReview opens.
The three domains
SecureReview is grounded in three categories of real-world incidents that have cost companies billions. Each maps cleanly to a concrete failure mode that human reviewers catch and that AI-generated code regularly ships anyway.
| | Domain | Real-world precedent |
|---|---|---|
| I | Supply chain compromise | SolarWinds · event-stream · ua-parser-js |
| II | Cloud misconfiguration | Capital One · every public S3 bucket post-mortem |
| III | Unsafe database migrations | GitHub outages · Slack incidents · every AWS RCA |
An agent that scores well on SecureReview is an agent you could actually let touch production code.
The benchmark
Why it is different
| | Typical OpenEnv environment | SecureReview |
|---|---|---|
| Task | Game, toy, synthetic | Real production artifact |
| Skill tested | Acting in the world | Reading the world |
| Ground truth | Game rules | Senior-engineer judgment |
| Reward | Game score | Deterministic F1 over planted vulnerabilities |
| Transfer | To more games | To shipping code in production |
Architecture
┌─────────────────┐        HTTP         ┌──────────────────────┐
│                 │ ──────────────────► │                      │
│   Your Agent    │    reset / step     │    FastAPI Server    │
│  (OpenAI SDK)   │        state        │    (Docker · HF)     │
│                 │                     │                      │
└─────────────────┘                     └──────────┬───────────┘
                                                   │
                                        ┌──────────┴───────────┐
                                        │                      │
                                        ▼                      ▼
                               ┌─────────────────┐    ┌──────────────────┐
                               │  Task Registry  │    │  Deterministic   │
                               │  76 scenarios   │    │    F1 Grader     │
                               │  430 findings   │    │ (task-specific)  │
                               └─────────────────┘    └──────────────────┘
Every scenario is a closed world. Every grader is deterministic. Every score is reproducible. No LLM-as-judge. No fuzzy matching that can be gamed.
Action space
Four primitives. Enough to support partial-information reasoning without drowning the agent in tool choice.
class Action:
    action_type: Literal[
        "report_finding",      # submit a security finding
        "request_context",     # load another file into the review context
        "request_file_list",   # discover available files
        "mark_complete",       # end the episode and trigger grading
    ]
    finding: Optional[Finding]   # required for report_finding
    filename: Optional[str]      # required for request_context
Every Finding is a typed record: file, line, rule_id, severity, description. The agent reports as many as its step budget allows.
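The "required for" rules in the action schema can be checked on the client before a request is sent. The sketch below is a minimal illustration using stdlib dataclasses rather than the project's Pydantic models; the field types on `Finding` and the helper name `make_action` are assumptions, not part of the published API.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional

ACTION_TYPES = {"report_finding", "request_context", "request_file_list", "mark_complete"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical mirror of the Finding record; field types are assumed."""
    file: str
    line: int
    rule_id: str
    severity: str
    description: str


def make_action(action_type: str,
                finding: Optional[Finding] = None,
                filename: Optional[str] = None) -> Dict[str, Any]:
    """Build a JSON-ready action dict, enforcing the 'required for' rules."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if action_type == "report_finding" and finding is None:
        raise ValueError("report_finding requires a finding")
    if action_type == "request_context" and not filename:
        raise ValueError("request_context requires a filename")

    action: Dict[str, Any] = {"action_type": action_type}
    if finding is not None:
        action["finding"] = asdict(finding)  # nested dict, ready for json=...
    if filename is not None:
        action["filename"] = filename
    return action
```

For example, `make_action("request_context", filename="requirements.txt")` yields a dict that can be wrapped as `{"action": ...}` and posted to `/step`.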
Reward
score = F1(precision, recall) × 0.83
      + severity_bonus          (≤ 0.10)
      + efficiency_bonus        (≤ 0.05)
      + participation_bonus     (= 0.01)
      - false_positive_penalty  (≤ 0.20)
Clamped strictly to the open interval (0.01, 0.99). Deterministic and reproducible.
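To make the arithmetic concrete, here is a minimal sketch of the formula above. It is not the grader's code: how the severity and efficiency bonuses and the false-positive penalty are derived is not described here, so they are taken as inputs and only capped at their stated maxima, and the clamp is assumed to be inclusive at 0.01 and 0.99.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing was matched."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def final_score(true_positives: int,
                false_positives: int,
                false_negatives: int,
                severity_bonus: float = 0.0,
                efficiency_bonus: float = 0.0,
                false_positive_penalty: float = 0.0) -> float:
    """Combine the reward terms and clamp the result (bounds assumed inclusive)."""
    raw = (
        f1_score(true_positives, false_positives, false_negatives) * 0.83
        + min(severity_bonus, 0.10)          # capped at 0.10
        + min(efficiency_bonus, 0.05)        # capped at 0.05
        + 0.01                               # participation bonus
        - min(false_positive_penalty, 0.20)  # capped at 0.20
    )
    return max(0.01, min(0.99, raw))
```

With perfect precision and recall and both bonuses at their caps, the terms sum to 0.83 + 0.10 + 0.05 + 0.01 = 0.99, the upper bound of the clamp.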
Matching strategy
| Task | Primary match | Fallback |
|---|---|---|
| `dependency_review` | Package name in description | Line number |
| `iac_review` | `(resource_id, rule_category)` | File + category |
| `migration_review` | `(operation, target_object)` | Line + `rule_id` |
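The primary-then-fallback idea in the table can be sketched for the `dependency_review` row. This is an illustration only: the `PlantedVuln` record, the function name, and the case-insensitive substring test are assumptions, and the actual grader may tokenize or normalize differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PlantedVuln:
    """Hypothetical ground-truth entry; the real scenario schema may differ."""
    package: str
    line: int


def match_dependency_finding(description: str,
                             line: int,
                             planted: Iterable[PlantedVuln]) -> Optional[PlantedVuln]:
    """Return the planted vulnerability a finding matches, or None."""
    candidates = list(planted)
    text = description.lower()

    # Primary match: the package name appears in the finding's description.
    for vuln in candidates:
        if vuln.package.lower() in text:
            return vuln

    # Fallback: the finding points at the same line number.
    for vuln in candidates:
        if vuln.line == line:
            return vuln

    return None
```

Because both passes are plain comparisons over a fixed list, the same finding always resolves to the same planted entry, which is what keeps scoring reproducible.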
Quick start
Against the hosted environment
import requests

ENV = "https://sam25kat-securereview.hf.space"

# Start an episode
r = requests.post(f"{ENV}/reset", json={"task_id": "dependency_review"})
observation = r.json()["observation"]

# Report a finding
action = {
    "action_type": "report_finding",
    "finding": {
        "file": "requirements.txt",
        "line": 2,
        "rule_id": "DEP-002",
        "severity": "critical",
        "description": "Typosquat: 'reqeusts' is a misspelling of 'requests'",
    },
}
requests.post(f"{ENV}/step", json={"action": action})

# End the episode and receive the final score
r = requests.post(f"{ENV}/step", json={"action": {"action_type": "mark_complete"}})
print(f"score = {r.json()['reward']}")
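The same reset / step / mark_complete sequence can be wrapped in a small helper with a pluggable transport, which makes the episode flow testable without network access. This is a sketch under assumptions: `run_episode`, `http_post`, and `PostFn` are hypothetical names, and the response is assumed to carry `reward` at the top level, as in the snippet above.

```python
from typing import Any, Callable, Dict, List

# A transport takes an endpoint path and a JSON payload and returns the decoded response.
PostFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_episode(post: PostFn, task_id: str, findings: List[Dict[str, Any]]) -> float:
    """Reset the environment, report every finding, then trigger grading."""
    post("/reset", {"task_id": task_id})
    for finding in findings:
        post("/step", {"action": {"action_type": "report_finding", "finding": finding}})
    result = post("/step", {"action": {"action_type": "mark_complete"}})
    return float(result["reward"])


def http_post(base_url: str) -> PostFn:
    """Real transport built on requests; imported lazily so the sketch loads offline."""
    import requests

    def _post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        response = requests.post(f"{base_url}{path}", json=payload, timeout=30)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return response.json()

    return _post
```

Against the hosted Space this would be called as `run_episode(http_post("https://sam25kat-securereview.hf.space"), "dependency_review", findings)`.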
Run the baseline agent
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="deepseek-ai/DeepSeek-V3-0324"
export HF_TOKEN="hf_..."
export ENV_URL="https://sam25kat-securereview.hf.space"
python inference.py
Run locally with Docker
docker build -t securereview .
docker run -p 7860:7860 securereview
Interface
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Landing page |
| GET | `/health` | Health check |
| GET | `/tasks` | List available tasks |
| GET | `/metadata` | Environment metadata |
| GET | `/schema` | Action / observation / state JSON schemas |
| GET | `/state` | Current episode state |
| GET | `/docs` | OpenAPI interactive docs |
| POST | `/reset` | Start a new episode |
| POST | `/step` | Execute an action |
| POST | `/mcp` | JSON-RPC 2.0 MCP endpoint |
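For the `/mcp` row, a JSON-RPC 2.0 request is a JSON object with `jsonrpc`, `id`, `method`, and optional `params`. The sketch below only builds such an envelope. `tools/list` is a method name from the Model Context Protocol, but whether this server implements it is an assumption, and `jsonrpc_request` is a hypothetical helper.

```python
import itertools
import json
from typing import Any, Dict, Optional

_request_ids = itertools.count(1)  # JSON-RPC ids only need to be unique per client


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request object; 'params' is omitted when not given."""
    request: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
    }
    if params is not None:
        request["params"] = params
    return request


# Assumed method name; check the server's /docs or /schema for what it supports.
payload = jsonrpc_request("tools/list")
body = json.dumps(payload)  # string that would be sent as the POST body to /mcp
```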
Baseline
Evaluated against the live Space with deepseek-ai/DeepSeek-V3-0324 via the Hugging Face Inference Router.
| Task | Difficulty | Score |
|---|---|---|
| `dependency_review` | Easy | 0.45 |
| `iac_review` | Medium | 0.52 |
| `migration_review` | Hard | 0.05 |
| **Average** | | 0.34 |
Oracle reference (agent submitting ground-truth findings): 0.98, which validates grader correctness.
The hard task is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.
Training results
We trained models on the live environment using the canonical hybrid pipeline, SFT warmup followed by GRPO refinement (SFT → GRPO), the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack. Same env, same evaluation harness, end-to-end against the live grader.
| Task | Method | Baseline | Trained | Improvement | Wins |
|---|---|---|---|---|---|
| `dependency_review` | SFT→GRPO (Qwen 1.5B, 24 scenarios, 3 epochs) | 0.083 | **0.385** | +0.302 | 20/24 |
| `migration_review` | SFT→GRPO (Qwen 7B, 12 scenarios, 3 epochs) | 0.170 | **0.465** | +0.295 | 10/12 |
| `iac_review` | SFT→GRPO (Qwen 1.5B, 13 scenarios, 3 epochs) | 0.177 | **0.303** | +0.126 | 6/13 |
Average improvement across tasks: ~+0.24 mean reward, with individual scenarios gaining as much as +0.91. Training took under 30 seconds per task on a single GPU (A10G / L40S / L4).
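The headline figure can be rechecked from the table. The snippet below recomputes each task's improvement and the mean across tasks, assuming the average is an unweighted mean over the three tasks rather than over individual scenarios.

```python
# (baseline, trained) mean rewards per task, copied from the results table
results = {
    "dependency_review": (0.083, 0.385),
    "migration_review": (0.170, 0.465),
    "iac_review": (0.177, 0.303),
}

# Per-task improvement, rounded to the table's three decimals
improvements = {
    task: round(trained - baseline, 3)
    for task, (baseline, trained) in results.items()
}

# Unweighted mean across tasks: (0.302 + 0.295 + 0.126) / 3, about 0.241
mean_improvement = sum(improvements.values()) / len(improvements)
print(improvements, round(mean_improvement, 2))
```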
Per-task before/after
- Dependency review: +0.302 mean lift across 24 scenarios
- Migration review: +0.295 mean lift across 12 scenarios
- IaC review: +0.126 mean lift across 13 scenarios
The full story (per-scenario breakdowns, training loss curves, hyperparameter sweeps, scenario-curriculum design, and engineering tradeoffs) is in training_results/RESULTS.md.
Reproducible training scripts are in training_space/ and in the live trainer Spaces:
- securereview-trainer (dependency_review)
- securereview-trainer-migration
- securereview-trainer-iac
Blog & writeup
- Mini-blog: BLOG.md, the submission writeup covering the problem, env, training pipeline, and results. It lives as a separate MD file at the root of the HF Space, per hackathon submission guidance.
- Mirror discussion: HF community thread, the same content posted to the Space's Community tab for visibility.
- Full results: training_results/RESULTS.md
- Complete scenario index (all 76): training_results/SCENARIOS.md, covering file inventory, severity distribution, categories, and per-scenario before/after.
- Plots: training_results/plots/, with committed PNGs for all three tasks (before/after + training loss).
- Per-task summaries: dep · migration · iac
Project structure
securereview/
├── app/
│   ├── main.py                   FastAPI endpoints
│   ├── landing.py                Premium HTML landing page
│   ├── environment.py            Episode state machine
│   ├── models.py                 Pydantic types
│   ├── graders/
│   │   ├── base.py               F1 + severity + efficiency scoring
│   │   ├── dependency_grader.py
│   │   ├── iac_grader.py
│   │   └── migration_grader.py
│   └── tasks/
│       ├── task_registry.py      Scenario discovery
│       └── scenarios/            76 hand-crafted scenarios
│           ├── dependency/       24 scenarios
│           ├── iac/              24 scenarios
│           └── migration/        28 scenarios
│
├── server/
│   └── app.py                    OpenEnv multi-mode entry point
├── inference.py                  Baseline agent (OpenAI client)
├── openenv.yaml                  Environment manifest
├── pyproject.toml                Package definition
├── uv.lock                       Reproducible dependency lock
└── Dockerfile
OpenEnv compliance
| Check | Status |
|---|---|
| `openenv validate .` (local) | ✅ |
| `openenv validate --url` (runtime) | ✅ |
| Docker build | ✅ |
| Multi-mode deployment (`docker`, `uv_run`, `python_module`, `openenv_serve`) | ✅ |
| Hugging Face Space deploys | ✅ |
| `/health`, `/metadata`, `/schema`, `/mcp`, `/reset`, `/step`, `/state` | ✅ |
| Typed Pydantic action / observation / state | ✅ |
| Deterministic grader, strictly (0, 1) | ✅ |
| Baseline `inference.py` with `[START]`/`[STEP]`/`[END]` markers | ✅ |
Team
Team CookHouse: Sai Jadhav · Sameer S Katte
Built for the Meta PyTorch OpenEnv Hackathon, Round 1.
License
MIT. See LICENSE.
An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.
SecureReview is the benchmark that holds it to that bar.


