title: SecureReview
emoji: 🛑
colorFrom: gray
colorTo: indigo
sdk: docker
app_port: 7860
pinned: true
license: mit
tags:
  - openenv
  - security
  - code-review
  - agent
  - evaluation
  - rl
short_description: The agent review benchmark for the age of AI.

SecureReview

Security review, for the age of AI.

The first evaluation harness that holds AI agents to the bar of a senior engineer at code review. Three domains. 76 hand-crafted scenarios. 430 production-grade vulnerabilities.

Built for the Meta × Hugging Face OpenEnv Hackathon · India 2026, by ~The Cook House.




Live Environment · API Docs · Hugging Face Space



Thesis

AI now authors a generation of production code. Review is the bottleneck, not authorship.

An agent that cannot review code at the level of a senior engineer cannot be trusted to write it. SecureReview is the benchmark that holds agents to that bar.

Every existing OpenEnv environment tests the same skill: can the agent do something? Play a game, navigate a grid, call a tool, write an answer. None of them test the skill that matters most in a world of AI-generated code: can the agent read what's already there, and spot what will break production?

This is the category SecureReview opens.


The three domains

SecureReview is grounded in three categories of real-world incidents that have cost companies billions. Each maps cleanly to a concrete failure mode that human reviewers catch and that AI-generated code regularly ships anyway.

| | Domain | Real-world precedent |
|---|---|---|
| I | Supply chain compromise | SolarWinds · event-stream · ua-parser-js |
| II | Cloud misconfiguration | Capital One · every public S3 bucket post-mortem |
| III | Unsafe database migrations | GitHub outages · Slack incidents · every AWS RCA |

An agent that scores well on SecureReview is an agent you could actually let touch production code.


The benchmark

I. Dependency & Supply Chain Security

Identify typosquatted packages, hallucinated imports that do not exist on PyPI, and pinned versions with active CVEs.

Tests the baseline of supply-chain literacy every reviewer should have.

requirements.txt · package.json (24 scenarios · 120 findings · 15 steps)

Easy
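To make the typosquat case concrete, here is a minimal sketch of one way a reviewer (human or agent) might flag near-miss package names. It is not the benchmark's grader or scenario logic: the `KNOWN_PACKAGES` allowlist, the `find_typosquats` helper, and the 0.8 similarity cutoff are all illustrative assumptions.

```python
import difflib

# Assumption: a tiny allowlist standing in for a real index of popular packages.
KNOWN_PACKAGES = ["requests", "numpy", "pandas", "flask", "django", "urllib3"]


def find_typosquats(requirements_text, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return (line_number, declared_name, likely_target) for near-miss names."""
    suspects = []
    for line_no, raw in enumerate(requirements_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Keep only the package name, dropping version specifiers and extras.
        name = line
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "[", ";"):
            name = name.split(sep, 1)[0]
        name = name.strip().lower()
        if name in known:
            continue  # exact match with a known package is not a typosquat
        close = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
        if close:
            suspects.append((line_no, name, close[0]))
    return suspects


sample = "flask==2.3.2\nreqeusts==2.31.0\nnumpy>=1.26\n"
print(find_typosquats(sample))
```

String similarity alone misses hallucinated packages and vulnerable pins, which need a registry lookup and a CVE feed respectively, so this covers only the first of the three finding types.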

II. Infrastructure-as-Code Misconfiguration Detection

Catch CIS-benchmark violations in Terraform and Kubernetes: public buckets, wildcard IAM, missing encryption, privileged containers, cross-account trust.

Tests multi-file cloud security reasoning.

Terraform .tf · Kubernetes YAML (24 scenarios · 155 findings · 25 steps)

Medium
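As an illustration of the wildcard-IAM class of finding, the sketch below scans an IAM policy document for `Allow` statements that grant `*` or `service:*` actions. It assumes the policy has already been extracted from the Terraform source as a JSON string; the `wildcard_action_statements` helper is hypothetical and is not part of the SecureReview grader.

```python
import json


def wildcard_action_statements(policy_json):
    """Return indexes of Allow statements that grant '*' or 'service:*' actions."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]

    flagged = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            flagged.append(index)
    return flagged


policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::logs/*"},
        {"Effect": "Allow", "Action": ["*"], "Resource": "*"},
    ],
})
print(wildcard_action_statements(policy))
```

A real review would also weigh the `Resource` and `Principal` fields and follow references across files, which is where the multi-file reasoning comes in.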

III. Database Migration Safety Analysis

Reason about SQL migrations against live production context: table sizes, write throughput, deployment strategy, downstream services.

Tests the hardest form of review: judgment.

Schema · migrations · app code (28 scenarios · 155 findings · 35 steps)

Hard
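The sketch below shows why production context matters for this task: the same DDL statement is harmless on a small table and risky on a large one. It assumes PostgreSQL semantics (a plain `CREATE INDEX` blocks writes, `CREATE INDEX CONCURRENTLY` does not). The row-count threshold, the `risky_statements` helper, and the two rules it checks are illustrative assumptions, not the benchmark's rule set.

```python
import re

# Assumption: illustrative threshold above which a blocking index build is risky.
LARGE_TABLE_ROWS = 1_000_000


def risky_statements(migration_sql, table_rows):
    """Return (statement_number, reason) pairs given a {table: row_count} map."""
    findings = []
    # Naive split on ';' is enough for a sketch; it ignores semicolons in strings.
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    for number, stmt in enumerate(statements, start=1):
        upper = " ".join(stmt.upper().split())  # normalize case and whitespace

        # Rule 1: non-concurrent index build on a large table.
        if re.match(r"CREATE (UNIQUE )?INDEX\b", upper) and "CONCURRENTLY" not in upper:
            target = re.search(r"\bON (\w+)", upper)
            table = target.group(1).lower() if target else ""
            if table_rows.get(table, 0) >= LARGE_TABLE_ROWS:
                findings.append((number, f"blocking index build on large table '{table}'"))

        # Rule 2: NOT NULL column without a default on a populated table.
        alter = re.match(r"ALTER TABLE (\w+)", upper)
        if alter and "ADD COLUMN" in upper and "NOT NULL" in upper and "DEFAULT" not in upper:
            table = alter.group(1).lower()
            if table_rows.get(table, 0) > 0:
                findings.append((number, f"NOT NULL column without default on populated table '{table}'"))
    return findings


migration = """
CREATE INDEX idx_orders_user ON orders (user_id);
ALTER TABLE users ADD COLUMN tier TEXT NOT NULL;
CREATE INDEX CONCURRENTLY idx_events_ts ON events (created_at);
"""
context = {"orders": 50_000_000, "users": 1200, "events": 900_000_000}
for number, reason in risky_statements(migration, context):
    print(number, reason)
```

Pattern matching like this only covers the mechanical cases; write throughput, deployment strategy, and downstream services still call for judgment.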

Why it is different

| | Typical OpenEnv environment | SecureReview |
|---|---|---|
| Task | Game, toy, synthetic | Real production artifact |
| Skill tested | Acting in the world | Reading the world |
| Ground truth | Game rules | Senior-engineer judgment |
| Reward | Game score | Deterministic F1 over planted vulnerabilities |
| Transfer | To more games | To shipping code in production |

Architecture

 ┌─────────────────┐        HTTP        ┌──────────────────────┐
 │                 │ ◄────────────────► │                      │
 │   Your Agent    │   reset / step     │   FastAPI Server     │
 │  (OpenAI SDK)   │      state         │   (Docker · HF)      │
 │                 │                    │                      │
 └─────────────────┘                    └──────────┬───────────┘
                                                   │
                                        ┌──────────┴───────────┐
                                        │                      │
                                        ▼                      ▼
                               ┌─────────────────┐   ┌──────────────────┐
                               │ Task Registry   │   │ Deterministic    │
                               │ 76 scenarios    │   │ F1 Grader        │
                               │ 430 findings    │   │ (task-specific)  │
                               └─────────────────┘   └──────────────────┘

Every scenario is a closed world. Every grader is deterministic. Every score is reproducible. No LLM-as-judge. No fuzzy matching that can be gamed.


Action space

Four primitives. Enough to support partial-information reasoning without drowning the agent in tool choice.

class Action:
    action_type: Literal[
        "report_finding",       # submit a security finding
        "request_context",      # load another file into the review context
        "request_file_list",    # discover available files
        "mark_complete",        # end the episode and trigger grading
    ]
    finding:  Optional[Finding]   # required for report_finding
    filename: Optional[str]       # required for request_context

Every Finding is a typed record: file, line, rule_id, severity, description. The agent reports as many as its step budget allows.
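The "required for" rules in the action definition can be sketched with plain dataclasses. This is a standard-library stand-in for illustration only, not the project's Pydantic models: the field types on `Finding`, the `ACTION_TYPES` set, and the `to_payload` helper are assumptions, while the field names and the `{"action": ...}` request body come from this README.

```python
from dataclasses import asdict, dataclass
from typing import Optional

ACTION_TYPES = {"report_finding", "request_context", "request_file_list", "mark_complete"}


@dataclass
class Finding:
    file: str
    line: int
    rule_id: str
    severity: str
    description: str


@dataclass
class Action:
    action_type: str
    finding: Optional[Finding] = None
    filename: Optional[str] = None

    def __post_init__(self):
        # Enforce the per-action requirements listed in the action definition.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type == "report_finding" and self.finding is None:
            raise ValueError("report_finding requires a finding")
        if self.action_type == "request_context" and not self.filename:
            raise ValueError("request_context requires a filename")

    def to_payload(self):
        """Serialize to the {"action": {...}} request body, dropping unset fields."""
        body = {key: value for key, value in asdict(self).items() if value is not None}
        return {"action": body}


finding = Finding("requirements.txt", 2, "DEP-002", "critical",
                  "Typosquat: 'reqeusts' is a misspelling of 'requests'")
print(Action("report_finding", finding=finding).to_payload())
print(Action("mark_complete").to_payload())
```

Validating on the client side catches malformed actions before they consume a step from the episode budget.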


Reward

score  =  F1(precision, recall) × 0.83
       +  severity_bonus          (≤ 0.10)
       +  efficiency_bonus        (≤ 0.05)
       +  participation_bonus     (= 0.01)
       −  false_positive_penalty  (≤ 0.20)

Clamped to the range 0.01 to 0.99, so every score falls strictly inside (0, 1). Deterministic and reproducible.
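The formula can be traced with a short sketch. The weights and caps are the ones stated above; how the grader derives each bonus and the penalty is not specified here, so they are taken as inputs. The function names and the plain max/min clamp onto 0.01 and 0.99 are assumptions.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing matched."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def episode_score(tp, fp, fn, severity_bonus=0.0, efficiency_bonus=0.0,
                  false_positive_penalty=0.0):
    """Combine F1 with the capped bonuses and penalty, then clamp the result."""
    raw = (
        f1_score(tp, fp, fn) * 0.83
        + min(severity_bonus, 0.10)
        + min(efficiency_bonus, 0.05)
        + 0.01  # participation bonus
        - min(false_positive_penalty, 0.20)
    )
    return max(0.01, min(0.99, raw))


# 3 matched findings, 1 false positive, 1 missed: precision = recall = F1 = 0.75
print(round(episode_score(tp=3, fp=1, fn=1), 4))
```

With no bonuses or penalty, that example works out to 0.75 × 0.83 + 0.01 = 0.6325. A perfect run with both bonuses at their caps reaches 0.83 + 0.10 + 0.05 + 0.01 = 0.99, the top of the range.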

Matching strategy

| Task | Primary match | Fallback |
|---|---|---|
| dependency_review | Package name in description | Line number |
| iac_review | (resource_id, rule_category) | File + category |
| migration_review | (operation, target_object) | Line + rule_id |
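A primary-then-fallback match for `dependency_review` might look like the sketch below. The ground-truth record shape (`package`, `file`, `line` keys) and the `match_dependency_finding` helper are hypothetical; the grader's actual data structures and tie-breaking rules are not described in this README.

```python
def match_dependency_finding(reported, ground_truth):
    """Return the index of the ground-truth entry matched by `reported`, or None."""
    description = reported["description"].lower()

    # Primary: the planted package name appears in the finding's description.
    for index, truth in enumerate(ground_truth):
        if truth["package"].lower() in description:
            return index

    # Fallback: the finding points at the same file and line number.
    for index, truth in enumerate(ground_truth):
        if truth["file"] == reported["file"] and truth["line"] == reported["line"]:
            return index

    return None


ground_truth = [
    {"package": "reqeusts", "file": "requirements.txt", "line": 2},
    {"package": "pyyaml", "file": "requirements.txt", "line": 5},
]
by_name = {"file": "requirements.txt", "line": 9,
           "description": "Typosquat: 'reqeusts' is a misspelling of 'requests'"}
by_line = {"file": "requirements.txt", "line": 5,
           "description": "Pinned version has a known CVE"}
print(match_dependency_finding(by_name, ground_truth))
print(match_dependency_finding(by_line, ground_truth))
```

Because both checks are exact comparisons, the same submission always matches the same ground-truth entry, which is what keeps the score reproducible.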

Quick start

Against the hosted environment

import requests

ENV = "https://sam25kat-securereview.hf.space"

# Start an episode
r = requests.post(f"{ENV}/reset", json={"task_id": "dependency_review"})
observation = r.json()["observation"]

# Report a finding
action = {
    "action_type": "report_finding",
    "finding": {
        "file": "requirements.txt",
        "line": 2,
        "rule_id": "DEP-002",
        "severity": "critical",
        "description": "Typosquat: 'reqeusts' is a misspelling of 'requests'",
    },
}
requests.post(f"{ENV}/step", json={"action": action})

# End the episode and receive the final score
r = requests.post(f"{ENV}/step", json={"action": {"action_type": "mark_complete"}})
print(f"score = {r.json()['reward']}")
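The same flow can be wrapped in a reusable loop. The sketch below uses only the standard library and takes the transport as a parameter, so it can be exercised against a fake before pointing it at the live Space. The `run_episode` and `http_post` names are illustrative; the request and response shapes are the ones used in the snippet above.

```python
import json
import urllib.request

ENV = "https://sam25kat-securereview.hf.space"


def http_post(path, body, env=ENV):
    """POST a JSON body to the environment and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{env}{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(post, task_id, findings):
    """Reset, report each finding, then mark complete.

    `post(path, body)` is any callable that sends a JSON body and returns the
    decoded response, e.g. `http_post` above or a test double.
    """
    observation = post("/reset", {"task_id": task_id})["observation"]
    for finding in findings:
        post("/step", {"action": {"action_type": "report_finding", "finding": finding}})
    final = post("/step", {"action": {"action_type": "mark_complete"}})
    return observation, final["reward"]
```

Calling `run_episode(http_post, "dependency_review", findings)` runs it against the hosted environment.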

Run the baseline agent

export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="deepseek-ai/DeepSeek-V3-0324"
export HF_TOKEN="hf_..."
export ENV_URL="https://sam25kat-securereview.hf.space"

python inference.py

Run locally with Docker

docker build -t securereview .
docker run -p 7860:7860 securereview

Interface

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Landing page |
| GET | /health | Health check |
| GET | /tasks | List available tasks |
| GET | /metadata | Environment metadata |
| GET | /schema | Action / observation / state JSON schemas |
| GET | /state | Current episode state |
| GET | /docs | OpenAPI interactive docs |
| POST | /reset | Start a new episode |
| POST | /step | Execute an action |
| POST | /mcp | JSON-RPC 2.0 MCP endpoint |

Baseline

Evaluated against the live Space with deepseek-ai/DeepSeek-V3-0324 via the Hugging Face Inference Router.

| Task | Difficulty | Score |
|---|---|---|
| dependency_review | Easy | 0.45 |
| iac_review | Medium | 0.52 |
| migration_review | Hard | 0.05 |
| Average | | 0.34 |

Oracle reference (agent submitting ground-truth findings): 0.98, which validates grader correctness.

The hard task is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.


Training results

We trained models on the live environment using the canonical industry-standard hybrid pipeline (SFT warmup → GRPO refinement), the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack. Same env, same evaluation harness, end-to-end against the live grader.

| Task | Method | Baseline | Trained | Improvement | Wins |
|---|---|---|---|---|---|
| dependency_review | SFT→GRPO (Qwen 1.5B, 24 scenarios, 3 epochs) | 0.083 | 0.385 | +0.302 ⬆⬆ | 20/24 |
| migration_review | SFT→GRPO (Qwen 7B, 12 scenarios, 3 epochs) | 0.170 | 0.465 | +0.295 ⬆⬆ | 10/12 |
| iac_review | SFT→GRPO (Qwen 1.5B, 13 scenarios, 3 epochs) | 0.177 | 0.303 | +0.126 ⬆⬆ | 6/13 |
Average improvement across tasks: ~+0.24 mean reward, with individual scenarios gaining as much as +0.91. Training took under 30 seconds per task on a single GPU (A10G / L40S / L4).
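The averaged figure can be re-derived from the per-task baseline and trained scores reported above. The snippet is only an arithmetic check on those published numbers; the variable names are illustrative.

```python
# (baseline, trained) mean rewards per task, copied from the results table.
results = {
    "dependency_review": (0.083, 0.385),
    "migration_review": (0.170, 0.465),
    "iac_review": (0.177, 0.303),
}

improvements = {
    task: round(trained - baseline, 3)
    for task, (baseline, trained) in results.items()
}
mean_improvement = sum(improvements.values()) / len(improvements)

print(improvements)
print(round(mean_improvement, 3))  # 0.241
```

This is an unweighted mean over the three tasks, so it does not account for the different scenario counts (24, 12, and 13).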

Per-task before/after

Dependency review: +0.302 mean lift across 24 scenarios.

[Figure: Dependency review, before vs after SFT]

Migration review: +0.295 mean lift across 12 scenarios.

[Figure: Migration review, before vs after SFT]

IaC review: +0.126 mean lift across 13 scenarios.

[Figure: IaC review, before vs after SFT]

The full story (per-scenario breakdowns, training loss curves, hyperparameter sweeps, scenario-curriculum design, and engineering tradeoffs) is in training_results/RESULTS.md.

Reproducible training scripts are at training_space/ and the live trainer Spaces:


Blog & writeup

  • Mini-blog: BLOG.md. Submission writeup with problem, env, training pipeline, and results. Lives as a separate MD file at the root of the HF Space, per hackathon submission guidance.
  • Mirror discussion: HF community thread. Same content posted to the Space's Community tab for visibility.
  • Full results: training_results/RESULTS.md
  • Complete scenario index (all 76): training_results/SCENARIOS.md. File inventory, severity distribution, categories, per-scenario before/after.
  • Plots: training_results/plots/. Committed PNGs for all three tasks (before/after + training loss).
  • Per-task summaries: dep · migration · iac

Project structure

securereview/
├── app/
│   ├── main.py                FastAPI endpoints
│   ├── landing.py             Premium HTML landing page
│   ├── environment.py         Episode state machine
│   ├── models.py              Pydantic types
│   ├── graders/
│   │   ├── base.py            F1 + severity + efficiency scoring
│   │   ├── dependency_grader.py
│   │   ├── iac_grader.py
│   │   └── migration_grader.py
│   └── tasks/
│       ├── task_registry.py   Scenario discovery
│       └── scenarios/         76 hand-crafted scenarios
│           ├── dependency/    24 scenarios
│           ├── iac/           24 scenarios
│           └── migration/     28 scenarios
│
├── server/
│   └── app.py                 OpenEnv multi-mode entry point
├── inference.py               Baseline agent (OpenAI client)
├── openenv.yaml               Environment manifest
├── pyproject.toml             Package definition
├── uv.lock                    Reproducible dependency lock
└── Dockerfile

OpenEnv compliance

| Check | Status |
|---|---|
| openenv validate . (local) | ✓ |
| openenv validate --url (runtime) | ✓ |
| Docker build | ✓ |
| Multi-mode deployment (docker, uv_run, python_module, openenv_serve) | ✓ |
| Hugging Face Space deploys | ✓ |
| /health, /metadata, /schema, /mcp, /reset, /step, /state | ✓ |
| Typed Pydantic action / observation / state | ✓ |
| Deterministic grader, strictly (0, 1) | ✓ |
| Baseline inference.py with [START]/[STEP]/[END] markers | ✓ |

Team

Team CookHouse: Sai Jadhav · Sameer S Katte

Built for the Meta PyTorch OpenEnv Hackathon, Round 1.


License

MIT. See LICENSE.



An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.

SecureReview is the benchmark that holds it to that bar.