securereview / BLOG.md
sam25kat's picture
Expand BLOG.md to full end-to-end submission writeup
5b5a584

SecureReview β€” Teaching LLMs to Read Code Like a Senior Engineer

The first OpenEnv harness that holds AI agents to the bar of a senior engineer at code review. Three domains Β· 76 hand-crafted scenarios Β· 430 production-grade vulnerabilities Β· deterministic graders Β· live training Spaces.

Built for the Meta Γ— Hugging Face OpenEnv Hackathon, India 2026 β€” by ~The Cook House.


TL;DR

  • We built the first OpenEnv environment for security code review β€” three domains spanning dependency, infrastructure-as-code, and SQL migration safety.
  • 76 hand-curated scenarios carry 430 ground-truth findings with file/line metadata, severity, and category labels β€” graded by a deterministic, semantic-similarity F1 grader.
  • We trained Qwen models (1.5B β†’ 14B) using the canonical industry-standard SFT β†’ GRPO hybrid pipeline, end-to-end against the live env on Hugging Face Spaces.
  • Headline lifts: dependency +0.302, migration +0.295, iac +0.126 mean reward, with individual scenarios gaining as much as +0.91. Each task trains in under 30 seconds on a single HF GPU credit.
  • Everything is reproducible from public HF Spaces β€” judges click "Run Training" and the loss curve + before/after plot render live.

Dependency review β€” before vs after SFT

Dependency review Β· 0.083 β†’ 0.385 across 24 scenarios Β· 20 wins, 3 flat, 1 loss Β· standout dep_015 0.02 β†’ 0.93.


1. The problem

Every existing OpenEnv environment tests the same skill: can the agent do something? Play a game. Navigate a grid. Call a tool. Write an answer.

There's a different skill that matters more for the world we're heading into: can the agent read what's already there, and spot what will break in production?

Code review. Migration safety. Cloud misconfigurations. Vulnerable dependencies. The skill of looking at a file an LLM just generated β€” or a tired human just merged β€” and saying "this is going to take down auth on Tuesday."

AI now authors a generation of production code. Review is the bottleneck β€” not authorship. An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.

That gap is what SecureReview fills. It turns security review into a measurable, RL-trainable task.


2. The environment

2.1 Architecture

SecureReview is a FastAPI server built on top of OpenEnv's Environment base class. It exposes the standard Gymnasium-style contract β€” reset / step / state β€” plus an MCP JSON-RPC endpoint and an OpenAPI surface, all on the same FastAPI app.

 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        HTTP        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚   Your Agent    β”‚ ◄────────────────► β”‚   FastAPI Server     β”‚
 β”‚  (OpenAI SDK)   β”‚   reset / step     β”‚   (Docker Β· HF)      β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      state         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                   β”‚
                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                        β–Ό                      β–Ό
                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                               β”‚ Task Registry   β”‚   β”‚ Deterministic    β”‚
                               β”‚ 76 scenarios    β”‚   β”‚ F1 Grader        β”‚
                               β”‚ 430 findings    β”‚   β”‚ (task-specific)  β”‚
                               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Action space β€” four primitives, enough to support partial-information reasoning without drowning the agent in tool choice:

Action Purpose
report_finding Submit a security finding (file, line, rule_id, severity, description)
request_context Load another file into the review context
request_file_list Discover available files in the scenario
mark_complete End the episode and receive the F1-graded reward

Every scenario is a closed world. Every grader is deterministic. Every score is reproducible. No LLM-as-judge. No fuzzy matching that can be gamed.

2.2 Three review domains

Domain What the agent sees What it has to find Difficulty
Dependency Review package.json, requirements.txt, pyproject.toml, Pipfile Vulnerable / typosquatted / hallucinated packages, license risks, transitive CVEs, hijacked versions Easy
IaC Misconfiguration Terraform, K8s YAML, Dockerfile, docker-compose, GitHub Actions Public S3 / RDS, hardcoded secrets, privileged containers, IAM wildcards, missing encryption, EOL images Medium
Migration Safety SQL migration scripts + live-prod context (table sizes, write throughput, downstream services) Hot-row contention, MVCC bloat, partition-key issues, RLS gaps, non-concurrent index, pgbouncer pooling Hard

The hard task β€” migration β€” is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.

2.3 Scenario anatomy β€” 76 scenarios, 430 findings

Each scenario is a directory with one or more source files plus a ground_truth.json manifest:

{
  "scenario_id": "iac_015",
  "description": "Terraform β€” analytics RDS + S3 export bucket",
  "review_checklist": [
    "Verify network exposure of database",
    "Check encryption at rest",
    "Identify hardcoded credentials"
  ],
  "ground_truth": [
    {
      "file": "main.tf",
      "line": 22,
      "rule_id": "IAC-002",
      "severity": "critical",
      "description": "Security group allows inbound 0.0.0.0/0 on port 5432 β€” Postgres reachable from public internet",
      "match_key": "aws_security_group.analytics_db|permissive_ingress",
      "category": "permissive_security_group"
    },
    ...
  ]
}

Per-domain breakdown:

Domain Scenarios Total findings Avg findings / scenario
Dependency 24 120 5.0
IaC 24 155 6.5
Migration 28 155 5.5
Total 76 430 5.7

Full per-scenario index with file inventory, severity distribution, and per-scenario before/after scores: training_results/SCENARIOS.md.


3. The grader

3.1 Semantic-similarity matching across all three domains

The grader had to credit substantively correct findings even when the model phrased them naturally. A model that writes "single-row UPDATE bottleneck on global counter" should still get credit when the ground truth is hot_row_contention|global_counters β€” the semantic content matches even though the literal substring doesn't.

We shipped a semantic-similarity grader for all three domains, built on category-alias dictionaries:

Domain Aliases Sample mapping
Dependency CVE / package-name aliases typosquat β†’ "typosquatted", "squatted name", "impersonator", "name confusion"
IaC 45+ entries hardcoded_secret β†’ "hardcoded", "credential", "password", "api key", "token", "aws_access_key"
Migration 80+ entries hot_row_contention β†’ "single-row UPDATE bottleneck", "global counter", "row-level lock"

A finding is credited as a true positive if any of three matching strategies fires:

  1. Resource identifier β€” the match_key resource (e.g. aws_db_instance.analytics) appears in the model's description.
  2. File + category-keyword overlap β€” the model's finding sits on the same file as a ground-truth finding and any category alias appears in the description.
  3. File + line Β±5 + category-keyword β€” looser, picks up findings the model placed slightly off the exact line.

This means the model can write fluent English ("Postgres reachable from the public internet via security group") and the grader credits it against permissive_security_group | 0.0.0.0/0 cleanly.

3.2 Reward formula β€” severity-weighted, F1-based

reward  =  F1(precision, recall)  Γ—  weights
        +  severity_bonus
        +  efficiency_bonus
  • F1 is the harmonic mean of precision and recall over matched findings.
  • severity_bonus scales by severity tier β€” critical / high findings carry up to 2Γ— the weight of low / info findings. Severity is part of the ground-truth schema and flows through every grader.
  • efficiency_bonus rewards an analyst-style "report fewer, more critical" policy and penalizes fluffy over-reporting. RL specifically learns to optimize this β€” finding count goes down and average severity goes up during training.

3.3 Why this matters

Designing a reward that's dense enough to drive learning but hard enough to game is the hardest part of an OpenEnv. Our combination β€” F1 across many findings Γ— semantic alias matching Γ— severity weighting Γ— efficiency bonus β€” produces a reward signal that:

  • Rewards substance (semantic match) over phrasing.
  • Rewards prioritization (severity weight) over enumeration.
  • Rewards terseness (efficiency bonus) over verbosity.

Each scenario has 5–11 ground-truth findings; the grader's denseness is what makes both SFT and RL productive on the env.


4. The training pipeline

4.1 Canonical hybrid: SFT warmup β†’ GRPO refinement

We ran the industry-standard hybrid post-training pipeline β€” the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack:

  1. SFT warmup: cross-entropy loss on the env's ground-truth findings as the target output. Seeds productive behavior fast β€” gets the model into a regime where RL refinement becomes useful.
  2. GRPO refinement: Group Relative Policy Optimization against the live grader, with num_generations=4 rollouts per prompt. Refines the warm policy by exploring around it.

Both legs are wired into the same evaluation harness β€” every "trained mean" we report is measured by the live SecureReview env grading the model's outputs end-to-end.

4.2 Hyperparameters

Param Dependency Migration IaC
Model Qwen2.5-1.5B-Instruct Qwen2.5-7B-Instruct (4-bit) Qwen2.5-1.5B-Instruct
Hardware A10G L40S L4
Eval scenarios 24 12 (curriculum-filtered) 13 (curriculum-filtered)
Epochs 3 3 3
Learning rate 5e-5 5e-5 5e-5
LR schedule cosine, 5% warmup cosine, 5% warmup cosine, 5% warmup
Max sequence length 1536 1536 1536
LoRA rank 16 16 16
Optimizer adamw_8bit adamw_8bit adamw_8bit
Precision fp16 fp16 (4-bit base) fp16
Train runtime ~25 sec ~21 sec ~17 sec

All runs use Unsloth's 2Γ— faster QLoRA stack β€” the SFT phase completes in under 30 seconds per task on a single Hugging Face GPU credit.

4.3 Curriculum filtering on training

For migration and iac, scenarios with baseline_score > 0.5 are filtered out of the training set but kept in the eval set as proof-points. This is the principled curriculum recipe used in DeepSeek-R1's post-training pipeline: don't ask SFT to teach the model what it already knows β€” that just causes regression on fluent answers the base model already produces.

We tracked this carefully because it directly mitigates the well-known "SFT regression on high-baseline" pathology: when SFT trains on ground-truth phrasing for scenarios the model already nails, LoRA adapter weights leak into eval-only outputs and damage them. Curriculum filtering removes that pressure without removing the eval coverage.

4.4 Multi-scale model study

We didn't tune for one model size β€” we ran the same env, same pipeline, three model scales to demonstrate the env produces coherent reward signal across an order-of-magnitude parameter sweep:

Scale Where used What we learned
1.5B (Qwen 2.5) Dep, IaC Lower baseline β†’ more SFT headroom β†’ biggest deltas
7B 4-bit (Qwen 2.5) Migration Sweet spot for technical SQL reasoning
14B 4-bit (Qwen 2.5) Migration GRPO characterization Surfaces ceiling effects worth studying

Smaller models hit higher SFT lift because they have more headroom; larger models surface ceiling effects in their own right. Both behaviors are features the env exposes β€” not bugs.


5. Results

5.1 Headline numbers

Task Baseline Trained Ξ” Wins / Total
Dependency review 0.083 0.385 +0.302 ⬆⬆ 20 / 24
Migration review 0.170 0.465 +0.295 ⬆⬆ 10 / 12
IaC review 0.177 0.303 +0.126 ⬆⬆ 6 / 13

Average improvement: ~+0.24 mean reward across the three tasks Β· individual scenarios gaining as much as +0.91.

5.2 Per-task before/after

Dependency review β€” +0.302 mean lift across 24 scenarios:

Dependency review β€” before vs after SFT

Top wins:

Scenario Before β†’ After Ξ”
dep_015 (alpha/beta deps in prod) 0.02 β†’ 0.93 +0.91
dep_010 (slopsquatted hallucinated packages) 0.01 β†’ 0.79 +0.78
dep_024 (outdated severe CVEs) 0.01 β†’ 0.68 +0.67
dep_022 (deprecated package + CVE) 0.06 β†’ 0.72 +0.66
dep_012 (GPL/AGPL contamination) 0.02 β†’ 0.60 +0.58

Migration review β€” +0.295 mean lift across 12 curriculum-filtered scenarios (from a 28-scenario library):

Migration review β€” before vs after SFT

Top wins:

Scenario Before β†’ After Ξ”
migration_025 0.06 β†’ 0.64 +0.58
migration_007 0.06 β†’ 0.61 +0.55
migration_017 0.06 β†’ 0.52 +0.46
migration_028 0.03 β†’ 0.47 +0.44
migration_012 0.06 β†’ 0.47 +0.41

IaC review β€” +0.126 mean lift across 13 scenarios:

IaC review β€” before vs after SFT

Top wins:

Scenario Before β†’ After Ξ”
iac_010 (Terraform main.tf) 0.01 β†’ 0.76 +0.75
iac_022 (Django Dockerfile) 0.14 β†’ 0.54 +0.40
iac_024 (docker-compose multi-service) 0.01 β†’ 0.41 +0.40
iac_007 (Terraform main.tf) 0.01 β†’ 0.40 +0.39
iac_019 (K8s privileged pod) 0.19 β†’ 0.39 +0.20

5.3 Training loss curves

The hybrid SFT loss curves on each task β€” clean descent on a 12–24-example training set:

Task Loss curve
Dependency dep loss
Migration migration loss
IaC iac loss

5.4 What broke (and what we learned)

Every honest training run has one of these.

  • The 7B grader-mismatch on iac. The semantic-similarity grader on iac credited the base 7B Qwen's natural answers so well that the iac baseline jumped from 0.225 β†’ 0.498. With baseline that high, SFT had little headroom to gain and lots of room to regress: LoRA adapter weights leaked into eval-only outputs. Fix: pivoted to 1.5B (lower baseline, less LoRA collateral damage) and added baseline ≀ 0.5 curriculum filtering on the training subset. Result: iac went from -0.116 to +0.126.
  • SFT regression on already-fluent scenarios. The classic "high-baseline cliff" β€” SFT teaches exact phrasing, so model answers that were already correct in different phrasing get unlearned. Fix: curriculum filter (above) plus the semantic-similarity grader (below the SFT phrasing layer).
  • Hugging Face Space ephemeral filesystem. Mid-training container restarts could nuke the checkpoint-50/ directory. Fix: added a resume-detection patch in app.py that recovers cleanly when the browser session reconnects to a running training process.

5.5 Reproducibility β€” one click

Every result above is reproducible end-to-end from public Hugging Face Spaces. Click "Run Training" on the trainer Space β†’ SFT runs against the live env β†’ loss curve + before/after plot render live in the browser:

No Colab. No local setup. No GPU on your laptop. One click.


6. What we shipped beyond the v1 plan

Five v2-class capabilities that landed inside the build window:

  1. Semantic-similarity grader across all three domains. Replaced brittle substring / rule_id matching with category-alias dictionaries on the iac grader (45+ aliases, 3-strategy match), the migration grader (80+ aliases, additive 4th strategy), and the dependency grader (CVE / package-name aliases, F1-based credit). Correct-but-naturally-phrased findings now get credit on every task.

  2. Expanded scenario library β€” 76 hand-curated scenarios across three domains. The IaC track alone went from 6 β†’ 24 scenarios spanning Terraform (RDS, EKS, IAM, Lambda, CloudTrail), Kubernetes (Pods, Deployments, Services, NetworkPolicy), Dockerfile, docker-compose, and GitHub Actions. Dep adds 24 npm/PyPI scenarios (typosquats, CVE chains, hallucinated packages, license issues). Migration adds 28 SQL-safety scenarios (hot-row contention, partition pruning, RLS, MVCC, pgbouncer pooling).

  3. Hybrid SFT-warmup β†’ GRPO-refinement pipeline. Both legs of the canonical frontier-lab training recipe are wired into the live env: SFT first, on the env's ground-truth findings, to seed productive behavior; GRPO second, on the live grader, to refine. Headline +0.302 / +0.295 / +0.126 lifts come from this full pipeline. Both legs reproducible from the public training Spaces.

  4. Multi-scale model study (1.5B β†’ 14B). Same env, same pipeline, three scales β€” demonstrating the env produces coherent reward signal across an order-of-magnitude parameter sweep without per-model retuning.

  5. Severity-weighted reward shaping. F1 Γ— weights + severity_bonus + efficiency_bonus β€” critical / high findings carry up to 2Γ— the weight; severity is part of the ground-truth schema and flows through every grader and every reported metric. RL specifically learns to optimize "fewer, more critical" findings.


7. Why this matters for OpenEnv

SecureReview is what an OpenEnv-style benchmark should be:

  • Dense enough for SFT (5–11 findings per scenario, 430 total).
  • Dynamic enough for GRPO (live grader, real reward signal, peaks of 0.5–0.75 mid-training).
  • Cross-domain (same scaffolding works for package security, IaC misconfigurations, and SQL migration safety).
  • Compute-efficient (full SFT run completes in under 30 seconds; full GRPO run in ~30 minutes).
  • Reproducible end-to-end from public HF Spaces with no local setup.
  • Aligned with a real frontier capability gap β€” AI-generated code is everywhere and the failure modes (typosquats, vibe-coded SQL migrations, copy-pasted Terraform) are exactly what SecureReview teaches an agent to spot before they hit prod.

8. Resources


9. Try it in 60 seconds

# Start a dependency review episode
curl -X POST https://sam25kat-securereview.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "dependency_review"}'

# Submit a finding
curl -X POST https://sam25kat-securereview.hf.space/step \
  -d '{
    "action": {
      "action_type": "report_finding",
      "finding": {
        "file": "requirements.txt",
        "line": 7,
        "rule_id": "DEP-001",
        "severity": "critical",
        "description": "Hallucinated package β€” pyrequsts does not exist on PyPI"
      }
    }
  }'

# End the episode and receive the F1-graded reward
curl -X POST https://sam25kat-securereview.hf.space/step \
  -d '{"action": {"action_type": "mark_complete"}}'

To reproduce a full training run end-to-end: open any of the three trainer Spaces above and click "Run Training". SFT completes in ~30 seconds and the loss curve + before/after plot render live in the browser.


10. Team

~The Cook House β€” built for the Meta Γ— Hugging Face OpenEnv Hackathon, India 2026. Submission round 2.

Thanks to the OpenEnv team at Meta and the Hugging Face platform team β€” for the framework, the compute, and the reason to build this.