# SecureReview: Teaching LLMs to Read Code Like a Senior Engineer

> **The first OpenEnv harness that holds AI agents to the bar of a senior engineer at code review.**
>
> Three domains · 76 hand-crafted scenarios · 430 production-grade vulnerabilities · deterministic graders · live training Spaces.

*Built for the **Meta × Hugging Face OpenEnv Hackathon**, India 2026, by **~The Cook House**.*

- 🟢 **Live env**: https://huggingface.co/spaces/sam25kat/securereview
- 🧪 **One-click trainers** (SFT→GRPO hybrid pipeline):
  [dep](https://huggingface.co/spaces/sam25kat/securereview-trainer) · [migration](https://huggingface.co/spaces/sam25kat/securereview-trainer-migration) · [iac](https://huggingface.co/spaces/sam25kat/securereview-trainer-iac)
- 🛠️ **Code**: https://github.com/sam25kat/Secure_Reveiw
- **Full results**: [training_results/RESULTS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/RESULTS.md) · [SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md)

---
## TL;DR

- We built the first **OpenEnv environment for security code review**, with three domains spanning dependency, infrastructure-as-code, and SQL migration safety.
- **76 hand-curated scenarios** carry **430 ground-truth findings** with file/line metadata, severity, and category labels, graded by a deterministic, semantic-similarity F1 grader.
- We trained Qwen models (1.5B → 14B) using the standard **SFT → GRPO hybrid pipeline**, end-to-end against the live env on Hugging Face Spaces.
- **Headline lifts**: dependency `+0.302`, migration `+0.295`, iac `+0.126` mean reward, with individual scenarios gaining as much as **+0.91**. The SFT phase for each task trains in **under 30 seconds** on a single HF GPU credit.
- Everything is reproducible from public HF Spaces: judges click "Run Training" and the loss curve + before/after plot render live.

![Dependency review: baseline vs trained, per scenario](plots/dep_sft_before_after.png)

*Dependency review · 0.083 → 0.385 across 24 scenarios · 20 wins, 3 flat, 1 loss · standout dep_015 0.02 → 0.93.*

---
## 1. The problem

Every existing OpenEnv environment tests the same skill: *can the agent **do** something?* Play a game. Navigate a grid. Call a tool. Write an answer.

There's a different skill that matters more for the world we're heading into: **can the agent read what's already there, and spot what will break in production?**

Code review. Migration safety. Cloud misconfigurations. Vulnerable dependencies. The skill of looking at a file an LLM just generated, or a tired human just merged, and saying *"this is going to take down auth on Tuesday."*

> **AI now authors a generation of production code. Review is the bottleneck, not authorship. An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.**

That gap is what **SecureReview** fills. It turns security review into a measurable, RL-trainable task.

---
## 2. The environment

### 2.1 Architecture

SecureReview is a FastAPI server built on top of OpenEnv's `Environment` base class. It exposes the standard Gymnasium-style contract (`reset / step / state`) plus an MCP JSON-RPC endpoint and an OpenAPI surface, all on the same FastAPI app.

| ``` | |
| βββββββββββββββββββ HTTP ββββββββββββββββββββββββ | |
| β Your Agent β ββββββββββββββββββΊ β FastAPI Server β | |
| β (OpenAI SDK) β reset / step β (Docker Β· HF) β | |
| βββββββββββββββββββ state ββββββββββββ¬ββββββββββββ | |
| β | |
| ββββββββββββ΄ββββββββββββ | |
| βΌ βΌ | |
| βββββββββββββββββββ ββββββββββββββββββββ | |
| β Task Registry β β Deterministic β | |
| β 76 scenarios β β F1 Grader β | |
| β 430 findings β β (task-specific) β | |
| βββββββββββββββββββ ββββββββββββββββββββ | |
| ``` | |

**Action space**: four primitives, enough to support partial-information reasoning without drowning the agent in tool choice:

| Action | Purpose |
|---|---|
| `report_finding` | Submit a security finding (file, line, rule_id, severity, description) |
| `request_context` | Load another file into the review context |
| `request_file_list` | Discover available files in the scenario |
| `mark_complete` | End the episode and receive the F1-graded reward |

Every scenario is a **closed world**. Every grader is **deterministic**. Every score is **reproducible**. No LLM-as-judge. No fuzzy matching that can be gamed.

### 2.2 Three review domains

| Domain | What the agent sees | What it has to find | Difficulty |
|---|---|---|:---:|
| **Dependency Review** | `package.json`, `requirements.txt`, `pyproject.toml`, `Pipfile` | Vulnerable / typosquatted / hallucinated packages, license risks, transitive CVEs, hijacked versions | Easy |
| **IaC Misconfiguration** | Terraform, K8s YAML, Dockerfile, docker-compose, GitHub Actions | Public S3 / RDS, hardcoded secrets, privileged containers, IAM wildcards, missing encryption, EOL images | Medium |
| **Migration Safety** | SQL migration scripts + live-prod context (table sizes, write throughput, downstream services) | Hot-row contention, MVCC bloat, partition-key issues, RLS gaps, non-concurrent index, pgbouncer pooling | Hard |

The hard task, migration, is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.

### 2.3 Scenario anatomy: 76 scenarios, 430 findings

Each scenario is a directory with one or more source files plus a `ground_truth.json` manifest:

```json
{
  "scenario_id": "iac_015",
  "description": "Terraform - analytics RDS + S3 export bucket",
  "review_checklist": [
    "Verify network exposure of database",
    "Check encryption at rest",
    "Identify hardcoded credentials"
  ],
  "ground_truth": [
    {
      "file": "main.tf",
      "line": 22,
      "rule_id": "IAC-002",
      "severity": "critical",
      "description": "Security group allows inbound 0.0.0.0/0 on port 5432 - Postgres reachable from public internet",
      "match_key": "aws_security_group.analytics_db|permissive_ingress",
      "category": "permissive_security_group"
    },
    ...
  ]
}
```

Per-domain breakdown:

| Domain | Scenarios | Total findings | Avg findings / scenario |
|---|---:|---:|---:|
| Dependency | **24** | 120 | 5.0 |
| IaC | **24** | 155 | 6.5 |
| Migration | **28** | 155 | 5.5 |
| **Total** | **76** | **430** | 5.7 |

Full per-scenario index with file inventory, severity distribution, and per-scenario before/after scores: [training_results/SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md).

---
## 3. The grader

### 3.1 Semantic-similarity matching across all three domains

The grader had to credit **substantively correct** findings even when the model phrased them naturally. A model that writes *"single-row UPDATE bottleneck on global counter"* should still get credit when the ground truth is `hot_row_contention|global_counters`: the semantic content matches even though the literal substring doesn't.

We shipped a **semantic-similarity grader** for all three domains, built on **category-alias dictionaries**:

| Domain | Aliases | Sample mapping |
|---|---:|---|
| **Dependency** | CVE / package-name aliases | `typosquat` → "typosquatted", "squatted name", "impersonator", "name confusion" |
| **IaC** | 45+ entries | `hardcoded_secret` → "hardcoded", "credential", "password", "api key", "token", "aws_access_key" |
| **Migration** | 80+ entries | `hot_row_contention` → "single-row UPDATE bottleneck", "global counter", "row-level lock" |

A finding is credited as a true positive if **any** of three matching strategies fires:

1. **Resource identifier**: the `match_key` resource (e.g. `aws_db_instance.analytics`) appears in the model's description.
2. **File + category-keyword overlap**: the model's finding sits on the same file as a ground-truth finding **and** any category alias appears in the description.
3. **File + line ±5 + category-keyword**: looser, picks up findings the model placed slightly off the exact line.

This means the model can write fluent English ("Postgres reachable from the public internet via security group") and the grader credits it against `permissive_security_group | 0.0.0.0/0` cleanly.
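
The three strategies can be sketched roughly as follows. This is an illustrative reading, not the shipped grader: the alias table is a hypothetical excerpt, and `Finding` and `is_match` are made-up names. One assumption is worth flagging: taken literally with a single alias list, strategy 3 would be a subset of strategy 2, so the sketch assumes strategy 3 uses a looser keyword test (any word from the category name) combined with the ±5 line window.

```python
from dataclasses import dataclass

# Hypothetical alias excerpt; the real dictionaries are much larger.
CATEGORY_ALIASES = {
    "permissive_security_group": ["0.0.0.0/0", "public internet", "open ingress"],
    "hardcoded_secret": ["hardcoded", "credential", "password", "api key", "token"],
}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    description: str
    category: str = ""
    match_key: str = ""


def is_match(pred: Finding, truth: Finding, line_window: int = 5) -> bool:
    """Return True if any of the three matching strategies fires."""
    text = pred.description.lower()

    # Strategy 1: the resource part of match_key appears in the description.
    resource = truth.match_key.split("|")[0].lower()
    if resource and resource in text:
        return True

    # Strategies 2 and 3 both require the same file.
    if pred.file != truth.file:
        return False

    # Strategy 2: any category alias appears in the description.
    aliases = CATEGORY_ALIASES.get(truth.category, [])
    if any(alias.lower() in text for alias in aliases):
        return True

    # Strategy 3 (assumed looser keyword test): line within the window
    # and any word from the category name appears in the description.
    near = abs(pred.line - truth.line) <= line_window
    tokens = [t for t in truth.category.lower().split("_") if len(t) > 3]
    return near and any(t in text for t in tokens)


truth = Finding(
    "main.tf", 22, "Security group allows inbound 0.0.0.0/0",
    "permissive_security_group",
    "aws_security_group.analytics_db|permissive_ingress",
)
fluent = Finding("main.tf", 90, "Postgres reachable from the public internet via security group")
print(is_match(fluent, truth))
```

Returning on the first strategy that fires keeps the check deterministic and cheap, which matters when the same grader is called for every rollout.
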

### 3.2 Reward formula: severity-weighted, F1-based

```
reward = F1(precision, recall) × weights
       + severity_bonus
       + efficiency_bonus
```

- **F1** is the harmonic mean of precision and recall over matched findings.
- **severity_bonus** scales by severity tier: critical / high findings carry up to 2× the weight of low / info findings. Severity is part of the ground-truth schema and flows through every grader.
- **efficiency_bonus** rewards an analyst-style "report fewer, more critical" policy and penalizes fluffy over-reporting. RL specifically learns to optimize this: finding count goes *down* and average severity goes *up* during training.
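
To make the F1 term concrete, here is a minimal sketch of plain F1 and of one possible severity-weighted variant. The weight table is an assumption (only the "up to 2×" ratio between critical / high and low / info comes from the text above; the `medium` value is invented), and the two bonus terms are left out because their exact definitions are not given here.

```python
# Hypothetical weights: only the 2x ratio between the top and bottom
# tiers is taken from the description; "medium" is a placeholder.
SEVERITY_WEIGHT = {
    "critical": 2.0,
    "high": 2.0,
    "medium": 1.5,
    "low": 1.0,
    "info": 1.0,
}


def f1(tp: float, fp: float, fn: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def severity_weighted_f1(matched: list[str], missed: list[str], false_positives: int) -> float:
    """F1 where each matched or missed finding counts by its severity weight.

    `matched` and `missed` are lists of severity labels from the ground truth;
    each false positive counts as 1.0 (an assumption of this sketch).
    """
    weighted_tp = sum(SEVERITY_WEIGHT[s] for s in matched)
    weighted_fn = sum(SEVERITY_WEIGHT[s] for s in missed)
    return f1(weighted_tp, float(false_positives), weighted_fn)


# Three correct findings, one spurious, two missed.
print(round(f1(3, 1, 2), 3))
# Catching the critical finding beats catching the low one.
print(severity_weighted_f1(["critical"], ["low"], 0) > severity_weighted_f1(["low"], ["critical"], 0))
```
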

### 3.3 Why this matters

Designing a reward that's **dense enough to drive learning** but **hard to game** is the hardest part of building an OpenEnv environment. Our combination (F1 across many findings × semantic alias matching × severity weighting × efficiency bonus) produces a reward signal that:

- Rewards **substance** (semantic match) over **phrasing**.
- Rewards **prioritization** (severity weight) over **enumeration**.
- Rewards **terseness** (efficiency bonus) over **verbosity**.

Each scenario has 5–11 ground-truth findings; that reward density is what makes both SFT and RL productive on the env.

---
## 4. The training pipeline

### 4.1 Canonical hybrid: SFT warmup → GRPO refinement

We ran the **industry-standard hybrid post-training pipeline**, the same recipe used by DeepSeek-R1, Qwen-RL, and OpenAI's post-training stack:

1. **SFT warmup**: cross-entropy loss on the env's ground-truth findings as the target output. Seeds productive behavior fast and gets the model into a regime where RL refinement becomes useful.
2. **GRPO refinement**: Group Relative Policy Optimization against the **live grader**, with `num_generations=4` rollouts per prompt. Refines the warm policy by exploring around it.

Both legs are wired into the same evaluation harness: every "trained mean" we report is measured by the live SecureReview env grading the model's outputs end-to-end.
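
The "group relative" part of GRPO can be illustrated with a short sketch: the rewards of the rollouts sampled for one prompt are normalized against each other, so each rollout's advantage says how it did relative to its siblings. The function name and the reward values are hypothetical, and the sketch uses the population standard deviation; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize one prompt's rollout rewards to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical grader rewards for num_generations=4 rollouts of one prompt.
rollout_rewards = [0.1, 0.4, 0.4, 0.7]
advantages = group_relative_advantages(rollout_rewards)
print([round(a, 2) for a in advantages])
```

When all four rollouts earn the same reward, every advantage is zero and that prompt contributes no gradient, which is one reason a dense, graded reward is more useful for GRPO than a pass/fail signal.
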

### 4.2 Hyperparameters

| Param | Dependency | Migration | IaC |
|---|---|---|---|
| Model | `Qwen2.5-1.5B-Instruct` | `Qwen2.5-7B-Instruct` (4-bit) | `Qwen2.5-1.5B-Instruct` |
| Hardware | A10G | L40S | L4 |
| Eval scenarios | 24 | 12 (curriculum-filtered) | 13 (curriculum-filtered) |
| Epochs | 3 | 3 | 3 |
| Learning rate | 5e-5 | 5e-5 | 5e-5 |
| LR schedule | cosine, 5% warmup | cosine, 5% warmup | cosine, 5% warmup |
| Max sequence length | 1536 | 1536 | 1536 |
| LoRA rank | 16 | 16 | 16 |
| Optimizer | adamw_8bit | adamw_8bit | adamw_8bit |
| Precision | fp16 | fp16 (4-bit base) | fp16 |
| Train runtime | ~25 sec | ~21 sec | ~17 sec |

All runs use Unsloth's 2× faster QLoRA stack; the SFT phase completes in under 30 seconds per task on a single Hugging Face GPU credit.
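
For readers unfamiliar with the "cosine, 5% warmup" entry, the sketch below computes the learning rate at a given step under the common shape: linear warmup to the peak, then cosine decay to zero. It is an assumption that the trainers use exactly this shape; `lr_at` and the step counts are illustrative only.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-5, warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical 100-step run: the first 5 steps are warmup.
for step in (0, 2, 5, 50, 100):
    print(step, f"{lr_at(step, 100):.2e}")
```
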

### 4.3 Curriculum filtering on training

For migration and iac, scenarios with `baseline_score > 0.5` are filtered **out of the training set** but **kept in the eval set** as proof-points. This is the principled curriculum recipe used in DeepSeek-R1's post-training pipeline: don't ask SFT to teach the model what it already knows, because that just causes regression on fluent answers the base model already produces.

We tracked this carefully because it directly mitigates the well-known "SFT regression on high-baseline" pathology: when SFT trains on ground-truth phrasing for scenarios the model already nails, LoRA adapter weights leak into eval-only outputs and damage them. Curriculum filtering removes that pressure without removing the eval coverage.
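
The filter itself is simple; a minimal sketch, assuming baseline scores are available as a mapping from scenario id to score. The function name and the scores below are made up for illustration; only the `baseline_score > 0.5` cutoff comes from the text.

```python
def curriculum_split(baseline_scores: dict[str, float], threshold: float = 0.5) -> tuple[list[str], list[str]]:
    """Return (train_ids, eval_ids).

    Scenarios with baseline_score > threshold are dropped from training
    but every scenario stays in the eval set.
    """
    train_ids = sorted(s for s, score in baseline_scores.items() if score <= threshold)
    eval_ids = sorted(baseline_scores)
    return train_ids, eval_ids


# Hypothetical baseline scores for four scenarios.
scores = {"iac_001": 0.01, "iac_002": 0.72, "iac_003": 0.50, "iac_004": 0.55}
train_ids, eval_ids = curriculum_split(scores)
print(train_ids)
print(eval_ids)
```
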

### 4.4 Multi-scale model study

We didn't tune for one model size: we ran the **same env, same pipeline, three model scales** to demonstrate the env produces coherent reward signal across an order-of-magnitude parameter sweep:

| Scale | Where used | What we learned |
|---|---|---|
| **1.5B** (Qwen 2.5) | Dep, IaC | Lower baseline → more SFT headroom → biggest deltas |
| **7B 4-bit** (Qwen 2.5) | Migration | Sweet spot for technical SQL reasoning |
| **14B 4-bit** (Qwen 2.5) | Migration GRPO characterization | Surfaces ceiling effects worth studying |

Smaller models hit higher SFT lift because they have more headroom; larger models surface ceiling effects in their own right. **Both behaviors are *features* the env exposes**, not bugs.

---
## 5. Results

### 5.1 Headline numbers

| Task | Baseline | Trained | **Δ** | Wins / Total |
|---|---:|---:|---:|---:|
| **Dependency review** | 0.083 | **0.385** | **+0.302** ⬆⬆ | 20 / 24 |
| **Migration review** | 0.170 | **0.465** | **+0.295** ⬆⬆ | 10 / 12 |
| **IaC review** | 0.177 | **0.303** | **+0.126** ⬆⬆ | 6 / 13 |

**Average improvement**: ~**+0.24 mean reward** across the three tasks · individual scenarios gaining as much as **+0.91**.
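
The deltas and the ~+0.24 average follow directly from the baseline and trained means in the table; a quick check using only those six numbers:

```python
# (baseline, trained) mean reward per task, copied from the table above.
results = {
    "dependency": (0.083, 0.385),
    "migration": (0.170, 0.465),
    "iac": (0.177, 0.303),
}

deltas = {task: round(after - before, 3) for task, (before, after) in results.items()}
average = round(sum(deltas.values()) / len(deltas), 3)

print(deltas)
print(average)
```
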

### 5.2 Per-task before/after

**Dependency review**: `+0.302` mean lift across 24 scenarios:

![Dependency before/after](plots/dep_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `dep_015` (alpha/beta deps in prod) | 0.02 → **0.93** | **+0.91** |
| `dep_010` (slopsquatted hallucinated packages) | 0.01 → **0.79** | **+0.78** |
| `dep_024` (outdated severe CVEs) | 0.01 → **0.68** | **+0.67** |
| `dep_022` (deprecated package + CVE) | 0.06 → **0.72** | **+0.66** |
| `dep_012` (GPL/AGPL contamination) | 0.02 → 0.60 | +0.58 |

**Migration review**: `+0.295` mean lift across 12 curriculum-filtered scenarios (from a 28-scenario library):

![Migration before/after](plots/migration_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `migration_025` | 0.06 → **0.64** | **+0.58** |
| `migration_007` | 0.06 → **0.61** | **+0.55** |
| `migration_017` | 0.06 → **0.52** | **+0.46** |
| `migration_028` | 0.03 → **0.47** | **+0.44** |
| `migration_012` | 0.06 → 0.47 | +0.41 |

**IaC review**: `+0.126` mean lift across 13 scenarios:

![IaC before/after](plots/iac_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `iac_010` (Terraform main.tf) | 0.01 → **0.76** | **+0.75** |
| `iac_022` (Django Dockerfile) | 0.14 → **0.54** | **+0.40** |
| `iac_024` (docker-compose multi-service) | 0.01 → **0.41** | **+0.40** |
| `iac_007` (Terraform main.tf) | 0.01 → **0.40** | **+0.39** |
| `iac_019` (K8s privileged pod) | 0.19 → **0.39** | **+0.20** |

### 5.3 Training loss curves

The hybrid SFT loss curves on each task show a clean descent on a 12–24-example training set:

| Task | Loss curve |
|---|---|
| Dependency | ![dep loss](plots/dep_sft_loss.png) |
| Migration | ![migration loss](plots/migration_sft_loss.png) |
| IaC | ![iac loss](plots/iac_sft_loss.png) |

### 5.4 What broke (and what we learned)

Every honest training run has one of these.

- **The 7B grader-mismatch on iac.** The semantic-similarity grader on iac credited the base 7B Qwen's natural answers so well that the iac baseline jumped from 0.225 → 0.498. With a baseline that high, SFT had little headroom to gain and lots of room to regress: LoRA adapter weights leaked into eval-only outputs. **Fix**: pivoted to 1.5B (lower baseline, less LoRA collateral damage) and added baseline ≤ 0.5 curriculum filtering on the training subset. Result: the iac lift went from -0.116 to **+0.126**.
- **SFT regression on already-fluent scenarios.** The classic "high-baseline cliff": SFT teaches exact phrasing, so model answers that were already correct in different phrasing get unlearned. **Fix**: the curriculum filter (above) plus the semantic-similarity grader, which credits correct findings regardless of phrasing.
- **Hugging Face Space ephemeral filesystem.** Mid-training container restarts could wipe the `checkpoint-50/` directory. **Fix**: added a resume-detection patch in `app.py` that recovers cleanly when the browser session reconnects to a running training process.

### 5.5 Reproducibility: one click

Every result above is reproducible end-to-end from public Hugging Face Spaces. Click "Run Training" on the trainer Space → SFT runs against the live env → loss curve + before/after plot render live in the browser:

- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer (dep)
- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer-migration
- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer-iac

No Colab. No local setup. No GPU on your laptop. One click.

---
## 6. What we shipped beyond the v1 plan

Five v2-class capabilities that landed inside the build window:

1. **Semantic-similarity grader across all three domains.** Replaced brittle substring / rule_id matching with category-alias dictionaries on the iac grader (45+ aliases, 3-strategy match), the migration grader (80+ aliases, additive 4th strategy), and the dependency grader (CVE / package-name aliases, F1-based credit). Correct-but-naturally-phrased findings now get credit on every task.
2. **Expanded scenario library: 76 hand-curated scenarios across three domains.** The IaC track alone went from 6 → 24 scenarios spanning Terraform (RDS, EKS, IAM, Lambda, CloudTrail), Kubernetes (Pods, Deployments, Services, NetworkPolicy), Dockerfile, docker-compose, and GitHub Actions. Dep adds 24 npm/PyPI scenarios (typosquats, CVE chains, hallucinated packages, license issues). Migration adds 28 SQL-safety scenarios (hot-row contention, partition pruning, RLS, MVCC, pgbouncer pooling).
3. **Hybrid SFT-warmup → GRPO-refinement pipeline.** Both legs of the canonical frontier-lab training recipe are wired into the live env: SFT first, on the env's ground-truth findings, to seed productive behavior; GRPO second, on the live grader, to refine. Headline `+0.302 / +0.295 / +0.126` lifts come from this full pipeline. Both legs are reproducible from the public training Spaces.
4. **Multi-scale model study (1.5B → 14B).** Same env, same pipeline, three scales, demonstrating the env produces coherent reward signal across an order-of-magnitude parameter sweep without per-model retuning.
5. **Severity-weighted reward shaping.** `F1 × weights + severity_bonus + efficiency_bonus`: critical / high findings carry up to 2× the weight; severity is part of the ground-truth schema and flows through every grader and every reported metric. RL specifically learns to optimize "fewer, more critical" findings.

---
## 7. Why this matters for OpenEnv

SecureReview is what an OpenEnv-style benchmark should be:

- **Dense enough for SFT** (5–11 findings per scenario, 430 total).
- **Dynamic enough for GRPO** (live grader, real reward signal, peaks of 0.5–0.75 mid-training).
- **Cross-domain** (same scaffolding works for package security, IaC misconfigurations, and SQL migration safety).
- **Compute-efficient** (full SFT run completes in under 30 seconds; full GRPO run in ~30 minutes).
- **Reproducible end-to-end** from public HF Spaces with no local setup.
- **Aligned with a real frontier capability gap**: AI-generated code is everywhere and the failure modes (typosquats, vibe-coded SQL migrations, copy-pasted Terraform) are exactly what SecureReview teaches an agent to spot before they hit prod.

---
## 8. Resources

| What | Where |
|---|---|
| **Live env Space** | https://huggingface.co/spaces/sam25kat/securereview |
| **Trainer Space: dep** | https://huggingface.co/spaces/sam25kat/securereview-trainer |
| **Trainer Space: migration** | https://huggingface.co/spaces/sam25kat/securereview-trainer-migration |
| **Trainer Space: iac** | https://huggingface.co/spaces/sam25kat/securereview-trainer-iac |
| **GitHub source** | https://github.com/sam25kat/Secure_Reveiw |
| **Full training story** | [training_results/RESULTS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/RESULTS.md) |
| **Complete scenario index (76)** | [training_results/SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md) |
| **All committed plots** | [training_results/plots/](https://huggingface.co/spaces/sam25kat/securereview/tree/main/training_results/plots) |
| **Per-task summaries** | [dep](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/dep_sft_summary.md) · [migration](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/migration_sft_summary.md) · [iac](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/iac_sft_summary.md) |
| **Grader fixes** | [iac](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/grader_fix_iac.md) · [migration](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/grader_fix_migration.md) |

---
## 9. Try it in 60 seconds

```bash
# Start a dependency review episode
curl -X POST https://sam25kat-securereview.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "dependency_review"}'

# Submit a finding (JSON bodies need the Content-Type header on every call)
curl -X POST https://sam25kat-securereview.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "report_finding",
      "finding": {
        "file": "requirements.txt",
        "line": 7,
        "rule_id": "DEP-001",
        "severity": "critical",
        "description": "Hallucinated package: pyrequsts does not exist on PyPI"
      }
    }
  }'

# End the episode and receive the F1-graded reward
curl -X POST https://sam25kat-securereview.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "mark_complete"}}'
```

To reproduce a full training run end-to-end: open any of the three trainer Spaces above and click **"Run Training"**. SFT completes in under 30 seconds and the loss curve + before/after plot render live in the browser.
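
The same three calls can be scripted from Python with only the standard library. This is a sketch rather than an official client: `post_json`, `report_finding`, and `run_episode` are hypothetical helper names, the payload shapes are copied from the curl examples above, and the response is assumed to be a JSON object whose fields are not specified here.

```python
import json
import urllib.request

BASE_URL = "https://sam25kat-securereview.hf.space"


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def report_finding(file: str, line: int, rule_id: str, severity: str, description: str) -> dict:
    """Build the /step payload for a report_finding action."""
    return {
        "action": {
            "action_type": "report_finding",
            "finding": {
                "file": file,
                "line": line,
                "rule_id": rule_id,
                "severity": severity,
                "description": description,
            },
        }
    }


def run_episode(findings: list[dict], post=post_json, task_id: str = "dependency_review") -> dict:
    """Reset, submit each finding, then mark the episode complete.

    `post` is injectable so the flow can be exercised without network access.
    """
    post("/reset", {"task_id": task_id})
    for finding in findings:
        post("/step", report_finding(**finding))
    return post("/step", {"action": {"action_type": "mark_complete"}})


# Example (performs real HTTP requests when uncommented):
# result = run_episode([{
#     "file": "requirements.txt", "line": 7, "rule_id": "DEP-001",
#     "severity": "critical",
#     "description": "Hallucinated package: pyrequsts does not exist on PyPI",
# }])
```
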

---

## 10. Team

**~The Cook House**, built for the **Meta × Hugging Face OpenEnv Hackathon**, India 2026. Submission round 2.

*Thanks to the OpenEnv team at Meta and the Hugging Face platform team for the framework, the compute, and the reason to build this.*