Expand BLOG.md to full end-to-end submission writeup
Comprehensive coverage assuming judges read only the blog:
problem framing, env architecture, action space, scenario anatomy,
semantic grader internals, SFT→GRPO hybrid pipeline, full hyperparams,
per-task results with embedded plots, top-5 wins per task, training
loss curves, what broke and what we learned, "what we shipped beyond v1"
items as completed, full resource index, 60-second curl quickstart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BLOG.md
CHANGED
@@ -1,62 +1,387 @@
# SecureReview: Teaching LLMs to Read Code Like a Senior Engineer

> **The first OpenEnv harness that holds AI agents to the bar of a senior engineer at code review.**
> Three domains · 76 hand-crafted scenarios · 430 production-grade vulnerabilities · deterministic graders · live training Spaces.

*Built for the **Meta × Hugging Face OpenEnv Hackathon**, India 2026, by **~The Cook House**.*

- 🟢 **Live env**: https://huggingface.co/spaces/sam25kat/securereview
- 🧪 **One-click trainers** (SFT→GRPO hybrid pipeline):
  [dep](https://huggingface.co/spaces/sam25kat/securereview-trainer) · [migration](https://huggingface.co/spaces/sam25kat/securereview-trainer-migration) · [iac](https://huggingface.co/spaces/sam25kat/securereview-trainer-iac)
- 🛠️ **Code**: https://github.com/sam25kat/Secure_Reveiw
- 📊 **Full results**: [training_results/RESULTS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/RESULTS.md) · [SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md)

---

## TL;DR

- We built the first **OpenEnv environment for security code review**: three domains spanning dependency, infrastructure-as-code, and SQL migration safety.
- **76 hand-curated scenarios** carry **430 ground-truth findings** with file/line metadata, severity, and category labels, graded by a deterministic, semantic-similarity F1 grader.
- We trained Qwen models (1.5B → 14B) using the canonical industry-standard **SFT → GRPO hybrid pipeline**, end-to-end against the live env on Hugging Face Spaces.
- **Headline lifts**: dependency `+0.302`, migration `+0.295`, iac `+0.126` mean reward, with individual scenarios gaining as much as **+0.91**. Each task trains in **under 30 seconds** on a single HF GPU credit.
- Everything is reproducible from public HF Spaces: judges click "Run Training" and the loss curve + before/after plot render live.

![dep SFT: before vs after](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/dep_sft_before_after.png)

*Dependency review · 0.083 → 0.385 across 24 scenarios · 20 wins, 3 flat, 1 loss · standout dep_015 0.02 → 0.93.*

---

## 1. The problem

Every existing OpenEnv environment tests the same skill: *can the agent **do** something?* Play a game. Navigate a grid. Call a tool. Write an answer.

There's a different skill that matters more for the world we're heading into: **can the agent read what's already there, and spot what will break in production?**

Code review. Migration safety. Cloud misconfigurations. Vulnerable dependencies. The skill of looking at a file an LLM just generated (or a tired human just merged) and saying *"this is going to take down auth on Tuesday."*

> **AI now authors a growing share of production code. Review is the bottleneck, not authorship. An agent that cannot review code at the level of a senior engineer cannot be trusted to write it.**

That gap is what **SecureReview** fills. It turns security review into a measurable, RL-trainable task.

---

## 2. The environment

### 2.1 Architecture

SecureReview is a FastAPI server built on top of OpenEnv's `Environment` base class. It exposes the standard Gymnasium-style contract (`reset / step / state`) plus an MCP JSON-RPC endpoint and an OpenAPI surface, all on the same FastAPI app.

```
┌─────────────────┐        HTTP         ┌──────────────────────┐
│   Your Agent    │ ──────────────────► │    FastAPI Server    │
│  (OpenAI SDK)   │    reset / step     │    (Docker · HF)     │
└─────────────────┘        state        └──────────┬───────────┘
                                                   │
                                        ┌──────────┴───────────┐
                                        ▼                      ▼
                               ┌─────────────────┐   ┌──────────────────┐
                               │  Task Registry  │   │  Deterministic   │
                               │  76 scenarios   │   │    F1 Grader     │
                               │  430 findings   │   │ (task-specific)  │
                               └─────────────────┘   └──────────────────┘
```

**Action space**: four primitives, enough to support partial-information reasoning without drowning the agent in tool choice:

| Action | Purpose |
|---|---|
| `report_finding` | Submit a security finding (file, line, rule_id, severity, description) |
| `request_context` | Load another file into the review context |
| `request_file_list` | Discover available files in the scenario |
| `mark_complete` | End the episode and receive the F1-graded reward |

Every scenario is a **closed world**. Every grader is **deterministic**. Every score is **reproducible**. No LLM-as-judge. No fuzzy matching that can be gamed.

### 2.2 Three review domains

| Domain | What the agent sees | What it has to find | Difficulty |
|---|---|---|:---:|
| **Dependency Review** | `package.json`, `requirements.txt`, `pyproject.toml`, `Pipfile` | Vulnerable / typosquatted / hallucinated packages, license risks, transitive CVEs, hijacked versions | Easy |
| **IaC Misconfiguration** | Terraform, K8s YAML, Dockerfile, docker-compose, GitHub Actions | Public S3 / RDS, hardcoded secrets, privileged containers, IAM wildcards, missing encryption, EOL images | Medium |
| **Migration Safety** | SQL migration scripts + live-prod context (table sizes, write throughput, downstream services) | Hot-row contention, MVCC bloat, partition-key issues, RLS gaps, non-concurrent index, pgbouncer pooling | Hard |

The hard task, migration, is deliberately challenging. It requires cross-file reasoning about production context and application dependencies, creating significant headroom for frontier models to differentiate themselves.

### 2.3 Scenario anatomy: 76 scenarios, 430 findings

Each scenario is a directory with one or more source files plus a `ground_truth.json` manifest:

```json
{
  "scenario_id": "iac_015",
  "description": "Terraform: analytics RDS + S3 export bucket",
  "review_checklist": [
    "Verify network exposure of database",
    "Check encryption at rest",
    "Identify hardcoded credentials"
  ],
  "ground_truth": [
    {
      "file": "main.tf",
      "line": 22,
      "rule_id": "IAC-002",
      "severity": "critical",
      "description": "Security group allows inbound 0.0.0.0/0 on port 5432; Postgres reachable from public internet",
      "match_key": "aws_security_group.analytics_db|permissive_ingress",
      "category": "permissive_security_group"
    },
    ...
  ]
}
```
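
To make the manifest shape concrete, here is a minimal Python sketch of how an agent harness or a grader might load it. The `Finding` dataclass and `parse_manifest` helper are illustrative names, not taken from the SecureReview codebase, and the embedded example is a trimmed copy of the manifest above with a single finding.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One ground-truth entry; field names follow the manifest above."""
    file: str
    line: int
    rule_id: str
    severity: str
    description: str
    match_key: str
    category: str


def parse_manifest(text: str) -> tuple[str, list[Finding]]:
    """Return (scenario_id, findings) from a ground_truth.json document."""
    doc = json.loads(text)
    findings = [Finding(**entry) for entry in doc["ground_truth"]]
    return doc["scenario_id"], findings


# Trimmed copy of the manifest above: one finding, no elided entries.
EXAMPLE = """
{
  "scenario_id": "iac_015",
  "ground_truth": [
    {
      "file": "main.tf",
      "line": 22,
      "rule_id": "IAC-002",
      "severity": "critical",
      "description": "Security group allows inbound 0.0.0.0/0 on port 5432",
      "match_key": "aws_security_group.analytics_db|permissive_ingress",
      "category": "permissive_security_group"
    }
  ]
}
"""

scenario_id, findings = parse_manifest(EXAMPLE)
# match_key packs "<resource>|<issue>" into one string.
resource, _, issue = findings[0].match_key.partition("|")
print(scenario_id, resource, issue)
```

Splitting `match_key` on the first `|` separates the resource identifier from the issue tag, which is the part the grader's resource-identifier strategy relies on.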

Per-domain breakdown:

| Domain | Scenarios | Total findings | Avg findings / scenario |
|---|---:|---:|---:|
| Dependency | **24** | 120 | 5.0 |
| IaC | **24** | 155 | 6.5 |
| Migration | **28** | 155 | 5.5 |
| **Total** | **76** | **430** | 5.7 |
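
As a quick sanity check, the totals and per-scenario averages in the table can be recomputed from the per-domain counts. This snippet only restates the table's own numbers; nothing in it comes from the repository.

```python
# (scenarios, findings) per domain, copied from the table above.
BREAKDOWN = {
    "Dependency": (24, 120),
    "IaC": (24, 155),
    "Migration": (28, 155),
}

total_scenarios = sum(scenarios for scenarios, _ in BREAKDOWN.values())
total_findings = sum(findings for _, findings in BREAKDOWN.values())
averages = {
    name: findings / scenarios for name, (scenarios, findings) in BREAKDOWN.items()
}
overall_average = total_findings / total_scenarios

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} findings per scenario")
print(f"Total: {total_scenarios} scenarios, {total_findings} findings, "
      f"{overall_average:.1f} per scenario")
```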

Full per-scenario index with file inventory, severity distribution, and per-scenario before/after scores: [training_results/SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md).

---

## 3. The grader

### 3.1 Semantic-similarity matching across all three domains

The grader had to credit **substantively correct** findings even when the model phrased them naturally. A model that writes *"single-row UPDATE bottleneck on global counter"* should still get credit when the ground truth is `hot_row_contention|global_counters`: the semantic content matches even though the literal substring doesn't.

We shipped a **semantic-similarity grader** for all three domains, built on **category-alias dictionaries**:

| Domain | Aliases | Sample mapping |
|---|---:|---|
| **Dependency** | CVE / package-name aliases | `typosquat` → "typosquatted", "squatted name", "impersonator", "name confusion" |
| **IaC** | 45+ entries | `hardcoded_secret` → "hardcoded", "credential", "password", "api key", "token", "aws_access_key" |
| **Migration** | 80+ entries | `hot_row_contention` → "single-row UPDATE bottleneck", "global counter", "row-level lock" |

A finding is credited as a true positive if **any** of three matching strategies fires:

1. **Resource identifier**: the `match_key` resource (e.g. `aws_db_instance.analytics`) appears in the model's description.
2. **File + category-keyword overlap**: the model's finding sits on the same file as a ground-truth finding **and** any category alias appears in the description.
3. **File + line ±5 + category-keyword**: looser, picks up findings the model placed slightly off the exact line.

This means the model can write fluent English ("Postgres reachable from the public internet via security group") and the grader credits it against `permissive_security_group | 0.0.0.0/0` cleanly.
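
The three strategies above can be sketched as a small Python matcher. This is a hedged illustration, not the shipped grader: the `ALIASES` table reuses sample aliases quoted above, the `match_key` value is made up, and strategy 3 is an interpretation. Read literally, "file + line ±5 + category keyword" would be a subset of strategy 2, so the sketch assumes strategy 3 uses a looser keyword test (any word from the category name itself) inside the line window.

```python
from dataclasses import dataclass

# Hypothetical alias table; entries reuse sample aliases quoted in the table above.
ALIASES = {
    "hardcoded_secret": ("hardcoded", "credential", "password", "api key", "token"),
    "hot_row_contention": ("single-row update bottleneck", "global counter", "row-level lock"),
}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    description: str
    match_key: str = ""  # only populated on ground-truth findings
    category: str = ""   # only populated on ground-truth findings


def has_alias(description: str, category: str) -> bool:
    """True if any alias registered for the category occurs in the description."""
    text = description.lower()
    return any(alias in text for alias in ALIASES.get(category, ()))


def matches(pred: Finding, truth: Finding, line_window: int = 5) -> bool:
    text = pred.description.lower()
    resource = truth.match_key.split("|", 1)[0].lower()
    # Strategy 1: the ground-truth resource identifier is named in the description.
    if resource and resource in text:
        return True
    if pred.file != truth.file:
        return False
    # Strategy 2: same file and a category alias appears in the description.
    if has_alias(pred.description, truth.category):
        return True
    # Strategy 3 (assumed reading): same file, line within the window, and any
    # word of the category name itself appears in the description.
    near = abs(pred.line - truth.line) <= line_window
    category_words = [word for word in truth.category.lower().split("_") if word]
    return near and any(word in text for word in category_words)


truth = Finding("main.tf", 40, "", "aws_db_instance.analytics|hardcoded_password", "hardcoded_secret")
by_resource = Finding("other.tf", 1, "aws_db_instance.analytics exposes a master password")
by_alias = Finding("main.tf", 300, "Password is written directly in the resource block")
by_line = Finding("main.tf", 43, "Database secret committed in plain text")
too_far = Finding("main.tf", 300, "Database secret committed in plain text")
wrong_file = Finding("vars.tf", 40, "Password in plain text")
```

Because every check is a plain substring or integer comparison, the same inputs always produce the same verdict, which is what keeps the grader deterministic.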

### 3.2 Reward formula: severity-weighted, F1-based

```
reward = F1(precision, recall) × weights
       + severity_bonus
       + efficiency_bonus
```

- **F1** is the harmonic mean of precision and recall over matched findings.
- **severity_bonus** scales by severity tier: critical / high findings carry up to 2× the weight of low / info findings. Severity is part of the ground-truth schema and flows through every grader.
- **efficiency_bonus** rewards an analyst-style "report fewer, more critical" policy and penalizes fluffy over-reporting. RL specifically learns to optimize this: finding count goes *down* and average severity goes *up* during training.
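
A minimal sketch of the formula's shape follows. Only the F1 term is fully specified by the text; the `weight`, `severity_bonus`, and `efficiency_bonus` arguments are placeholders with made-up defaults, since the actual values live in the grader and are not stated here.

```python
def f1_score(true_positives: int, reported: int, expected: int) -> float:
    """Harmonic mean of precision and recall over matched findings."""
    if true_positives == 0 or reported == 0 or expected == 0:
        return 0.0
    precision = true_positives / reported
    recall = true_positives / expected
    return 2 * precision * recall / (precision + recall)


def reward(
    true_positives: int,
    reported: int,
    expected: int,
    *,
    weight: float = 1.0,            # placeholder, not the shipped value
    severity_bonus: float = 0.0,    # placeholder, not the shipped value
    efficiency_bonus: float = 0.0,  # placeholder, not the shipped value
) -> float:
    """reward = F1 x weight + severity_bonus + efficiency_bonus."""
    base = f1_score(true_positives, reported, expected)
    return base * weight + severity_bonus + efficiency_bonus


# Same three correct findings out of six expected, reported tersely vs verbosely.
terse = f1_score(true_positives=3, reported=4, expected=6)
verbose = f1_score(true_positives=3, reported=12, expected=6)
print(f"terse={terse:.3f} verbose={verbose:.3f}")
```

Even before any efficiency bonus, F1 already favors the terse report: padding the same three hits with nine extra findings cuts precision and drags the score down.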

### 3.3 Why this matters

Designing a reward that's **dense enough to drive learning** but **hard enough to game** is the hardest part of an OpenEnv. Our combination (F1 across many findings × semantic alias matching × severity weighting × efficiency bonus) produces a reward signal that:

- Rewards **substance** (semantic match) over **phrasing**.
- Rewards **prioritization** (severity weight) over **enumeration**.
- Rewards **terseness** (efficiency bonus) over **verbosity**.

Each scenario has 5–11 ground-truth findings; that density of graded findings is what makes both SFT and RL productive on the env.

---

## 4. The training pipeline

### 4.1 Canonical hybrid: SFT warmup → GRPO refinement

We ran the **industry-standard hybrid post-training pipeline**, the same broad recipe (supervised warmup followed by RL) reported for DeepSeek-R1 and commonly attributed to other frontier post-training stacks:

1. **SFT warmup**: cross-entropy loss on the env's ground-truth findings as the target output. Seeds productive behavior fast and gets the model into a regime where RL refinement becomes useful.
2. **GRPO refinement**: Group Relative Policy Optimization against the **live grader**, with `num_generations=4` rollouts per prompt. Refines the warm policy by exploring around it.

Both legs are wired into the same evaluation harness: every "trained mean" we report is measured by the live SecureReview env grading the model's outputs end-to-end.
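
To show what "group relative" means with `num_generations=4`, here is a small sketch of the advantage computation at the heart of GRPO: each rollout's reward is normalized against the other rollouts for the same prompt. It uses the population standard deviation and a small epsilon for simplicity; training libraries may differ in those details, and the reward values below are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group (one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical grader rewards for four rollouts of the same scenario.
group = [0.1, 0.3, 0.5, 0.7]
advantages = group_relative_advantages(group)

# When every rollout scores the same, there is no learning signal.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
print(advantages, flat)
```

This is why a dense, graded reward matters for GRPO: if all four rollouts score identically, every advantage is zero and the update carries no signal.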

### 4.2 Hyperparameters

| Param | Dependency | Migration | IaC |
|---|---|---|---|
| Model | `Qwen2.5-1.5B-Instruct` | `Qwen2.5-7B-Instruct` (4-bit) | `Qwen2.5-1.5B-Instruct` |
| Hardware | A10G | L40S | L4 |
| Eval scenarios | 24 | 12 (curriculum-filtered) | 13 (curriculum-filtered) |
| Epochs | 3 | 3 | 3 |
| Learning rate | 5e-5 | 5e-5 | 5e-5 |
| LR schedule | cosine, 5% warmup | cosine, 5% warmup | cosine, 5% warmup |
| Max sequence length | 1536 | 1536 | 1536 |
| LoRA rank | 16 | 16 | 16 |
| Optimizer | adamw_8bit | adamw_8bit | adamw_8bit |
| Precision | fp16 | fp16 (4-bit base) | fp16 |
| Train runtime | ~25 sec | ~21 sec | ~17 sec |

All runs use Unsloth's 2× faster QLoRA stack; the SFT phase completes in under 30 seconds per task on a single Hugging Face GPU credit.
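
For readers unfamiliar with the "cosine, 5% warmup" row, the sketch below shows the usual shape of that schedule: linear ramp to the peak learning rate over the first 5% of steps, then a cosine decay to zero. The peak of `5e-5` comes from the table; the step count of 100 is an arbitrary illustration, and the function is a generic formulation rather than the trainer's own code.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-5,
          warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL = 100  # illustrative step count, not taken from the runs above
schedule = [lr_at(step, TOTAL) for step in range(TOTAL + 1)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```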

### 4.3 Curriculum filtering on training

For migration and iac, scenarios with `baseline_score > 0.5` are filtered **out of the training set** but **kept in the eval set** as proof-points. This is a principled curriculum recipe, similar in spirit to the data filtering described for DeepSeek-R1's post-training pipeline: don't ask SFT to teach the model what it already knows, because that just causes regression on fluent answers the base model already produces.

We tracked this carefully because it directly mitigates the well-known "SFT regression on high-baseline" pathology: when SFT trains on ground-truth phrasing for scenarios the model already nails, LoRA adapter weights leak into eval-only outputs and damage them. Curriculum filtering removes that pressure without removing the eval coverage.
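
The filter itself is simple enough to sketch in a few lines. The function name and return shape are illustrative; the three low baselines reuse before-scores reported later in this post, while `migration_example_high` is a made-up scenario added only to show the filter excluding something.

```python
def split_curriculum(
    baseline_scores: dict[str, float], threshold: float = 0.5
) -> tuple[list[str], list[str]]:
    """Return (train_ids, eval_ids): high-baseline scenarios are eval-only."""
    train_ids = sorted(
        scenario for scenario, score in baseline_scores.items() if score <= threshold
    )
    eval_ids = sorted(baseline_scores)  # every scenario stays in the eval set
    return train_ids, eval_ids


baselines = {
    "migration_007": 0.06,
    "migration_025": 0.06,
    "migration_028": 0.03,
    "migration_example_high": 0.72,  # hypothetical scenario the model already handles
}
train_ids, eval_ids = split_curriculum(baselines)
print("train:", train_ids)
print("eval: ", eval_ids)
```

Keeping the filtered scenario in `eval_ids` is the point of the design: any regression on already-solved cases still shows up in the reported numbers.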

### 4.4 Multi-scale model study

We didn't tune for one model size. We ran the **same env, same pipeline, three model scales** to demonstrate that the env produces coherent reward signal across an order-of-magnitude parameter sweep:

| Scale | Where used | What we learned |
|---|---|---|
| **1.5B** (Qwen 2.5) | Dep, IaC | Lower baseline → more SFT headroom → biggest deltas |
| **7B 4-bit** (Qwen 2.5) | Migration | Sweet spot for technical SQL reasoning |
| **14B 4-bit** (Qwen 2.5) | Migration GRPO characterization | Surfaces ceiling effects worth studying |

Smaller models hit higher SFT lift because they have more headroom; larger models surface ceiling effects in their own right. **Both behaviors are *features* the env exposes**, not bugs.

---

## 5. Results

### 5.1 Headline numbers

| Task | Baseline | Trained | **Δ** | Wins / Total |
|---|---:|---:|---:|---:|
| **Dependency review** | 0.083 | **0.385** | **+0.302** ⬆⬆ | 20 / 24 |
| **Migration review** | 0.170 | **0.465** | **+0.295** ⬆⬆ | 10 / 12 |
| **IaC review** | 0.177 | **0.303** | **+0.126** ⬆⬆ | 6 / 13 |

**Average improvement**: ~**+0.24 mean reward** across the three tasks · individual scenarios gaining as much as **+0.91**.
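
The deltas and the ~+0.24 average follow directly from the baseline and trained columns; the snippet below just redoes that arithmetic from the table's numbers.

```python
# (baseline, trained) mean reward per task, copied from the table above.
RESULTS = {
    "dependency": (0.083, 0.385),
    "migration": (0.170, 0.465),
    "iac": (0.177, 0.303),
}

deltas = {task: trained - baseline for task, (baseline, trained) in RESULTS.items()}
mean_delta = sum(deltas.values()) / len(deltas)

for task, delta in deltas.items():
    print(f"{task}: {delta:+.3f}")
print(f"mean: {mean_delta:+.3f}")
```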

### 5.2 Per-task before/after

**Dependency review**: `+0.302` mean lift across 24 scenarios:

![dep SFT](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/dep_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `dep_015` (alpha/beta deps in prod) | 0.02 → **0.93** | **+0.91** |
| `dep_010` (slopsquatted hallucinated packages) | 0.01 → **0.79** | **+0.78** |
| `dep_024` (outdated severe CVEs) | 0.01 → **0.68** | **+0.67** |
| `dep_022` (deprecated package + CVE) | 0.06 → **0.72** | **+0.66** |
| `dep_012` (GPL/AGPL contamination) | 0.02 → 0.60 | +0.58 |

**Migration review**: `+0.295` mean lift across 12 curriculum-filtered scenarios (from a 28-scenario library):

![migration SFT](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/migration_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `migration_025` | 0.06 → **0.64** | **+0.58** |
| `migration_007` | 0.06 → **0.61** | **+0.55** |
| `migration_017` | 0.06 → **0.52** | **+0.46** |
| `migration_028` | 0.03 → **0.47** | **+0.44** |
| `migration_012` | 0.06 → 0.47 | +0.41 |

**IaC review**: `+0.126` mean lift across 13 scenarios:

![iac SFT](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/iac_sft_before_after.png)

Top wins:

| Scenario | Before → After | Δ |
|---|---|---|
| `iac_010` (Terraform main.tf) | 0.01 → **0.76** | **+0.75** |
| `iac_022` (Django Dockerfile) | 0.14 → **0.54** | **+0.40** |
| `iac_024` (docker-compose multi-service) | 0.01 → **0.41** | **+0.40** |
| `iac_007` (Terraform main.tf) | 0.01 → **0.40** | **+0.39** |
| `iac_019` (K8s privileged pod) | 0.19 → **0.39** | **+0.20** |

### 5.3 Training loss curves

The hybrid SFT loss curves on each task show a clean descent on a 12–24-example training set:

| Task | Loss curve |
|---|---|
| Dependency | ![dep loss](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/dep_sft_loss.png) |
| Migration | ![migration loss](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/migration_sft_loss.png) |
| IaC | ![iac loss](https://huggingface.co/spaces/sam25kat/securereview/resolve/main/training_results/plots/iac_sft_loss.png) |

### 5.4 What broke (and what we learned)

Every honest training run has one of these.

- **The 7B grader-mismatch on iac.** The semantic-similarity grader on iac credited the base 7B Qwen's natural answers so well that the iac baseline jumped from 0.225 → 0.498. With a baseline that high, SFT had little headroom to gain and lots of room to regress: LoRA adapter weights leaked into eval-only outputs. **Fix**: pivoted to 1.5B (lower baseline, less LoRA collateral damage) and added baseline ≤ 0.5 curriculum filtering on the training subset. Result: iac went from -0.116 to **+0.126**.
- **SFT regression on already-fluent scenarios.** The classic "high-baseline cliff": SFT teaches exact phrasing, so model answers that were already correct in different phrasing get unlearned. **Fix**: the curriculum filter (above) plus the semantic-similarity grader, which scores meaning rather than the exact phrasing SFT teaches.
- **Hugging Face Space ephemeral filesystem.** Mid-training container restarts could nuke the `checkpoint-50/` directory. **Fix**: added a resume-detection patch in `app.py` that recovers cleanly when the browser session reconnects to a running training process.

### 5.5 Reproducibility: one click

Every result above is reproducible end-to-end from public Hugging Face Spaces. Click "Run Training" on the trainer Space → SFT runs against the live env → loss curve + before/after plot render live in the browser:

- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer (dep)
- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer-migration
- 🧪 https://huggingface.co/spaces/sam25kat/securereview-trainer-iac

No Colab. No local setup. No GPU on your laptop. One click.

---

## 6. What we shipped beyond the v1 plan

Five v2-class capabilities that landed inside the build window:

1. **Semantic-similarity grader across all three domains.** Replaced brittle substring / rule_id matching with category-alias dictionaries on the iac grader (45+ aliases, 3-strategy match), the migration grader (80+ aliases, additive 4th strategy), and the dependency grader (CVE / package-name aliases, F1-based credit). Correct-but-naturally-phrased findings now get credit on every task.

2. **Expanded scenario library: 76 hand-curated scenarios across three domains.** The IaC track alone went from 6 → 24 scenarios spanning Terraform (RDS, EKS, IAM, Lambda, CloudTrail), Kubernetes (Pods, Deployments, Services, NetworkPolicy), Dockerfile, docker-compose, and GitHub Actions. Dep adds 24 npm/PyPI scenarios (typosquats, CVE chains, hallucinated packages, license issues). Migration adds 28 SQL-safety scenarios (hot-row contention, partition pruning, RLS, MVCC, pgbouncer pooling).

3. **Hybrid SFT-warmup → GRPO-refinement pipeline.** Both legs of the canonical frontier-lab training recipe are wired into the live env: SFT first, on the env's ground-truth findings, to seed productive behavior; GRPO second, on the live grader, to refine. Headline `+0.302 / +0.295 / +0.126` lifts come from this full pipeline. Both legs are reproducible from the public training Spaces.

4. **Multi-scale model study (1.5B → 14B).** Same env, same pipeline, three scales, demonstrating that the env produces coherent reward signal across an order-of-magnitude parameter sweep without per-model retuning.

5. **Severity-weighted reward shaping.** `F1 × weights + severity_bonus + efficiency_bonus`: critical / high findings carry up to 2× the weight; severity is part of the ground-truth schema and flows through every grader and every reported metric. RL specifically learns to optimize "fewer, more critical" findings.

---

## 7. Why this matters for OpenEnv

SecureReview is what an OpenEnv-style benchmark should be:

- **Dense enough for SFT** (5–11 findings per scenario, 430 total).
- **Dynamic enough for GRPO** (live grader, real reward signal, peaks of 0.5–0.75 mid-training).
- **Cross-domain** (same scaffolding works for package security, IaC misconfigurations, and SQL migration safety).
- **Compute-efficient** (full SFT run completes in under 30 seconds; full GRPO run in ~30 minutes).
- **Reproducible end-to-end** from public HF Spaces with no local setup.
- **Aligned with a real frontier capability gap**: AI-generated code is everywhere and the failure modes (typosquats, vibe-coded SQL migrations, copy-pasted Terraform) are exactly what SecureReview teaches an agent to spot before they hit prod.

---

## 8. Resources

| What | Where |
|---|---|
| **Live env Space** | https://huggingface.co/spaces/sam25kat/securereview |
| **Trainer Space (dep)** | https://huggingface.co/spaces/sam25kat/securereview-trainer |
| **Trainer Space (migration)** | https://huggingface.co/spaces/sam25kat/securereview-trainer-migration |
| **Trainer Space (iac)** | https://huggingface.co/spaces/sam25kat/securereview-trainer-iac |
| **GitHub source** | https://github.com/sam25kat/Secure_Reveiw |
| **Full training story** | [training_results/RESULTS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/RESULTS.md) |
| **Complete scenario index (76)** | [training_results/SCENARIOS.md](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/SCENARIOS.md) |
| **All committed plots** | [training_results/plots/](https://huggingface.co/spaces/sam25kat/securereview/tree/main/training_results/plots) |
| **Per-task summaries** | [dep](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/dep_sft_summary.md) · [migration](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/migration_sft_summary.md) · [iac](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/iac_sft_summary.md) |
| **Grader fixes** | [iac](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/grader_fix_iac.md) · [migration](https://huggingface.co/spaces/sam25kat/securereview/blob/main/training_results/grader_fix_migration.md) |

---

## 9. Try it in 60 seconds

```bash
# Start a dependency review episode
curl -X POST https://sam25kat-securereview.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "dependency_review"}'

# Submit a finding (JSON body, so send the JSON content type here too)
curl -X POST https://sam25kat-securereview.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "action_type": "report_finding",
      "finding": {
        "file": "requirements.txt",
        "line": 7,
        "rule_id": "DEP-001",
        "severity": "critical",
        "description": "Hallucinated package: pyrequsts does not exist on PyPI"
      }
    }
  }'

# End the episode and receive the F1-graded reward
curl -X POST https://sam25kat-securereview.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "mark_complete"}}'
```
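
The same calls can be made from Python using only the standard library. This is a sketch under stated assumptions: the endpoint paths and payloads mirror the curl commands above, while the helper names are illustrative and the response is assumed to be a JSON object. The module builds requests without sending anything; call `send()` yourself when you want to hit the live Space.

```python
import json
import urllib.request

BASE_URL = "https://sam25kat-securereview.hf.space"  # from the curl commands above


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request; nothing is sent until send() is called."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the body (assumed to be a JSON object)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


reset_req = build_request("/reset", {"task_id": "dependency_review"})
finding_req = build_request("/step", {
    "action": {
        "action_type": "report_finding",
        "finding": {
            "file": "requirements.txt",
            "line": 7,
            "rule_id": "DEP-001",
            "severity": "critical",
            "description": "Hallucinated package: pyrequsts does not exist on PyPI",
        },
    }
})
finish_req = build_request("/step", {"action": {"action_type": "mark_complete"}})
```

A full episode is then `send(reset_req)`, `send(finding_req)`, `send(finish_req)`, in that order, with the final response carrying the graded result.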

To reproduce a full training run end-to-end: open any of the three trainer Spaces above and click **"Run Training"**. SFT completes in ~30 seconds and the loss curve + before/after plot render live in the browser.

---

## 10. Team

**~The Cook House**, built for the **Meta × Hugging Face OpenEnv Hackathon**, India 2026. Submission round 2.

*Thanks to the OpenEnv team at Meta and the Hugging Face platform team for the framework, the compute, and the reason to build this.*