Aaron Brown committed on
Commit cebc7ff · 1 Parent(s): 1008330

Add docs and README

.gitignore ADDED
@@ -0,0 +1,57 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ *.egg-info/
7
+ dist/
8
+ build/
9
+ *.egg
10
+
11
+ # Virtual environments
12
+ .venv/
13
+ venv/
14
+ env/
15
+
16
+ # IDE
17
+ .idea/
18
+ .vscode/
19
+ *.swp
20
+ *.swo
21
+ *~
22
+
23
+ # OS
24
+ .DS_Store
25
+ Thumbs.db
26
+
27
+ # Docker build outputs (generated ranges)
28
+ outputs/
29
+
30
+ # Training outputs
31
+ training/outputs/
32
+ training/checkpoints/
33
+ training/logs/
34
+ wandb/
35
+ *.pt
36
+ *.safetensors
37
+ *.gguf
38
+
39
+ # Reward curves
40
+ training/*.png
41
+
42
+ # Environment
43
+ .env
44
+ .env.local
45
+ CLAUDE.md
46
+ IMPLEMENTATION_PLAN.md
47
+
48
+ # Jupyter
49
+ .ipynb_checkpoints/
50
+
51
+ # Test artifacts
52
+ .pytest_cache/
53
+ .coverage
54
+ htmlcov/
55
+
56
+ # Pre-validated range pool (generated at startup)
57
+ pool/
README.md ADDED
@@ -0,0 +1,298 @@
1
+ # OpenRange
2
+
3
+ **Multi-agent cyber gymnasium with real containers, golden-path validation, and self-evolving infrastructure.**
4
+
5
+ The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
6
+
7
+ ---
8
+
9
+ ## What is this?
10
+
11
+ OpenRange drops Red and Blue agents into a **real Docker network** -- web apps, databases, firewalls, and all -- then lets them fight. An LLM Builder generates the vulnerable infrastructure. A Validator confirms it's actually exploitable. And on every `reset()`, the Builder **mutates** the range with entirely different vulnerabilities, so agents can never memorize their way to victory.
12
+
13
+ ```
14
+ You write a YAML manifest describing what you want:
15
+ "2 hosts, DMZ network, web app with database, medium difficulty"
16
+
17
+ The Builder LLM generates it:
18
+ Real nginx + PHP app -> Real MySQL with flags -> Real firewall rules -> Golden path
19
+
20
+ The Validator confirms it works:
21
+ LLM review + 7 scripted checks including inverse mutation testing
22
+
23
+ Red attacks. Blue defends. Reset. New vulns. Repeat.
24
+ ```
25
+
26
+ ## Three Roles
27
+
28
+ | Role | What it does | Entry point |
29
+ |------|-------------|-------------|
30
+ | **Builder** | Generates and mutates vulnerable infrastructure from YAML manifests | LLM + templates |
31
+ | **Red** | Attacks live containers. Captures flags. | External -- no creds, no access |
32
+ | **Blue** | Defends via log analysis, patching, firewalling. | Internal -- monitor host |
33
+
34
+ Red and Blue operate on the **same infrastructure simultaneously**. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs.
35
+
36
+ ## Architecture
37
+
38
+ ```mermaid
39
+ flowchart TD
40
+ A[YAML Manifest<br/>Human-authored topology + vuln slots] --> B[Builder LLM<br/>Generates configs, plants vulns, writes golden path]
41
+ B --> C{Hybrid Validator}
42
+ C -->|Phase A| D[LLM Review<br/>Exploitability, alignment, difficulty]
43
+ C -->|Phase B| E[7-Check Scripted<br/>Services, flags, isolation,<br/>golden path, inverse mutation]
44
+ D --> F{PASS?}
45
+ E --> F
46
+ F -->|Yes| G[OpenEnv Server<br/>FastAPI: /reset, /step, /state, /ws]
47
+ F -->|No| B
48
+ G --> H[Red Agent<br/>nmap, curl, exploit, submit_flag]
49
+ G --> I[Blue Agent<br/>tail_log, grep, patch, iptables]
50
+ G --> J[NPC Traffic<br/>Background noise]
51
+ H --> K[(Docker Containers<br/>web, db, monitor)]
52
+ I --> K
53
+ J --> K
54
+
55
+ style A fill:#4a9eff,color:#fff
56
+ style B fill:#ff6b6b,color:#fff
57
+ style C fill:#ffd93d,color:#333
58
+ style G fill:#6bcb77,color:#fff
59
+ style K fill:#7c73e6,color:#fff
60
+ ```
61
+
62
+ ## Episode Lifecycle
63
+
64
+ ```mermaid
65
+ sequenceDiagram
66
+ participant T as Training Loop
67
+ participant E as OpenEnv Server
68
+ participant B as Builder LLM
69
+ participant V as Validator
70
+ participant C as Containers
71
+ participant R as Red Agent
72
+ participant Bl as Blue Agent
73
+
74
+ T->>E: reset()
75
+ E->>B: Manifest + mutation directive
76
+ B->>B: Generate structured JSON spec<br/>(vuln type, golden path, flags)
77
+ B->>C: Render templates -> hot-swap configs
78
+ C->>C: Restart affected services
79
+ E->>V: Validate range
80
+ V->>V: Phase A: LLM review
81
+ V->>C: Phase B: 7 scripted checks
82
+ V-->>E: PASS
83
+ E-->>T: RangeObservation (challenge description)
84
+
85
+ loop Episode Steps (alternating)
86
+ T->>E: step(Red: nmap -sV web)
87
+ E->>C: docker exec attacker nmap -sV web
88
+ C-->>E: stdout: 80/tcp open http
89
+ E-->>T: RangeObservation(stdout, reward)
90
+
91
+ T->>E: step(Blue: tail_log access.log)
92
+ E->>C: docker exec monitor tail access.log
93
+ C-->>E: log entries (Red + NPC mixed)
94
+ E-->>T: RangeObservation(stdout, reward)
95
+ end
96
+
97
+ Note over R,Bl: Red stealth reward coupled to Blue detection<br/>Blue detection reward coupled to Red actions
98
+ ```
99
+
100
+ ## Reset = Mutation
101
+
102
+ Every call to `reset()` triggers a **mutation** -- the Builder LLM swaps vulnerability classes in the running containers. The topology stays the same, but the challenge is completely different.
103
+
104
+ ```mermaid
105
+ flowchart LR
106
+ subgraph Episode 1
107
+ A1[SQLi in search form] --> F1[Flag in DB]
108
+ end
109
+ subgraph Episode 2
110
+ A2[Command injection<br/>in ping utility] --> F2[Flag on disk]
111
+ end
112
+ subgraph Episode 3
113
+ A3[SSRF -> internal SQLi] --> F3[Flag in internal DB]
114
+ end
115
+
116
+ Episode 1 -->|reset| Episode 2
117
+ Episode 2 -->|reset| Episode 3
118
+
119
+ style Episode 1 fill:#ff6b6b22,stroke:#ff6b6b
120
+ style Episode 2 fill:#ffd93d22,stroke:#ffd93d
121
+ style Episode 3 fill:#6bcb7722,stroke:#6bcb77
122
+ ```
123
+
124
+ Agents must **generalize** across vulnerability classes, not memorize exploit chains.
125
+
126
+ ## Quick Start
127
+
128
+ ```bash
129
+ # Install
130
+ git clone https://github.com/[team]/open-range.git
131
+ cd open-range
132
+ uv sync --all-extras
133
+
134
+ # Run the OpenEnv server locally
135
+ uv run uvicorn server.app:app --host 0.0.0.0 --port 8000
136
+
137
+ # Connect a client
138
+ python -c "
139
+ from client import OpenRangeEnv
140
+ from server.models import RangeAction
141
+
142
+ with OpenRangeEnv('http://localhost:8000').sync() as env:
+     result = env.reset()
+     print(result.observation.stdout)
+
+     result = env.step(RangeAction(command='nmap -sV web', mode='red'))
+     print(result.observation.stdout)
148
+ "
149
+ ```
150
+
151
+ ## Reward Signals
152
+
153
+ All rewards are **verifiable** -- grounded in real container state, not LLM judgment.
154
+
155
+ ```mermaid
156
+ flowchart TB
157
+ subgraph Red Rewards
158
+ RF[Flag Capture<br/>docker exec cat flag<br/>binary match]
159
+ RE[Efficiency<br/>gamma^steps]
160
+ RS[Stealth<br/>Did Blue detect?]
161
+ RH[Anti-hallucination<br/>-0.3 per fake flag]
162
+ end
163
+
164
+ subgraph Blue Rewards
165
+ BD[Detection<br/>TP rate vs Red's log]
166
+ BP[Patch<br/>Golden path re-run fails]
167
+ BA[Availability<br/>Healthcheck fraction]
168
+ BF[False Positive<br/>-0.2 per NPC flagged]
169
+ end
170
+
171
+ subgraph Coupling
172
+ RS -.-|depends on| BD
173
+ BD -.-|depends on| RF
174
+ end
175
+
176
+ style Red Rewards fill:#ff6b6b11,stroke:#ff6b6b
177
+ style Blue Rewards fill:#4a9eff11,stroke:#4a9eff
178
+ style Coupling fill:#ffd93d11,stroke:#ffd93d,stroke-dasharray: 5 5
179
+ ```
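The Blue-side detection and false-positive terms in the diagram above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the event IDs, the `BlueScore` type, and the `score_blue_findings` helper are hypothetical, and only the `-0.2` per-NPC penalty and the "TP rate vs Red's log" idea come from the text.

```python
from dataclasses import dataclass

# Penalty per NPC event wrongly flagged, as stated in the reward diagram.
FALSE_POSITIVE_PENALTY = 0.2


@dataclass(frozen=True)
class BlueScore:
    detection: float       # true-positive rate against Red's action log
    false_positive: float  # non-positive penalty for flagged NPC traffic


def score_blue_findings(findings: set, red_actions: set) -> BlueScore:
    """Score Blue's findings against the ground-truth log of Red's actions.

    Both arguments are sets of hypothetical event IDs. Anything Blue flags
    that is not a Red action is treated as NPC traffic.
    """
    true_positives = findings & red_actions
    false_positives = findings - red_actions
    tp_rate = len(true_positives) / len(red_actions) if red_actions else 0.0
    return BlueScore(
        detection=tp_rate,
        false_positive=-FALSE_POSITIVE_PENALTY * len(false_positives),
    )


# Blue flags one real Red action and one NPC event.
score = score_blue_findings({"evt-1", "evt-7"}, {"evt-1", "evt-2"})
```

Because both terms are computed from logged events rather than from a model's opinion, the score stays verifiable in the sense the section describes.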
180
+
181
+ ## Golden Path Validation
182
+
183
+ Every generated range passes a **7-check validation pipeline** before any agent touches it:
184
+
185
+ ```mermaid
186
+ flowchart LR
187
+ S1[1. Services up<br/>nc -z ports] --> S2[2. Flags exist<br/>docker exec cat]
188
+ S2 --> S3[3. Network isolation<br/>external !-> internal]
189
+ S3 --> S4[4. Golden path<br/>execute exploit steps]
190
+ S4 --> S5[5. Difficulty<br/>steps within 20%]
191
+ S5 --> S6[6. No leaks<br/>grep description]
192
+ S6 --> S7[7. Inverse mutation<br/>revert vuln -> step fails]
193
+
194
+ S7 -->|All pass| PASS[VALID]
195
+ S7 -->|Any fail| FAIL[RETRY<br/>Builder gets error context]
196
+
197
+ style PASS fill:#6bcb77,color:#fff
198
+ style FAIL fill:#ff6b6b,color:#fff
199
+ style S7 fill:#ffd93d,color:#333
200
+ ```
201
+
202
+ Check 7 is from [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552): it proves each planted vulnerability actually contributes to the challenge.
203
+
204
+ ## Tier System
205
+
206
+ Difficulty grows **horizontally** -- more hosts, more networks, more services. Not just harder passwords.
207
+
208
+ ```mermaid
209
+ flowchart TD
210
+ subgraph Tier 1 - Basic
211
+ W1[web<br/>nginx + PHP] --> D1[db<br/>MySQL]
212
+ end
213
+
214
+ subgraph Tier 2 - Corporate
215
+ W2[web] --> D2[db]
216
+ W2 --> M2[mail<br/>SMTP]
217
+ FW2[firewall<br/>iptables] --> W2
218
+ end
219
+
220
+ subgraph Tier 3 - Enterprise
221
+ W3[web] --> D3[db]
222
+ W3 --> DC3[DC<br/>LDAP/Kerberos]
223
+ FS3[files<br/>SMB] --> DC3
224
+ end
225
+
226
+ style Tier 1 - Basic fill:#6bcb7722,stroke:#6bcb77
227
+ style Tier 2 - Corporate fill:#ffd93d22,stroke:#ffd93d
228
+ style Tier 3 - Enterprise fill:#ff6b6b22,stroke:#ff6b6b
229
+ ```
230
+
231
+ | Tier | Hosts | Networks | Services | Golden Steps |
232
+ |------|-------|----------|----------|--------------|
233
+ | 1 | web + db | dmz | nginx, mysql, sshd | ~8 |
234
+ | 2 | + mail + fw | + internal | + smtp, iptables | ~15 |
235
+ | 3 | + files + DC | + mgmt | + smb, ldap, kerberos | ~25 |
236
+ | 4 | + jump + NPC | all | + bastion, cron, rsync | ~35 |
237
+ | 5 | + honeypot | + trap | + decoys, WAF, IDS | ~50 |
238
+
239
+ ## Tandem Red + Blue Training
240
+
241
+ ```mermaid
242
+ sequenceDiagram
243
+ participant Red as Red Agent<br/>(attacker)
244
+ participant Env as Range<br/>(containers)
245
+ participant Blue as Blue Agent<br/>(defender)
246
+
247
+ Note over Red,Blue: Episode begins -- Builder mutated range
248
+
249
+ Red->>Env: nmap -sV web
250
+ Env-->>Red: 80/tcp open http nginx
251
+ Note right of Env: Action logged
252
+
253
+ Blue->>Env: tail_log access.log
254
+ Env-->>Blue: [NPC traffic + Red's scan mixed]
255
+ Blue->>Env: submit_finding: port scan detected
256
+ Note left of Blue: True positive!
257
+
258
+ Red->>Env: curl 'web/search?q=' OR 1=1--
259
+ Env-->>Red: Database results + flag
260
+ Note right of Env: Action logged
261
+
262
+ Red->>Env: submit_flag FLAG{abc123}
263
+ Env-->>Red: Correct! reward=1.0
264
+
265
+ Blue->>Env: grep_log "UNION|SELECT|OR 1"
266
+ Env-->>Blue: SQLi pattern found
267
+ Blue->>Env: patch search.php (parameterize query)
268
+ Env-->>Blue: Patch applied
269
+
270
+ Note over Env: Re-run golden path exploit
271
+ Note over Env: Exploit FAILS -> patch valid
272
+
273
+ Note over Red,Blue: Red stealth: LOW (Blue caught it)<br/>Blue detection: HIGH (found real attack)
274
+ ```
275
+
276
+ ## Project Structure
277
+
278
+ ```
279
+ open-range/
+ ├── manifests/     YAML range definitions (topology, vulns, golden paths)
+ ├── vulns/         Vulnerability catalog (plantable vuln templates)
+ ├── builder/       Builder LLM + Mutator + rendering templates
+ ├── validator/     Hybrid validator (LLM review + 7-check scripted)
+ ├── server/        OpenEnv server (Environment, models, rewards, app.py)
+ ├── client/        Typed OpenEnv client
+ ├── docs/          Architecture docs and guides
+ ├── examples/      Demo scripts
+ └── tests/         Test suite
289
+ ```
290
+
291
+ ## Built On
292
+
293
+ - [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
294
+ - Lessons from [R2E-Gym](https://arxiv.org/abs/2504.07164) (hybrid verification) and [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs, inverse mutation testing, frontier-calibrating rewards)
295
+
296
+ ## License
297
+
298
+ Apache 2.0
docs/architecture.md ADDED
@@ -0,0 +1,155 @@
1
+ # Architecture
2
+
3
+ ## System Overview
4
+
5
+ OpenRange is a 5-layer system. Data flows top-to-bottom during setup, loops during episodes, and feeds back up during curriculum escalation.
6
+
7
+ ```
8
+ ┌─────────────────────────────────────────────────┐
+ │                  YAML MANIFEST                  │
+ │  Topology, vuln slots, golden path, difficulty  │
+ │                (human-authored)                 │
+ └──────────────────────┬──────────────────────────┘
+                        │
+                        ▼
+ ┌─────────────────────────────────────────────────┐
+ │                   BUILDER LLM                   │
+ │  Structured JSON spec → template rendering →    │
+ │  Dockerfiles, configs, vulnerable app code,     │
+ │  flag placement, golden path, NPC scripts       │
+ │  Called on every reset() to MUTATE the range    │
+ └──────────────────────┬──────────────────────────┘
+                        │
+                        ▼
+ ┌─────────────────────────────────────────────────┐
+ │                HYBRID VALIDATOR                 │
+ │  Phase A: LLM reviews exploitability,           │
+ │           alignment, difficulty                 │
+ │  Phase B: 7-check scripted execution            │
+ │           (services, flags, isolation,          │
+ │            golden path, difficulty,             │
+ │            leak check, inverse mutation)        │
+ │  PASS → proceed    FAIL → Builder retries       │
+ └──────────────────────┬──────────────────────────┘
+                        │
+                        ▼
+ ┌─────────────────────────────────────────────────┐
+ │                 OPENENV SERVER                  │
+ │                                                 │
+ │  FastAPI: /reset, /step, /state, /ws            │
+ │                                                 │
+ │  RangeAction(command, mode) ──────────────────┐ │
+ │  RangeObservation(stdout, stderr, reward) ◄───┘ │
+ │                                                 │
+ │   ┌──────────┐   ┌──────────┐   ┌──────────┐    │
+ │   │   RED    │   │   BLUE   │   │   NPC    │    │
+ │   │ External │   │ Monitor  │   │ Traffic  │    │
+ │   │ attacker │   │ defender │   │  noise   │    │
+ │   └──────────┘   └──────────┘   └──────────┘    │
+ └──────────────────────┬──────────────────────────┘
+                        │
+                        ▼
+ ┌─────────────────────────────────────────────────┐
+ │            DOCKER CONTAINERS (range)            │
+ │                                                 │
+ │     ┌────────┐    ┌────────┐    ┌────────┐      │
+ │     │  web   │───▶│   db   │    │monitor │      │
+ │     │nginx+  │    │ mysql  │    │  logs  │      │
+ │     │PHP app │    │ flags  │    │  Blue  │      │
+ │     └────────┘    └────────┘    └────────┘      │
+ │        DMZ         Internal        Mgmt         │
+ └─────────────────────────────────────────────────┘
62
+ ```
63
+
64
+ ## Data Flow
65
+
66
+ ### Setup (once)
67
+ 1. Human writes YAML manifest defining topology + vuln slots
68
+ 2. Builder LLM generates initial infrastructure
69
+ 3. `docker compose up` starts all containers
70
+ 4. Validator confirms range is exploitable and correctly configured
71
+
72
+ ### Episode Loop
73
+ 1. `reset()` → Builder LLM mutates vulns (new class, new flag, new golden path)
74
+ 2. Hot-swap configs into running containers, restart affected services
75
+ 3. Validator confirms mutation is valid (LLM + 7 scripted checks)
76
+ 4. Red and Blue agents interact via `step(RangeAction)`:
77
+ - Red: executes commands against containers (nmap, curl, ssh, submit_flag)
78
+ - Blue: reads logs, patches vulns, blocks IPs (tail_log, iptables, patch, submit_finding)
79
+ 5. Environment computes rewards from verifiable container state
80
+ 6. Episode ends when: flag captured, max steps, timeout, or all vulns patched
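
The four end-of-episode conditions in step 6 can be expressed as a single check. The sketch below is illustrative only: `EpisodeStatus`, `termination_reason`, and the order in which conditions are tested are assumptions, not names or behavior taken from the OpenRange codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EpisodeStatus:
    """Hypothetical snapshot of the state the environment would track."""
    flag_captured: bool
    step_count: int
    elapsed_s: float
    unpatched_vulns: int


def termination_reason(status: EpisodeStatus, max_steps: int,
                       timeout_s: float) -> Optional[str]:
    """Return why the episode ends, or None if it should continue."""
    if status.flag_captured:
        return "flag_captured"
    if status.unpatched_vulns == 0:
        return "all_vulns_patched"
    if status.step_count >= max_steps:
        return "max_steps"
    if status.elapsed_s >= timeout_s:
        return "timeout"
    return None


running = EpisodeStatus(flag_captured=False, step_count=3,
                        elapsed_s=12.0, unpatched_vulns=1)
reason = termination_reason(running, max_steps=10, timeout_s=300.0)
```

Returning a reason string rather than a bare boolean makes it easy to log which side ended the episode.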
81
+
82
+ ### Curriculum (optional, post-training)
83
+ 1. Track Red solve rate and Blue detection rate
84
+ 2. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s`
85
+ 3. When agents plateau: horizontal growth (add hosts, networks, services)
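
The difficulty signal in step 2 is a one-line formula, so a small sketch helps show how it behaves. The function below evaluates `r_inject = 1 - (1+alpha)*s` exactly as written above; the function name and the range check on `s` are assumptions added for illustration, and no particular value of `alpha` is implied by the source.

```python
def injection_reward(solve_rate: float, alpha: float) -> float:
    """Evaluate r_inject = 1 - (1 + alpha) * s for a solve rate s in [0, 1].

    The reward falls linearly as Red solves more often, and crosses zero
    at s = 1 / (1 + alpha).
    """
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be between 0 and 1")
    return 1.0 - (1.0 + alpha) * solve_rate


# With alpha = 0.5 the reward is positive only while s < 2/3.
low_solve_rate = injection_reward(0.25, alpha=0.5)
high_solve_rate = injection_reward(1.0, alpha=0.5)
```

Under this reading, a range that Red always solves earns the Builder a negative reward, which pushes generation toward harder challenges.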
86
+
87
+ ## Builder LLM Output Format
88
+
89
+ The Builder produces **structured JSON**, not prose (lesson from Self-Play SWE-RL):
90
+
91
+ ```json
92
+ {
+   "vulns": [
+     {
+       "type": "sqli",
+       "host": "web",
+       "injection_point": "/search?q=",
+       "vulnerable_code": "<?php $q=$_GET['q']; $r=mysql_query(\"SELECT * FROM products WHERE name LIKE '%$q%'\"); ?>",
+       "flag_location": "/var/flags/flag1.txt"
+     }
+   ],
+   "flags": [
+     {"id": "flag1", "value": "FLAG{r4nd0m_v4lu3}", "path": "/var/flags/flag1.txt", "host": "web"}
+   ],
+   "golden_path": [
+     {"step": 1, "cmd": "nmap -sV web", "expect_stdout": "80/tcp open http"},
+     {"step": 2, "cmd": "curl http://web/", "expect_stdout": "search"},
+     {"step": 3, "cmd": "curl -G http://web/search --data-urlencode \"q=test' UNION SELECT flag FROM flags-- \"", "expect_stdout": "FLAG{"},
+     {"step": 4, "cmd": "submit_flag FLAG{r4nd0m_v4lu3}", "expect_stdout": "correct"}
+   ],
+   "challenge_description": "Target network has a web application with a database backend. Find and exploit vulnerabilities to capture the flag."
+ }
113
+ ```
114
+
115
+ A thin template layer renders this JSON into actual config files (PHP, nginx.conf, init.sql).
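
As a rough idea of what such a template layer might look like, the sketch below renders an `init.sql` body from the `flags` list of a spec using only the standard library. The `flags` table schema, the template strings, and the `render_init_sql` helper are all hypothetical; the real templates live in the repository and may differ.

```python
from string import Template

# Hypothetical schema: one row per flag in a `flags` table.
CREATE_TABLE = (
    "CREATE TABLE IF NOT EXISTS flags "
    "(id VARCHAR(32) PRIMARY KEY, flag VARCHAR(128));\n"
)
INSERT_FLAG = Template(
    "INSERT INTO flags (id, flag) VALUES ('$flag_id', '$flag_value');\n"
)


def render_init_sql(spec: dict) -> str:
    """Render an init.sql body from the `flags` list of a Builder spec."""
    rows = [
        INSERT_FLAG.substitute(flag_id=flag["id"], flag_value=flag["value"])
        for flag in spec["flags"]
    ]
    return CREATE_TABLE + "".join(rows)


spec = {
    "flags": [
        {"id": "flag1", "value": "FLAG{r4nd0m_v4lu3}",
         "path": "/var/flags/flag1.txt", "host": "web"}
    ]
}
init_sql = render_init_sql(spec)
```

Keeping the LLM's output as data and doing the rendering in plain code means a malformed spec fails fast (a missing key raises `KeyError`) instead of producing a subtly broken config file. Note that this sketch does not escape quotes in flag values.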
116
+
117
+ ## Hybrid Validation Pipeline
118
+
119
+ Two phases, both must pass:
120
+
121
+ **Phase A: LLM Review** (fast, catches design-level bugs)
122
+ - Is the vulnerability actually exploitable given these configs?
123
+ - Does the challenge description match without leaking the answer?
124
+ - Is the golden path correct for this vuln type?
125
+ - Is the difficulty right for the current tier?
126
+
127
+ **Phase B: 7-Check Scripted Execution** (ground truth, catches implementation bugs)
128
+ 1. Services respond on expected ports
129
+ 2. Flags exist at expected paths with correct values
130
+ 3. Network isolation holds (external can't reach internal)
131
+ 4. Golden path commands produce expected outputs
132
+ 5. Step count within 20% of difficulty target
133
+ 6. Challenge description contains no flag values or exploit details
134
+ 7. Inverse mutation test: reverting each vuln breaks its golden path step
135
+
136
+ ## Reward Architecture
137
+
138
+ All rewards implemented as OpenEnv `Rubric` subclasses:
139
+
140
+ ```
141
+ CompositeRedReward (WeightedSum)
+ ├── FlagReward              binary, docker exec verified
+ ├── EfficiencyReward        gamma^steps
+ ├── StealthReward           coupled to Blue's detection history
+ ├── EvidenceReward          quality of submit_evidence
+ └── HallucinationPenalty    -0.3 per fake flag
+
+ CompositeBlueReward (WeightedSum)
+ ├── DetectionReward         TP rate vs Red's action log
+ ├── PatchReward             binary, golden path re-execution
+ ├── AvailabilityReward      healthcheck fraction
+ └── FalsePositiveReward     -0.2 per NPC traffic flagged
153
+ ```
154
+
155
+ Rewards are computed from **container state and action logs**, never from LLM judgment.
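
To make the weighted-sum idea concrete, here is a minimal sketch of how the Red-side components could combine. It does not use the OpenEnv `Rubric` API; the `RedEpisode` type, the weights, and the default `gamma` are assumptions chosen for illustration, and `EvidenceReward` is left out. Only the component definitions (`gamma^steps`, `-0.3` per fake flag, stealth tied to Blue's detection) come from the tree above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedEpisode:
    """Hypothetical per-episode facts read from container state and logs."""
    flag_captured: bool
    steps: int
    detected_by_blue: bool
    fake_flags_submitted: int


def red_reward(ep: RedEpisode, gamma: float = 0.99,
               weights: tuple = (1.0, 0.2, 0.2)) -> float:
    """Weighted sum of flag, efficiency, and stealth, plus the penalty."""
    w_flag, w_efficiency, w_stealth = weights  # assumed weights
    flag = 1.0 if ep.flag_captured else 0.0
    efficiency = gamma ** ep.steps
    stealth = 0.0 if ep.detected_by_blue else 1.0
    hallucination = -0.3 * ep.fake_flags_submitted
    return (w_flag * flag + w_efficiency * efficiency
            + w_stealth * stealth + hallucination)


episode = RedEpisode(flag_captured=True, steps=5,
                     detected_by_blue=True, fake_flags_submitted=1)
total = red_reward(episode, gamma=1.0, weights=(1.0, 1.0, 1.0))
```

Every input to `red_reward` is something the environment can observe directly, which is the property the closing sentence of this section insists on.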
docs/builder-validator.md ADDED
@@ -0,0 +1,119 @@
1
+ # Builder + Validator Design
2
+
3
+ ## Builder LLM
4
+
5
+ The Builder generates vulnerable infrastructure from YAML manifests. It's called:
6
+ - Once at startup (initial range creation)
7
+ - On every `reset()` (mutation: swap vulnerability classes)
8
+
9
+ ### Input
10
+
11
+ ```yaml
12
+ # From the YAML manifest
+ topology:
+   hosts:
+     - name: web
+       zone: dmz
+       services: [nginx, php, sshd]
+     - name: db
+       zone: internal
+       services: [mysql]
+   networks: [dmz, internal]
+
+ difficulty:
+   tier: 1
+   max_steps: 10
+
+ # Plus runtime context
+ previous_vuln_classes: [sqli]  # What was planted last episode
+ agent_solve_rate: 0.6          # How often Red solves (for difficulty calibration)
30
+ ```
31
+
32
+ ### Output (Structured JSON)
33
+
34
+ The Builder outputs a **formal spec**, not prose. Lesson from Self-Play SWE-RL: natural language generation failed with a 32B model. Formal specs are reliable.
35
+
36
+ ```json
37
+ {
+   "vulns": [{
+     "type": "idor",
+     "host": "web",
+     "injection_point": "/api/user/{id}",
+     "vulnerable_code": "...",
+     "flag_location": "/var/flags/flag1.txt"
+   }],
+   "flags": [{
+     "id": "flag1",
+     "value": "FLAG{abc123}",
+     "path": "/var/flags/flag1.txt",
+     "host": "web"
+   }],
+   "golden_path": [{
+     "step": 1,
+     "cmd": "nmap -sV web",
+     "expect_stdout": "80/tcp open http"
+   }],
+   "challenge_description": "A web application with user management. Find the vulnerability."
+ }
58
+ ```
59
+
60
+ A thin template layer (`builder/templates/`) renders the JSON into actual files.
61
+
62
+ ### Mutation Strategy
63
+
64
+ On `reset()`, the Builder:
65
+ 1. Picks a **different** vuln class than the previous episode
66
+ 2. Generates new vulnerable code, flag values, and golden path
67
+ 3. Renders config files via templates
68
+ 4. Hot-swaps into running containers (`docker cp` + service restart)
69
+ 5. Does NOT tear down the full stack; it performs a partial restart only
70
+
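Step 1 of the mutation strategy (pick a different vuln class than the previous episode) could be sketched as below. The catalog contents, the `pick_vuln_class` name, and the use of uniform random choice are assumptions; the source only says the class must differ from the previous one and that an LLM Builder makes the choice.

```python
import random
from typing import Optional, Sequence


def pick_vuln_class(catalog: Sequence[str], previous: Sequence[str],
                    rng: Optional[random.Random] = None) -> str:
    """Choose a vulnerability class that was not planted last episode."""
    rng = rng or random.Random()
    excluded = set(previous)
    candidates = [name for name in catalog if name not in excluded]
    if not candidates:
        raise ValueError("catalog has no class other than the previous ones")
    return rng.choice(candidates)


# Hypothetical catalog; `previous_vuln_classes` mirrors the runtime context.
catalog = ["sqli", "idor", "ssrf", "command_injection"]
next_class = pick_vuln_class(catalog, previous=["sqli"], rng=random.Random(0))
```

Passing in a seeded `random.Random` keeps the selection reproducible, which fits the `seed` argument that `reset()` accepts.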
71
+ ## Validator (Hybrid)
72
+
73
+ The Validator confirms that the Builder's output actually works. It runs in two phases, and both must pass.
74
+
75
+ ### Phase A: LLM Review
76
+
77
+ The Validator LLM reviews the Builder's structured JSON spec:
78
+
79
+ ```
80
+ Given this vulnerability specification and these config files:
81
+ 1. Is the vulnerability actually exploitable? (Would the golden path commands work?)
82
+ 2. Does the challenge description match without revealing the exploit?
83
+ 3. Is the golden path complete? (No missing steps?)
84
+ 4. Is this the right difficulty for tier {N}?
85
+
86
+ Answer YES/NO with brief reasoning for each.
87
+ ```
88
+
89
+ ### Phase B: 7-Check Scripted Execution
90
+
91
+ Run against live containers:
92
+
93
+ | Check | Command | Pass Condition |
94
+ |-------|---------|----------------|
95
+ | 1. Services | `nc -z web 80` | All ports respond |
96
+ | 2. Flags | `docker exec web cat /var/flags/flag1.txt` | Matches spec value |
97
+ | 3. Isolation | `docker exec attacker curl http://db:3306` | Connection fails (refused or timed out) |
98
+ | 4. Golden path | Execute each step | `expect_stdout` found in output |
99
+ | 5. Difficulty | `len(golden_path)` | Within 20% of `max_steps` |
100
+ | 6. No leaks | `grep -i "FLAG{" challenge_description` | No matches |
101
+ | 7. Inverse mutation | Revert each vuln, re-run its golden path step | Step FAILS |
102
+
103
+ Check 7 (inverse mutation) is the most important. It proves each planted vulnerability is necessary. Without it, the Builder could plant a "decorative" vuln that passes validation but isn't actually the path to the flag.
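
Checks 4 to 6 from the table are simple enough to sketch without a running range. In the example below the command runner is injected as a plain callable so the logic can be exercised with canned output; the function names, the reading of "within 20%" as a symmetric tolerance around `max_steps`, and the leak heuristic are all assumptions.

```python
from typing import Callable, Sequence


def golden_path_passes(golden_path: Sequence[dict],
                       run: Callable[[str], str]) -> bool:
    """Check 4: every step's expect_stdout appears in that step's output."""
    return all(step["expect_stdout"] in run(step["cmd"])
               for step in golden_path)


def difficulty_ok(golden_path: Sequence[dict], max_steps: int,
                  tolerance: float = 0.20) -> bool:
    """Check 5: step count within `tolerance` of the target (assumed reading)."""
    return abs(len(golden_path) - max_steps) <= tolerance * max_steps


def description_leaks(description: str, flags: Sequence[dict]) -> bool:
    """Check 6: the description mentions a flag marker or a flag value."""
    lowered = description.lower()
    return "flag{" in lowered or any(
        flag["value"].lower() in lowered for flag in flags)


golden_path = [
    {"step": 1, "cmd": "nmap -sV web", "expect_stdout": "80/tcp open http"},
    {"step": 2, "cmd": "curl http://web/", "expect_stdout": "search"},
]
# Canned outputs stand in for `docker exec` against live containers.
canned = {
    "nmap -sV web": "80/tcp open http nginx",
    "curl http://web/": "<form action='/search'>",
}
path_ok = golden_path_passes(golden_path, canned.__getitem__)
```

Injecting the runner also makes check 7 straightforward to layer on top: revert one vuln, call `golden_path_passes` again, and require that it now returns `False`.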
104
+
105
+ ### Failure Handling
106
+
107
+ ```
108
+ Builder generates spec
+   → Validator Phase A (LLM)      → FAIL → Builder retries with feedback
+   → Validator Phase B (scripted) → FAIL → Builder retries with error context
+   → 3 failures                   → Use last known-good configuration
112
+ ```
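
The retry flow above amounts to a bounded loop with a fallback. A minimal sketch follows, assuming the Builder and Validator are exposed as plain callables; `build_validated_range` and its parameter names are hypothetical, and the two validator phases are collapsed into one `validate` call for brevity.

```python
from typing import Callable, Optional, Tuple


def build_validated_range(
    build: Callable[[Optional[str]], dict],
    validate: Callable[[dict], Tuple[bool, Optional[str]]],
    last_known_good: dict,
    max_attempts: int = 3,
) -> dict:
    """Retry the Builder with validator feedback, then fall back."""
    feedback = None
    for _ in range(max_attempts):
        spec = build(feedback)          # feedback is None on the first try
        passed, feedback = validate(spec)
        if passed:
            return spec
    return last_known_good


# Stand-ins: the second attempt passes, the first one fails.
seen_feedback = []


def fake_build(feedback):
    seen_feedback.append(feedback)
    return {"attempt": len(seen_feedback)}


def fake_validate(spec):
    if spec["attempt"] == 2:
        return True, None
    return False, "golden path step 3 failed"


chosen = build_validated_range(fake_build, fake_validate, {"attempt": 0})
```

Falling back to a known-good spec keeps `reset()` from blocking indefinitely when the Builder cannot produce a valid mutation.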
113
+
114
+ ### Toxic Validation Warning
115
+
116
+ R2E-Gym found ~10% of validations incorrectly favor wrong solutions. Track:
117
+ - False-positive rate (accepted broken ranges that don't produce training signal)
118
+ - False-negative rate (rejected valid ranges unnecessarily)
119
+ - Log every validation decision for post-hoc auditing
docs/openenv-compliance.md ADDED
@@ -0,0 +1,65 @@
1
+ # OpenEnv Compliance Guide
2
+
3
+ OpenRange implements the OpenEnv 0.2.x environment contract. This document maps each requirement to its implementation.
4
+
5
+ ## Checklist
6
+
7
+ | Requirement | Status | Implementation |
8
+ |-------------|--------|----------------|
9
+ | `Environment` subclass | Required | `RangeEnvironment(Environment[RangeAction, RangeObservation, RangeState])` |
10
+ | `reset()` returns `ObsT` | Required | Returns `RangeObservation` |
11
+ | `step()` returns `ObsT` | Required | Returns `RangeObservation` |
12
+ | `state` property returns `StateT` | Required | Returns `RangeState` |
13
+ | `Action` subclass (Pydantic, extra=forbid) | Required | `RangeAction(Action)` with `command`, `mode` |
14
+ | `Observation` subclass (Pydantic, extra=forbid) | Required | `RangeObservation(Observation)` -- inherits `done`, `reward` from base |
15
+ | `State` subclass (Pydantic, extra=allow) | Required | `RangeState(State)` -- inherits `episode_id`, `step_count` from base |
16
+ | `create_app(Class, ActionType, ObsType)` | Required | Pass CLASS not instance |
17
+ | `EnvClient` subclass | Required | `OpenRangeEnv(EnvClient[...])` |
18
+ | `_step_payload()` | Required | Serializes `RangeAction` to dict |
19
+ | `_parse_result()` | Required | Parses server response to `StepResult[RangeObservation]` |
20
+ | `_parse_state()` | Required | Parses server response to `RangeState` |
21
+ | `/health` endpoint | Auto | Provided by `create_app` |
22
+ | `/ws` WebSocket | Auto | Provided by `create_app` |
23
+ | `/reset`, `/step`, `/state` HTTP | Auto | Provided by `create_app` |
24
+ | `Rubric` for rewards | Optional | `CompositeRedReward`, `CompositeBlueReward` as Rubric subclasses |
25
+ | `openenv.yaml` manifest | Required | Environment metadata for HF Spaces |
26
+ | `Dockerfile` | Required | For container deployment |
27
+
28
+ ## Common Mistakes to Avoid
29
+
30
+ 1. **Don't redeclare `done` or `reward` on Observation.** The base class already has them.
31
+ 2. **Don't redeclare `episode_id` or `step_count` on State.** The base class already has them.
32
+ 3. **Pass the CLASS to `create_app()`, not an instance.** Each WebSocket session gets its own instance.
33
+ 4. **Action uses `extra="forbid"`.** Unknown fields cause validation errors. Keep actions minimal.
34
+ 5. **State uses `extra="allow"`.** You can add any fields you want.
35
+ 6. **`reset()` returns ObsT (server-side), `StepResult[ObsT]` (client-side).** The server wraps it.
36
+
37
+ ## API Signatures (Exact)
38
+
39
+ ```python
40
+ from typing import Optional
+
+ # Server-side
+ class RangeEnvironment(Environment[RangeAction, RangeObservation, RangeState]):
+     def reset(self, seed: Optional[int] = None,
+               episode_id: Optional[str] = None, **kwargs) -> RangeObservation: ...
+     def step(self, action: RangeAction,
+              timeout_s: Optional[float] = None, **kwargs) -> RangeObservation: ...
+     @property
+     def state(self) -> RangeState: ...
+
+ # Client-side
+ class OpenRangeEnv(EnvClient[RangeAction, RangeObservation, RangeState]):
+     def _step_payload(self, action: RangeAction) -> dict: ...
+     def _parse_result(self, payload: dict) -> StepResult[RangeObservation]: ...
+     def _parse_state(self, payload: dict) -> RangeState: ...
+
+ # App factory (pass the class, not an instance)
+ app = create_app(RangeEnvironment, RangeAction, RangeObservation, env_name="open_range")
57
+ ```
58
+
59
+ ## Reference Implementations
60
+
61
+ Study these OpenEnv environments as patterns:
62
+
63
+ - **`envs/coding_env/`** -- closest analog (execute code, get stdout/stderr). Uses `Environment` base.
64
+ - **`envs/echo_env/`** -- simplest possible environment. Uses `MCPEnvironment` base.
65
+ - **`envs/finqa_env/`** -- MCP tool-based with complex rewards. Uses `MCPEnvironment` base.