Aaron Brown Claude Opus 4.6 committed on
Commit
819cfef
·
1 Parent(s): fd25822

Align docs with hackathon statements: multi-agent, self-improvement, simulated experts


- README: zero-sum multi-agent framing (Statement 1), long-horizon sparse
reward framing (Statement 2), curriculum feedback loop section (Statement 4),
simulated expert NPC framing (Snorkel), tier-scaled reward mention (Mercor)
- architecture.md: ComplexityBonus in reward tree, tier-scaled reward ceiling
table, curriculum section upgraded from post-hackathon to core
- IMPLEMENTATION_PLAN.md: curriculum feedback moved to Phase 3.8
- CLAUDE.md: r_complexity reward signal added to both Red and Blue tables
- Issue #11: updated with ComplexityBonus and tier multiplier table
- Issue #34: created for curriculum feedback loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (2)
  1. README.md +32 -5
  2. docs/architecture.md +24 -5
README.md CHANGED
@@ -1,6 +1,6 @@
1
  # OpenRange
2
 
3
- **Multi-agent cyber range with validated company snapshots, coupled Red/Blue rewards, and evolving enterprise worlds.**
4
 
5
  The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
6
 
@@ -36,7 +36,7 @@ The OpenEnv runtime stays standard:
36
  | **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
37
  | **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
38
 
39
- Red and Blue operate on the **same infrastructure simultaneously**. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs.
40
 
41
  ## Architecture
42
 
@@ -146,7 +146,7 @@ flowchart TB
146
 
147
  Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
148
 
149
- NPCs evolve from shell-script noise generators to **LLM-driven employees** with persona cards, susceptibility profiles, and realistic communication styles. Red can craft spearphishing emails, pretext calls, and watering-hole attacks against NPCs who decide whether to click, ignore, or report based on their security awareness. Blue must detect these social engineering campaigns in logs alongside normal traffic.
150
 
151
  ## Episode Lifecycle
152
 
@@ -279,7 +279,9 @@ with OpenRangeEnv('http://localhost:8000').sync() as env:
279
 
280
  ## Reward Signals
281
 
282
- All rewards are **verifiable** -- grounded in real container state, not LLM judgment.
 
 
283
 
284
  ```mermaid
285
  flowchart TB
@@ -374,6 +376,31 @@ flowchart TD
374
  style t3 fill:#ff6b6b22,stroke:#ff6b6b
375
  ```
376
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
377
  ## Tandem Red + Blue Training
378
 
379
  ```mermaid
@@ -434,7 +461,7 @@ open-range/
434
  ## Built On
435
 
436
  - [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
437
- - Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification), and [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs and inverse mutation testing)
438
 
439
  ## License
440
 
 
1
  # OpenRange
2
 
3
+ **Multi-agent cyber range with zero-sum Red/Blue dynamics, validated company snapshots, and self-improving enterprise worlds.**
4
 
5
  The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
6
 
 
36
  | **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
37
  | **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
38
 
39
+ Red and Blue operate on the **same infrastructure simultaneously** in a zero-sum adversarial dynamic. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs. This multi-agent coupling creates natural co-evolution: as Red learns stealth, Blue must learn deeper detection -- and vice versa.
40
 
41
  ## Architecture
42
 
 
146
 
147
  Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
148
 
149
+ NPCs evolve from shell-script noise generators to **LLM-driven simulated experts** -- employees with persona cards, susceptibility profiles, and realistic communication styles. These are domain-specialized LLM agents (marketing coordinator, CISO, IT admin) that generate authentic enterprise behavior: sending emails, filing tickets, browsing intranet, and responding to social engineering attempts based on their security awareness level. Red can craft spearphishing emails, pretext calls, and watering-hole attacks against NPCs who decide whether to click, ignore, or report. Blue must detect these social engineering campaigns in logs alongside normal NPC traffic.
150
 
151
  ## Episode Lifecycle
152
 
 
279
 
280
  ## Reward Signals
281
 
282
+ Episodes are **long-horizon** (8-50+ steps depending on tier) with **sparse delayed rewards**. Flag capture is binary and only fires at the end of a successful exploit chain. Stealth and detection rewards are computed at episode end from the full action log. Intermediate steps yield only small efficiency signals -- agents must learn to plan multi-step strategies without dense per-action feedback.
283
+
284
+ All rewards are **verifiable** -- grounded in real container state, not LLM judgment. Reward ceilings **scale with environment complexity**: higher-tier snapshots (more hosts, zones, and chained vulnerabilities) offer proportionally larger maximum rewards, ensuring the training signal grows with output quality.
285
 
286
  ```mermaid
287
  flowchart TB
 
376
  style t3 fill:#ff6b6b22,stroke:#ff6b6b
377
  ```
378
 
379
+ ## Curriculum Feedback Loop
380
+
381
+ OpenRange is **self-improving**. Per-snapshot solve rates and detection rates feed back to the Builder, which adjusts the next snapshot's difficulty and vulnerability mix to target the frontier of agent capability.
382
+
383
+ ```
384
+ Episode results (solve rate, detection rate, time-to-flag)
385
+ |
386
+ v
387
+ Curriculum tracker (per vuln class, per tier)
388
+ |
389
+ v
390
+ Builder receives runtime_context:
391
+ { red_solve_rate: 0.6, blue_detect_rate: 0.4,
392
+ previous_vuln_classes: [sqli, weak_creds],
393
+ weak_areas: [ssrf, chained_vulns] }
394
+ |
395
+ v
396
+ Next snapshot targets agent weaknesses:
397
+ - If Red solves SQLi easily → seed SSRF or chained vulns
398
+ - If Blue misses lateral movement → add more pivot points
399
+ - Difficulty adjusts via r_inject = 1 - (1+α)·s
400
+ ```
401
+
402
+ The Builder LLM acts as a **simulated expert curriculum designer** -- it doesn't just randomize, it analyzes agent performance and generates challenges calibrated to the learning frontier. This is the same frontier-calibrating reward from Self-Play SWE-RL, adapted for cybersecurity.
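
A minimal Python sketch of the difficulty signal named in the loop above, `r_inject = 1 - (1+α)·s`. It assumes `s` is the Red solve rate in [0, 1] and that `alpha` is a tunable weight; the function name `inject_reward` is hypothetical and is not taken from the OpenRange codebase.

```python
def inject_reward(solve_rate: float, alpha: float) -> float:
    """Score a snapshot with r_inject = 1 - (1 + alpha) * s.

    Assumption: `solve_rate` (s) is the fraction of episodes in which
    Red solved the snapshot, so it must lie in [0, 1].
    """
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    return 1.0 - (1.0 + alpha) * solve_rate


if __name__ == "__main__":
    # alpha=0.5 is an illustrative value only, not one stated by the project.
    for s in (0.0, 0.25, 0.5, 1.0):
        print(f"s={s:.2f} -> r_inject={inject_reward(s, alpha=0.5):+.3f}")
```

Under this formula the score falls linearly as the solve rate rises and crosses zero at `s = 1 / (1 + alpha)`, so snapshots that Red solves too often are scored negatively.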
403
+
404
  ## Tandem Red + Blue Training
405
 
406
  ```mermaid
 
461
  ## Built On
462
 
463
  - [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
464
+ - Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification), [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs and inverse mutation testing), and [Snorkel](https://www.snorkel.ai/) (simulated domain experts for data generation)
465
 
466
  ## License
467
 
docs/architecture.md CHANGED
@@ -179,12 +179,15 @@ sequenceDiagram
179
  Note over T,R: Episode ends: flag captured, max steps, or timeout
180
  ```
181
 
182
- ### Curriculum (post-hackathon)
183
 
184
- 1. Track Red solve rate and Blue detection rate per snapshot
185
- 2. Feed failure stats back to builder for next mutation
186
- 3. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s`
187
- 4. When agents plateau: horizontal growth (add containers, zones, services)
 
 
 
188
 
189
  ## Snapshot Artifacts
190
 
@@ -212,6 +215,7 @@ CompositeRedReward (WeightedSum)
212
  ├── StealthReward coupled to Blue detection history
213
  ├── EvidenceReward quality of submit_evidence
214
  ├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
 
215
  └── HallucinationPenalty -0.3 per fake flag
216
 
217
  CompositeBlueReward (WeightedSum)
@@ -219,11 +223,26 @@ CompositeBlueReward (WeightedSum)
219
  ├── PatchReward binary, golden path re-execution
220
  ├── AvailabilityReward healthcheck fraction
221
  ├── PhishingDetection correctly identified social engineering in logs (Level 1+)
 
222
  └── FalsePositiveReward -0.2 per NPC traffic/email flagged
223
  ```
224
 
225
  Rewards are computed from **container state and action logs**, never from LLM judgment.
226
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
227
  ## NPC Evolution: Shell Scripts to LLM Agents
228
 
229
  NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.
 
179
  Note over T,R: Episode ends: flag captured, max steps, or timeout
180
  ```
181
 
182
+ ### Curriculum Feedback
183
 
184
+ The Builder acts as a **simulated expert curriculum designer**. Episode results feed back to shape future snapshots:
185
+
186
+ 1. Track Red solve rate and Blue detection rate per snapshot (per vuln class, per tier)
187
+ 2. Feed failure stats to Builder as `runtime_context` on next build
188
+ 3. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s` (frontier calibration from SWE-RL)
189
+ 4. Target agent weaknesses: if Red masters SQLi, seed SSRF or chained vulns next
190
+ 5. When agents plateau: horizontal growth (add containers, zones, services)
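
As a rough illustration of steps 1 and 2, the sketch below aggregates episode results into a `runtime_context` dictionary with the same keys shown in the README example (`red_solve_rate`, `blue_detect_rate`, `previous_vuln_classes`, `weak_areas`). Everything else is an assumption made for illustration: the `EpisodeResult` type, the `build_runtime_context` helper, and the rule that a "weak area" is a vulnerability class whose Red solve rate falls below a threshold. The real tracker may define these differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-episode record; field names are assumptions."""
    vuln_class: str
    red_solved: bool
    blue_detected: bool


def build_runtime_context(history, previous_vuln_classes, weak_threshold=0.3):
    """Summarize episode history into a runtime_context dict for the Builder.

    Assumption: a vuln class is "weak" when Red's solve rate on it is
    below `weak_threshold` (an illustrative default).
    """
    if not history:
        raise ValueError("history must contain at least one episode")

    # Group Red outcomes by vulnerability class.
    solved_by_class = defaultdict(list)
    for ep in history:
        solved_by_class[ep.vuln_class].append(ep.red_solved)
    per_class_rate = {c: sum(v) / len(v) for c, v in solved_by_class.items()}

    return {
        "red_solve_rate": sum(ep.red_solved for ep in history) / len(history),
        "blue_detect_rate": sum(ep.blue_detected for ep in history) / len(history),
        "previous_vuln_classes": list(previous_vuln_classes),
        "weak_areas": sorted(c for c, r in per_class_rate.items() if r < weak_threshold),
    }
```

A Builder prompt could then embed this dictionary directly, so the next snapshot is conditioned on the classes where Red is currently weakest.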
191
 
192
  ## Snapshot Artifacts
193
 
 
215
  ├── StealthReward coupled to Blue detection history
216
  ├── EvidenceReward quality of submit_evidence
217
  ├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
218
+ ├── ComplexityBonus tier_multiplier * base_reward (scales with snapshot complexity)
219
  └── HallucinationPenalty -0.3 per fake flag
220
 
221
  CompositeBlueReward (WeightedSum)
 
223
  ├── PatchReward binary, golden path re-execution
224
  ├── AvailabilityReward healthcheck fraction
225
  ├── PhishingDetection correctly identified social engineering in logs (Level 1+)
226
+ ├── ComplexityBonus tier_multiplier * base_reward (scales with snapshot complexity)
227
  └── FalsePositiveReward -0.2 per NPC traffic/email flagged
228
  ```
229
 
230
  Rewards are computed from **container state and action logs**, never from LLM judgment.
231
 
232
+ ### Tier-Scaled Reward Ceiling
233
+
234
+ Reward ceilings scale with environment complexity so that harder snapshots produce proportionally larger training signals:
235
+
236
+ | Tier | Hosts | Multiplier | Max Red Reward | Max Blue Reward |
237
+ |------|-------|-----------|----------------|-----------------|
238
+ | 1 | 6-8 | 1.0x | 1.0 | 1.0 |
239
+ | 2 | 10-12 | 1.5x | 1.5 | 1.5 |
240
+ | 3 | 14-18 | 2.0x | 2.0 | 2.0 |
241
+ | 4 | 20-25 | 2.5x | 2.5 | 2.5 |
242
+ | 5 | 30+ | 3.0x | 3.0 | 3.0 |
243
+
244
+ This ensures agents are incentivized to attempt harder environments rather than grinding easy Tier 1 snapshots.
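
The table can be read as a lookup from tier to multiplier. The sketch below encodes the five multipliers from the table and applies `tier_multiplier * base_reward`. It assumes `base_reward` is normalized to [0, 1], so the ceiling for each tier equals its multiplier; the names `TIER_MULTIPLIER` and `scaled_reward` are hypothetical.

```python
# Multipliers copied from the tier table above.
TIER_MULTIPLIER = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5, 5: 3.0}


def scaled_reward(base_reward: float, tier: int) -> float:
    """Scale a normalized reward by the tier multiplier.

    Assumption: `base_reward` lies in [0, 1], so the maximum scaled
    reward for a tier equals that tier's multiplier.
    """
    if tier not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {tier}")
    if not 0.0 <= base_reward <= 1.0:
        raise ValueError("base_reward must be in [0, 1]")
    return TIER_MULTIPLIER[tier] * base_reward


if __name__ == "__main__":
    for tier in sorted(TIER_MULTIPLIER):
        print(f"tier {tier}: ceiling = {scaled_reward(1.0, tier)}")
```

With this reading, a half-solved Tier 5 snapshot (`scaled_reward(0.5, 5)`) is worth more than a fully solved Tier 1 snapshot, which matches the stated incentive against grinding easy tiers.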
245
+
246
  ## NPC Evolution: Shell Scripts to LLM Agents
247
 
248
  NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.