Spaces: abrown31/open-range
Aaron Brown, Claude Opus 4.6 committed on Mar 8

Commit 819cfef
1 Parent(s): fd25822

Align docs with hackathon statements: multi-agent, self-improvement, simulated experts

- README: zero-sum multi-agent framing (Statement 1), long-horizon sparse
reward framing (Statement 2), curriculum feedback loop section (Statement 4),
simulated expert NPC framing (Snorkel), tier-scaled reward mention (Mercor)
- architecture.md: ComplexityBonus in reward tree, tier-scaled reward ceiling
table, curriculum section upgraded from post-hackathon to core
- IMPLEMENTATION_PLAN.md: curriculum feedback moved to Phase 3.8
- CLAUDE.md: r_complexity reward signal added to both Red and Blue tables
- Issue #11: updated with ComplexityBonus and tier multiplier table
- Issue #34: created for curriculum feedback loop

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (2)

README.md +32 -5
docs/architecture.md +24 -5

README.md CHANGED

@@ -1,6 +1,6 @@
 # OpenRange
-**Multi-agent cyber range with validated company snapshots, coupled Red/Blue rewards, and evolving enterprise worlds.**
 The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
@@ -36,7 +36,7 @@ The OpenEnv runtime stays standard:
 | **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
 | **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
-Red and Blue operate on the **same infrastructure simultaneously**. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs.
 ## Architecture
@@ -146,7 +146,7 @@ flowchart TB
 Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
-NPCs evolve from shell-script noise generators to **LLM-driven employees** with persona cards, susceptibility profiles, and realistic communication styles. Red can craft spearphishing emails, pretext calls, and watering-hole attacks against NPCs who decide whether to click, ignore, or report based on their security awareness. Blue must detect these social engineering campaigns in logs alongside normal traffic.
 ## Episode Lifecycle
@@ -279,7 +279,9 @@ with OpenRangeEnv('http://localhost:8000').sync() as env:
 ## Reward Signals
-All rewards are **verifiable** -- grounded in real container state, not LLM judgment.
 ```mermaid
 flowchart TB
@@ -374,6 +376,31 @@ flowchart TD
     style t3 fill:#ff6b6b22,stroke:#ff6b6b
 ```
 ## Tandem Red + Blue Training
 ```mermaid
@@ -434,7 +461,7 @@ open-range/
 ## Built On
 - [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
-- Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification), and [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs and inverse mutation testing)
 ## License

 # OpenRange
+**Multi-agent cyber range with zero-sum Red/Blue dynamics, validated company snapshots, and self-improving enterprise worlds.**
 The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
 | **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
 | **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
+Red and Blue operate on the **same infrastructure simultaneously** in a zero-sum adversarial dynamic. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs. This multi-agent coupling creates natural co-evolution: as Red learns stealth, Blue must learn deeper detection -- and vice versa.
 ## Architecture
 Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
+NPCs evolve from shell-script noise generators to **LLM-driven simulated experts** -- employees with persona cards, susceptibility profiles, and realistic communication styles. These are domain-specialized LLM agents (marketing coordinator, CISO, IT admin) that generate authentic enterprise behavior: sending emails, filing tickets, browsing intranet, and responding to social engineering attempts based on their security awareness level. Red can craft spearphishing emails, pretext calls, and watering-hole attacks against NPCs who decide whether to click, ignore, or report. Blue must detect these social engineering campaigns in logs alongside normal NPC traffic.
 ## Episode Lifecycle
 ## Reward Signals
+Episodes are **long-horizon** (8-50+ steps depending on tier) with **sparse delayed rewards**. Flag capture is binary and only fires at the end of a successful exploit chain. Stealth and detection rewards are computed at episode end from the full action log. Intermediate steps yield only small efficiency signals -- agents must learn to plan multi-step strategies without dense per-action feedback.
+All rewards are **verifiable** -- grounded in real container state, not LLM judgment. Reward ceilings **scale with environment complexity**: higher-tier snapshots (more hosts, zones, and chained vulnerabilities) offer proportionally larger maximum rewards, ensuring the training signal grows with output quality.
 ```mermaid
 flowchart TB
     style t3 fill:#ff6b6b22,stroke:#ff6b6b
 ```
+## Curriculum Feedback Loop
+OpenRange is **self-improving**. Per-snapshot solve rates and detection rates feed back to the Builder, which adjusts the next snapshot's difficulty and vulnerability mix to target the frontier of agent capability.
+```
+Episode results (solve rate, detection rate, time-to-flag)
+    |
+    v
+Curriculum tracker (per vuln class, per tier)
+    |
+    v
+Builder receives runtime_context:
+  { red_solve_rate: 0.6, blue_detect_rate: 0.4,
+    previous_vuln_classes: [sqli, weak_creds],
+    weak_areas: [ssrf, chained_vulns] }
+    |
+    v
+Next snapshot targets agent weaknesses:
+  - If Red solves SQLi easily → seed SSRF or chained vulns
+  - If Blue misses lateral movement → add more pivot points
+  - Difficulty adjusts via r_inject = 1 - (1+α)·s
+```
+The Builder LLM acts as a **simulated expert curriculum designer** -- it doesn't just randomize, it analyzes agent performance and generates challenges calibrated to the learning frontier. This is the same frontier-calibrating reward from Self-Play SWE-RL, adapted for cybersecurity.
 ## Tandem Red + Blue Training
 ```mermaid
 ## Built On
 - [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
+- Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification), [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs and inverse mutation testing), and [Snorkel](https://www.snorkel.ai/) (simulated domain experts for data generation)
 ## License
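
The curriculum feedback loop added to the README in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `RuntimeContext` class, the `next_vuln_classes` helper, and the default `alpha = 0.5` are hypothetical, and `s` is assumed to be the Red solve rate in [0, 1]. Only the field names and the formula `r_inject = 1 - (1+α)·s` come from the diff.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeContext:
    """Hypothetical container mirroring the runtime_context fields shown in the diff."""
    red_solve_rate: float
    blue_detect_rate: float
    previous_vuln_classes: list[str] = field(default_factory=list)
    weak_areas: list[str] = field(default_factory=list)


def r_inject(solve_rate: float, alpha: float = 0.5) -> float:
    """Compute r_inject = 1 - (1 + alpha) * s.

    Assumption: s is the Red solve rate; alpha's default is illustrative only.
    """
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    return 1.0 - (1.0 + alpha) * solve_rate


def next_vuln_classes(ctx: RuntimeContext, limit: int = 2) -> list[str]:
    """Pick weak areas that the previous snapshot did not already exercise."""
    fresh = [v for v in ctx.weak_areas if v not in ctx.previous_vuln_classes]
    return fresh[:limit]


# Values taken from the runtime_context example in the README diff.
ctx = RuntimeContext(
    red_solve_rate=0.6,
    blue_detect_rate=0.4,
    previous_vuln_classes=["sqli", "weak_creds"],
    weak_areas=["ssrf", "chained_vulns"],
)
print(round(r_inject(ctx.red_solve_rate), 3))
print(next_vuln_classes(ctx))
```

Under these assumptions the signal falls linearly as the solve rate rises and turns negative once `s` exceeds `1 / (1 + alpha)`, which is one way to read "targeting the frontier": snapshots that Red solves too reliably are discouraged.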

docs/architecture.md CHANGED

@@ -179,12 +179,15 @@ sequenceDiagram
     Note over T,R: Episode ends: flag captured, max steps, or timeout
 ```
-### Curriculum (post-hackathon)
-1. Track Red solve rate and Blue detection rate per snapshot
-2. Feed failure stats back to builder for next mutation
-3. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s`
-4. When agents plateau: horizontal growth (add containers, zones, services)
 ## Snapshot Artifacts
@@ -212,6 +215,7 @@ CompositeRedReward (WeightedSum)
   ├── StealthReward           coupled to Blue detection history
   ├── EvidenceReward          quality of submit_evidence
   ├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
   └── HallucinationPenalty    -0.3 per fake flag
 CompositeBlueReward (WeightedSum)
@@ -219,11 +223,26 @@ CompositeBlueReward (WeightedSum)
   ├── PatchReward             binary, golden path re-execution
   ├── AvailabilityReward      healthcheck fraction
   ├── PhishingDetection       correctly identified social engineering in logs (Level 1+)
   └── FalsePositiveReward     -0.2 per NPC traffic/email flagged
 ```
 Rewards are computed from **container state and action logs**, never from LLM judgment.
 ## NPC Evolution: Shell Scripts to LLM Agents
 NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.

     Note over T,R: Episode ends: flag captured, max steps, or timeout
 ```
+### Curriculum Feedback
+The Builder acts as a **simulated expert curriculum designer**. Episode results feed back to shape future snapshots:
+1. Track Red solve rate and Blue detection rate per snapshot (per vuln class, per tier)
+2. Feed failure stats to Builder as `runtime_context` on next build
+3. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s` (frontier calibration from SWE-RL)
+4. Target agent weaknesses: if Red masters SQLi, seed SSRF or chained vulns next
+5. When agents plateau: horizontal growth (add containers, zones, services)
 ## Snapshot Artifacts
   ├── StealthReward           coupled to Blue detection history
   ├── EvidenceReward          quality of submit_evidence
   ├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
+  ├── ComplexityBonus         tier_multiplier * base_reward (scales with snapshot complexity)
   └── HallucinationPenalty    -0.3 per fake flag
 CompositeBlueReward (WeightedSum)
   ├── PatchReward             binary, golden path re-execution
   ├── AvailabilityReward      healthcheck fraction
   ├── PhishingDetection       correctly identified social engineering in logs (Level 1+)
+  ├── ComplexityBonus         tier_multiplier * base_reward (scales with snapshot complexity)
   └── FalsePositiveReward     -0.2 per NPC traffic/email flagged
 ```
 Rewards are computed from **container state and action logs**, never from LLM judgment.
+### Tier-Scaled Reward Ceiling
+Reward ceilings scale with environment complexity so that harder snapshots produce proportionally larger training signals:
+| Tier | Hosts | Multiplier | Max Red Reward | Max Blue Reward |
+|------|-------|-----------|----------------|-----------------|
+| 1 | 6-8 | 1.0x | 1.0 | 1.0 |
+| 2 | 10-12 | 1.5x | 1.5 | 1.5 |
+| 3 | 14-18 | 2.0x | 2.0 | 2.0 |
+| 4 | 20-25 | 2.5x | 2.5 | 2.5 |
+| 5 | 30+ | 3.0x | 3.0 | 3.0 |
+This ensures agents are incentivized to attempt harder environments rather than grinding easy Tier 1 snapshots.
 ## NPC Evolution: Shell Scripts to LLM Agents
 NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.
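
The tier-scaled reward ceiling table added to docs/architecture.md in this commit could be applied along these lines. This is a sketch under stated assumptions: the multipliers are copied from the table, but `TIER_MULTIPLIER`, `reward_ceiling`, and `scaled_reward` are hypothetical names, and capping the scaled value at the ceiling is an assumption about how `tier_multiplier * base_reward` is meant to be bounded.

```python
# Multipliers copied from the tier table in the diff; the names are hypothetical.
TIER_MULTIPLIER = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5, 5: 3.0}


def reward_ceiling(tier: int, base_ceiling: float = 1.0) -> float:
    """Return the maximum reward for a tier (the table's Max Red/Blue Reward)."""
    if tier not in TIER_MULTIPLIER:
        raise ValueError(f"unknown tier: {tier}")
    return TIER_MULTIPLIER[tier] * base_ceiling


def scaled_reward(tier: int, base_reward: float) -> float:
    """Scale a base reward by the tier multiplier.

    Assumption: the result is capped at the tier ceiling; penalties
    (negative values) are scaled but not capped from below.
    """
    return min(TIER_MULTIPLIER[tier] * base_reward, reward_ceiling(tier))


for tier in sorted(TIER_MULTIPLIER):
    print(tier, reward_ceiling(tier), scaled_reward(tier, 0.5))
```

With this reading, the same base performance earns more on a Tier 5 snapshot than on a Tier 1 snapshot, which matches the stated goal of discouraging agents from grinding easy environments.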