Aaron Brown, Claude Opus 4.6 committed on
Commit · 819cfef
1 Parent(s): fd25822
Align docs with hackathon statements: multi-agent, self-improvement, simulated experts
- README: zero-sum multi-agent framing (Statement 1), long-horizon sparse
reward framing (Statement 2), curriculum feedback loop section (Statement 4),
simulated expert NPC framing (Snorkel), tier-scaled reward mention (Mercor)
- architecture.md: ComplexityBonus in reward tree, tier-scaled reward ceiling
table, curriculum section upgraded from post-hackathon to core
- IMPLEMENTATION_PLAN.md: curriculum feedback moved to Phase 3.8
- CLAUDE.md: r_complexity reward signal added to both Red and Blue tables
- Issue #11: updated with ComplexityBonus and tier multiplier table
- Issue #34: created for curriculum feedback loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- README.md +32 -5
- docs/architecture.md +24 -5
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# OpenRange
|
| 2 |
|
| 3 |
-
**Multi-agent cyber range with
|
| 4 |
|
| 5 |
The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
|
| 6 |
|
|
@@ -36,7 +36,7 @@ The OpenEnv runtime stays standard:
|
|
| 36 |
| **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
|
| 37 |
| **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
|
| 38 |
|
| 39 |
-
Red and Blue operate on the **same infrastructure simultaneously**. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs.
|
| 40 |
|
| 41 |
## Architecture
|
| 42 |
|
|
@@ -146,7 +146,7 @@ flowchart TB
|
|
| 146 |
|
| 147 |
Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
|
| 148 |
|
| 149 |
-
NPCs evolve from shell-script noise generators to **LLM-driven
|
| 150 |
|
| 151 |
## Episode Lifecycle
|
| 152 |
|
|
@@ -279,7 +279,9 @@ with OpenRangeEnv('http://localhost:8000').sync() as env:
|
|
| 279 |
|
| 280 |
## Reward Signals
|
| 281 |
|
| 282 |
-
|
| 283 |
|
| 284 |
```mermaid
|
| 285 |
flowchart TB
|
|
@@ -374,6 +376,31 @@ flowchart TD
|
|
| 374 |
style t3 fill:#ff6b6b22,stroke:#ff6b6b
|
| 375 |
```
|
| 376 |
|
| 377 |
## Tandem Red + Blue Training
|
| 378 |
|
| 379 |
```mermaid
|
|
@@ -434,7 +461,7 @@ open-range/
|
|
| 434 |
## Built On
|
| 435 |
|
| 436 |
- [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
|
| 437 |
-
- Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification),
|
| 438 |
|
| 439 |
## License
|
| 440 |
|
|
|
|
| 1 |
# OpenRange
|
| 2 |
|
| 3 |
+
**Multi-agent cyber range with zero-sum Red/Blue dynamics, validated company snapshots, and self-improving enterprise worlds.**
|
| 4 |
|
| 5 |
The first cybersecurity environment in the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) ecosystem.
|
| 6 |
|
|
|
|
| 36 |
| **Red** | External attacker. Recon, exploit, pivot, escalate, exfiltrate. | Outside the firewall -- no creds, no access |
|
| 37 |
| **Blue** | Internal defender. SIEM analysis, patching, firewall rules, incident response. | SOC workstation on management network |
|
| 38 |
|
| 39 |
+
Red and Blue operate on the **same infrastructure simultaneously** in a zero-sum adversarial dynamic. Red's stealth reward depends on whether Blue catches them. Blue's detection reward depends on Red's actual actions in the logs. This multi-agent coupling creates natural co-evolution: as Red learns stealth, Blue must learn deeper detection -- and vice versa.
|
| 40 |
|
| 41 |
## Architecture
|
| 42 |
|
|
|
|
| 146 |
|
| 147 |
Every service is real. The web app queries the database. Users authenticate against LDAP. Mail flows through Postfix. Logs stream to the SIEM. NPC traffic simulates employees browsing, sending email, and running cron jobs -- so Blue can't just flag everything as malicious.
|
| 148 |
|
| 149 |
+
NPCs evolve from shell-script noise generators to **LLM-driven simulated experts** -- employees with persona cards, susceptibility profiles, and realistic communication styles. These are domain-specialized LLM agents (marketing coordinator, CISO, IT admin) that generate authentic enterprise behavior: sending emails, filing tickets, browsing intranet, and responding to social engineering attempts based on their security awareness level. Red can craft spearphishing emails, pretext calls, and watering-hole attacks against NPCs who decide whether to click, ignore, or report. Blue must detect these social engineering campaigns in logs alongside normal NPC traffic.
|
| 150 |
|
| 151 |
## Episode Lifecycle
|
| 152 |
|
|
|
|
| 279 |
|
| 280 |
## Reward Signals
|
| 281 |
|
| 282 |
+
Episodes are **long-horizon** (8-50+ steps depending on tier) with **sparse delayed rewards**. Flag capture is binary and only fires at the end of a successful exploit chain. Stealth and detection rewards are computed at episode end from the full action log. Intermediate steps yield only small efficiency signals -- agents must learn to plan multi-step strategies without dense per-action feedback.
|
| 283 |
+
|
| 284 |
+
All rewards are **verifiable** -- grounded in real container state, not LLM judgment. Reward ceilings **scale with environment complexity**: higher-tier snapshots (more hosts, zones, and chained vulnerabilities) offer proportionally larger maximum rewards, ensuring the training signal grows with output quality.
|
| 285 |
|
| 286 |
```mermaid
|
| 287 |
flowchart TB
|
|
|
|
| 376 |
style t3 fill:#ff6b6b22,stroke:#ff6b6b
|
| 377 |
```
|
| 378 |
|
| 379 |
+
## Curriculum Feedback Loop
|
| 380 |
+
|
| 381 |
+
OpenRange is **self-improving**. Per-snapshot solve rates and detection rates feed back to the Builder, which adjusts the next snapshot's difficulty and vulnerability mix to target the frontier of agent capability.
|
| 382 |
+
|
| 383 |
+
```
|
| 384 |
+
Episode results (solve rate, detection rate, time-to-flag)
|
| 385 |
+
|
|
| 386 |
+
v
|
| 387 |
+
Curriculum tracker (per vuln class, per tier)
|
| 388 |
+
|
|
| 389 |
+
v
|
| 390 |
+
Builder receives runtime_context:
|
| 391 |
+
{ red_solve_rate: 0.6, blue_detect_rate: 0.4,
|
| 392 |
+
previous_vuln_classes: [sqli, weak_creds],
|
| 393 |
+
weak_areas: [ssrf, chained_vulns] }
|
| 394 |
+
|
|
| 395 |
+
v
|
| 396 |
+
Next snapshot targets agent weaknesses:
|
| 397 |
+
- If Red solves SQLi easily → seed SSRF or chained vulns
|
| 398 |
+
- If Blue misses lateral movement → add more pivot points
|
| 399 |
+
- Difficulty adjusts via r_inject = 1 - (1+α)·s
|
| 400 |
+
```
|
| 401 |
+
|
| 402 |
+
The Builder LLM acts as a **simulated expert curriculum designer** -- it doesn't just randomize, it analyzes agent performance and generates challenges calibrated to the learning frontier. This is the same frontier-calibrating reward from Self-Play SWE-RL, adapted for cybersecurity.
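As a rough illustration of the difficulty signal named in the diagram above, the formula `r_inject = 1 - (1+α)·s` can be written as a small Python function. This is a sketch, not OpenRange's implementation: the function name, the default `alpha`, and the reading of `s` as Red's solve rate in [0, 1] are assumptions.

```python
def r_inject(solve_rate: float, alpha: float = 0.5) -> float:
    """Frontier-calibration signal, r_inject = 1 - (1 + alpha) * s.

    solve_rate: assumed to be Red's solve rate s for a snapshot, in [0, 1].
    alpha: non-negative constant; the default 0.5 is a placeholder, not a
           value taken from the OpenRange docs.
    """
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    if alpha < 0.0:
        raise ValueError("alpha must be non-negative")
    return 1.0 - (1.0 + alpha) * solve_rate


if __name__ == "__main__":
    # Using the example solve rate from the runtime_context shown above.
    print(r_inject(0.6, alpha=0.5))
```

Taken literally, the formula gives the highest value to snapshots that are never solved. The text here does not say how the extremes `s = 0` and `s = 1` are treated, so the sketch leaves them unmodified.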
|
| 403 |
+
|
| 404 |
## Tandem Red + Blue Training
|
| 405 |
|
| 406 |
```mermaid
|
|
|
|
| 461 |
## Built On
|
| 462 |
|
| 463 |
- [OpenEnv](https://github.com/meta-pytorch/OpenEnv) -- standardized agentic execution environments
|
| 464 |
+
- Design ideas from PAIRED / UED (generate inside a legal family), POET (mutate plus admit), [R2E-Gym](https://arxiv.org/abs/2504.07164) (executable verification), [Self-Play SWE-RL](https://arxiv.org/abs/2512.18552) (formal specs and inverse mutation testing), and [Snorkel](https://www.snorkel.ai/) (simulated domain experts for data generation)
|
| 465 |
|
| 466 |
## License
|
| 467 |
|
docs/architecture.md
CHANGED
|
@@ -179,12 +179,15 @@ sequenceDiagram
|
|
| 179 |
Note over T,R: Episode ends: flag captured, max steps, or timeout
|
| 180 |
```
|
| 181 |
|
| 182 |
-
### Curriculum
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
|
| 189 |
## Snapshot Artifacts
|
| 190 |
|
|
@@ -212,6 +215,7 @@ CompositeRedReward (WeightedSum)
|
|
| 212 |
├── StealthReward coupled to Blue detection history
|
| 213 |
├── EvidenceReward quality of submit_evidence
|
| 214 |
├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
|
|
|
|
| 215 |
└── HallucinationPenalty -0.3 per fake flag
|
| 216 |
|
| 217 |
CompositeBlueReward (WeightedSum)
|
|
@@ -219,11 +223,26 @@ CompositeBlueReward (WeightedSum)
|
|
| 219 |
├── PatchReward binary, golden path re-execution
|
| 220 |
├── AvailabilityReward healthcheck fraction
|
| 221 |
├── PhishingDetection correctly identified social engineering in logs (Level 1+)
|
|
|
|
| 222 |
└── FalsePositiveReward -0.2 per NPC traffic/email flagged
|
| 223 |
```
|
| 224 |
|
| 225 |
Rewards are computed from **container state and action logs**, never from LLM judgment.
|
| 226 |
|
|
| 227 |
## NPC Evolution: Shell Scripts to LLM Agents
|
| 228 |
|
| 229 |
NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.
|
|
|
|
| 179 |
Note over T,R: Episode ends: flag captured, max steps, or timeout
|
| 180 |
```
|
| 181 |
|
| 182 |
+
### Curriculum Feedback
|
| 183 |
|
| 184 |
+
The Builder acts as a **simulated expert curriculum designer**. Episode results feed back to shape future snapshots:
|
| 185 |
+
|
| 186 |
+
1. Track Red solve rate and Blue detection rate per snapshot (per vuln class, per tier)
|
| 187 |
+
2. Feed failure stats to Builder as `runtime_context` on next build
|
| 188 |
+
3. Builder LLM adjusts difficulty via `r_inject = 1 - (1+alpha)*s` (frontier calibration from SWE-RL)
|
| 189 |
+
4. Target agent weaknesses: if Red masters SQLi, seed SSRF or chained vulns next
|
| 190 |
+
5. When agents plateau: horizontal growth (add containers, zones, services)
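Steps 1 and 2 above could be sketched roughly as follows. The field names `red_solve_rate`, `blue_detect_rate`, `previous_vuln_classes`, and `weak_areas` come from the README's `runtime_context` example; everything else (the `EpisodeResult` record, the `build_runtime_context` helper, and the rule that a "weak area" is a vuln class whose Red solve rate falls below a threshold) is a hypothetical illustration, not the project's tracker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """Hypothetical per-episode record; not an OpenRange type."""
    vuln_class: str
    red_solved: bool
    blue_detected: bool


def build_runtime_context(results, weak_threshold=0.3):
    """Aggregate episode results into a runtime_context-style dict.

    weak_threshold is an assumed cutoff: a vuln class counts as a weak
    area when Red's solve rate on it is below this value.
    """
    if not results:
        raise ValueError("need at least one episode result")

    # Group Red outcomes by vuln class (step 1: per-class tracking).
    solved_by_class = defaultdict(list)
    for r in results:
        solved_by_class[r.vuln_class].append(r.red_solved)

    weak_areas = sorted(
        cls for cls, solved in solved_by_class.items()
        if sum(solved) / len(solved) < weak_threshold
    )
    n = len(results)
    return {
        "red_solve_rate": sum(r.red_solved for r in results) / n,
        "blue_detect_rate": sum(r.blue_detected for r in results) / n,
        "previous_vuln_classes": sorted(solved_by_class),
        "weak_areas": weak_areas,
    }


if __name__ == "__main__":
    history = [
        EpisodeResult("sqli", True, False),
        EpisodeResult("sqli", True, True),
        EpisodeResult("ssrf", False, False),
        EpisodeResult("ssrf", False, True),
    ]
    print(build_runtime_context(history))
```

The resulting dict is what step 2 would hand to the Builder on the next build; per-tier tracking is omitted to keep the sketch short.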
|
| 191 |
|
| 192 |
## Snapshot Artifacts
|
| 193 |
|
|
|
|
| 215 |
├── StealthReward coupled to Blue detection history
|
| 216 |
├── EvidenceReward quality of submit_evidence
|
| 217 |
├── SocialEngineeringReward NPC fell for phish/pretext (Level 1+)
|
| 218 |
+
├── ComplexityBonus tier_multiplier * base_reward (scales with snapshot complexity)
|
| 219 |
└── HallucinationPenalty -0.3 per fake flag
|
| 220 |
|
| 221 |
CompositeBlueReward (WeightedSum)
|
|
|
|
| 223 |
├── PatchReward binary, golden path re-execution
|
| 224 |
├── AvailabilityReward healthcheck fraction
|
| 225 |
├── PhishingDetection correctly identified social engineering in logs (Level 1+)
|
| 226 |
+
├── ComplexityBonus tier_multiplier * base_reward (scales with snapshot complexity)
|
| 227 |
└── FalsePositiveReward -0.2 per NPC traffic/email flagged
|
| 228 |
```
|
| 229 |
|
| 230 |
Rewards are computed from **container state and action logs**, never from LLM judgment.
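To make the `WeightedSum` composition in the tree above concrete, here is a minimal sketch of how a composite Red reward might be combined. Only the -0.3 per fake flag penalty comes from the tree; the component keys, the weights, and the choice to apply the penalty outside the weighted terms are assumptions made for illustration.

```python
HALLUCINATION_PENALTY = 0.3  # per fake flag, from the reward tree


def composite_red_reward(components, weights, fake_flags=0):
    """Weighted sum of reward components minus the hallucination penalty.

    components: component name -> value computed from container state and
                action logs (names here are hypothetical).
    weights:    component name -> weight; the docs do not list the real
                weights, so callers must supply their own.
    fake_flags: number of submitted flags that were not real.
    """
    if fake_flags < 0:
        raise ValueError("fake_flags must be non-negative")
    # A missing weight raises KeyError rather than silently counting as 0.
    weighted = sum(weights[name] * value for name, value in components.items())
    return weighted - HALLUCINATION_PENALTY * fake_flags


if __name__ == "__main__":
    components = {"flag": 1.0, "stealth": 0.5}
    weights = {"flag": 0.6, "stealth": 0.4}  # placeholder weights
    print(composite_red_reward(components, weights, fake_flags=1))
```

Because every input is a number derived from logs or container state, the result stays verifiable without any LLM judgment in the loop.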
|
| 231 |
|
| 232 |
+
### Tier-Scaled Reward Ceiling
|
| 233 |
+
|
| 234 |
+
Reward ceilings scale with environment complexity so that harder snapshots produce proportionally larger training signals:
|
| 235 |
+
|
| 236 |
+
| Tier | Hosts | Multiplier | Max Red Reward | Max Blue Reward |
|
| 237 |
+
|------|-------|-----------|----------------|-----------------|
|
| 238 |
+
| 1 | 6-8 | 1.0x | 1.0 | 1.0 |
|
| 239 |
+
| 2 | 10-12 | 1.5x | 1.5 | 1.5 |
|
| 240 |
+
| 3 | 14-18 | 2.0x | 2.0 | 2.0 |
|
| 241 |
+
| 4 | 20-25 | 2.5x | 2.5 | 2.5 |
|
| 242 |
+
| 5 | 30+ | 3.0x | 3.0 | 3.0 |
|
| 243 |
+
|
| 244 |
+
This ensures agents are incentivized to attempt harder environments rather than grinding easy Tier 1 snapshots.
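The table above maps directly onto a lookup. The following sketch shows one way the `tier_multiplier * base_reward` rule from the reward tree could be applied; the multipliers are copied from the table, while the function names and the error handling for unknown tiers are assumptions.

```python
# Multipliers copied from the tier table.
TIER_MULTIPLIER = {1: 1.0, 2: 1.5, 3: 2.0, 4: 2.5, 5: 3.0}


def tier_multiplier(tier: int) -> float:
    """Return the multiplier for a tier, rejecting tiers not in the table."""
    try:
        return TIER_MULTIPLIER[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier}") from None


def complexity_bonus(tier: int, base_reward: float) -> float:
    """tier_multiplier * base_reward, as written in the reward tree."""
    return tier_multiplier(tier) * base_reward


if __name__ == "__main__":
    for tier in sorted(TIER_MULTIPLIER):
        # With base_reward = 1.0 this reproduces the "Max Reward" columns.
        print(tier, complexity_bonus(tier, 1.0))
```

With a base reward capped at 1.0, the same lookup yields the maximum Red and Blue rewards listed for each tier.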
|
| 245 |
+
|
| 246 |
## NPC Evolution: Shell Scripts to LLM Agents
|
| 247 |
|
| 248 |
NPCs progress from mechanical noise generators to intelligent social engineering targets. Each level adds a modality without removing the previous one.
|