Commit ·
06332ca
1
Parent(s): 756be0d
Update Blog and Readme
- Blogs.md +105 -27
- README.md +23 -18
- cyber_selfplay_env/simulator.py +1 -1
- notebook/League(PFSP).ipynb +0 -0
- notebook/League_(PFSP_+_PSRO).ipynb +0 -0
- notebook/League_(PSRO) (1).ipynb +0 -0
- notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb +0 -0
- notebook/SFT_→_GRPO_(Vanilla).ipynb +0 -0
- openenv.yaml +125 -2
- pyproject.toml +1 -1
Blogs.md
CHANGED
|
@@ -1,8 +1,12 @@
|
|
| 1 |
-
# CyberSelfPlay: Building a
|
| 2 |
|
| 3 |
## Why this environment exists
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
- decisions unfold over many steps,
|
| 8 |
- observations are partial and noisy,
|
|
@@ -11,7 +15,13 @@ Most agent benchmarks are short-horizon and mostly single-agent. Real cyber defe
|
|
| 11 |
|
| 12 |
CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
|
| 13 |
|
| 14 |
-
What makes this direction
|
| 15 |
|
| 16 |
---
|
| 17 |
|
|
@@ -42,7 +52,7 @@ $$
|
|
| 42 |
r_B=-r_R-\lambda C_{\mathrm{collateral}}.
|
| 43 |
$$
|
| 44 |
|
| 45 |
-
In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders
|
| 46 |
|
| 47 |
---
|
| 48 |
|
|
@@ -55,14 +65,16 @@ At a high level, the system has:
|
|
| 55 |
- reward rubrics,
|
| 56 |
- metrics and progress tracking,
|
| 57 |
- scenario definitions and tool interfaces.
|
| 58 |
-
2. **API server**
|
| 59 |
- OpenEnv endpoints for interaction.
|
| 60 |
-
3. **Training
|
| 61 |
-
-
|
| 62 |
-
-
|
| 63 |
|
| 64 |
Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
|
| 65 |
|
|
|
|
|
|
|
| 66 |
---
|
| 67 |
|
| 68 |
## Observations, actions, rewards
|
|
@@ -129,7 +141,7 @@ In practice, this looked like “safe but repetitive” behavior: valid JSON, bu
|
|
| 129 |
|
| 130 |
### Step 3: Add stabilization in single-policy GRPO
|
| 131 |
|
| 132 |
-
In
|
| 133 |
|
| 134 |
- group-level diversity penalty when one tool dominates a batch,
|
| 135 |
- additional nudge against overusing `execute_instruction` when SFT bias is high,
|
|
@@ -141,7 +153,7 @@ This step is important because it addresses a common failure mode in small-model
|
|
| 141 |
|
| 142 |
### Step 4: Move to league training for broader robustness
|
| 143 |
|
| 144 |
-
Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We moved to
|
| 145 |
|
| 146 |
- run multiple league rounds,
|
| 147 |
- pick Red archetypes using PFSP / PSRO / mix,
|
|
@@ -166,12 +178,76 @@ This is where behavior starts to look more “field-like”: the defender is not
|
|
| 166 |
|
| 167 |
### Step 5: Turn logs into evidence, not just numbers
|
| 168 |
|
| 169 |
-
We deliberately kept artifact generation rich (
|
| 170 |
|
| 171 |
---
|
| 172 |
|
| 173 |
## Results and evidence
|
| 174 |
|
| 175 |
Across runs, we observe the expected pattern:
|
| 176 |
|
| 177 |
- Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
|
|
@@ -186,14 +262,13 @@ A useful way to read these results is:
|
|
| 186 |
|
| 187 |
By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
|
| 188 |
|
| 189 |
-
Primary artifacts produced
|
| 190 |
|
| 191 |
-
-
|
| 192 |
-
-
|
| 193 |
-
-
|
| 194 |
-
-
|
| 195 |
-
-
|
| 196 |
-
- league-specific: `training_curves_all_rounds.png`, `league_state.jsonl`, `log_history_combined.json`
|
| 197 |
|
| 198 |
These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
|
| 199 |
|
|
@@ -203,12 +278,12 @@ These files are the evidence trail for reward trends, variance, action diversity
|
|
| 203 |
|
| 204 |
CyberSelfPlay matters because it evaluates what real defenders need:
|
| 205 |
|
| 206 |
-
-
|
| 207 |
- adversarial interaction under uncertainty,
|
| 208 |
- measurable progress beyond one-step task completion.
|
| 209 |
|
| 210 |
-
For practitioners, it is closer to incident response realities.
|
| 211 |
-
For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior.
|
| 212 |
|
| 213 |
For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
|
| 214 |
For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
|
|
@@ -217,13 +292,16 @@ For evaluation-focused work, it provides a bridge between toy tasks and operatio
|
|
| 217 |
|
| 218 |
## Why this submission can stand out
|
| 219 |
|
| 220 |
-
- It tackles a
|
| 221 |
-
- It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure.
|
| 222 |
-
- It includes mathematical grounding, system-level structure, and artifact-level evidence in one coherent package.
|
| 223 |
-
- The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible.
|
| 224 |
|
| 225 |
---
|
| 226 |
|
| 227 |
-
##
|
| 228 |
|
| 229 |
-
-
|
|
|
|
| 1 |
+
# CyberSelfPlay: Building a Cyber Defense Environment
|
| 2 |
+
|
| 3 |
+
**Important links:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
|
| 4 |
+
|
| 5 |
+
**Documentation:** the math-led overview, full training table, and repository layout are in the [project README](README.md), which is also linked from [Where to go next](#where-to-go-next).
|
| 6 |
|
| 7 |
## Why this environment exists
|
| 8 |
|
| 9 |
+
In **real-world** security operations, impact is not a single model score. It is whether a team can run long incident timelines under uncertainty while adversaries adapt. **Industry** and government playbooks for detection, containment, and recovery read like multi-step missions, not one-shot classifiers. Yet most agent benchmarks are short-horizon and mostly single-agent. Cyber defense in practice is neither:
|
| 10 |
|
| 11 |
- decisions unfold over many steps,
|
| 12 |
- observations are partial and noisy,
|
|
|
|
| 15 |
|
| 16 |
CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
|
| 17 |
|
| 18 |
+
What makes this direction **novel** is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (multi-step behavior + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic, and where **industry**-relevant **impact** is actually decided.
|
| 19 |
+
|
| 20 |
+
### How this lines up with long-horizon and self-improvement themes
|
| 21 |
+
|
| 22 |
+
**Theme: (super) long-horizon planning and instruction following.** Missions are **long-running** by design: scenarios scale to **many** instructions and checkpoints, with **sparse and delayed** rewards from security and mission rubrics. The agent must **decompose** response goals, **track** state and playbook progress under partial visibility, and **recover** from early mistakes over **extended trajectories**, which is closer to durable planning than to one-shot next-step responses.
|
| 23 |
+
|
| 24 |
+
**Theme: self-improvement and adaptive curricula.** The **Red vs. Blue** loop is explicit **self-play** over a **defined** scenario family. **League** work (PFSP, PSRO, and mixed) plus round-based **GRPO** changes the **opponent mix** and pressure across training, so improvement is not fitting a static list of tasks but **recursive capability growth** driven by an **adaptive curriculum** and interaction feedback on the same environment.
|
| 25 |
|
| 26 |
---
|
| 27 |
|
|
|
|
| 52 |
r_B=-r_R-\lambda C_{\mathrm{collateral}}.
|
| 53 |
$$
|
| 54 |
|
| 55 |
+
In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders see in **real-world** response and the trade-offs that show up in **industry** debriefs.
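
As a minimal sketch of the reward coupling above, the Python snippet below evaluates $r_B=-r_R-\lambda C_{\mathrm{collateral}}$ for a single step. The function name `blue_reward`, the default value of `lam`, and the sample numbers are assumptions made for illustration; they are not taken from the environment's scoring code.

```python
def blue_reward(red_reward: float, collateral_cost: float, lam: float = 0.5) -> float:
    """Return r_B = -r_R - lam * C_collateral for one step.

    red_reward      -- r_R, the reward Red earned on this step
    collateral_cost -- C_collateral, a non-negative cost of harmful side effects
    lam             -- lambda, a non-negative weight on collateral damage (assumed value)
    """
    if lam < 0 or collateral_cost < 0:
        raise ValueError("lam and collateral_cost must be non-negative")
    return -red_reward - lam * collateral_cost


# Red gains 2.0 and Blue's response causes 1.0 of collateral damage:
# Blue is charged for both Red's gain and the weighted side effect.
print(blue_reward(red_reward=2.0, collateral_cost=1.0, lam=0.5))  # -2.5

# Red loses ground (negative r_R) and Blue causes no collateral damage.
print(blue_reward(red_reward=-1.0, collateral_cost=0.0))  # 1.0
```

With `lam = 0` the game is strictly zero-sum; any positive `lam` means a heavy-handed Blue action can score worse than a more careful one even when Red's gain is the same.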
|
| 56 |
|
| 57 |
---
|
| 58 |
|
|
|
|
| 65 |
- reward rubrics,
|
| 66 |
- metrics and progress tracking,
|
| 67 |
- scenario definitions and tool interfaces.
|
| 68 |
+
2. **API server**
|
| 69 |
- OpenEnv endpoints for interaction.
|
| 70 |
+
3. **Training pipelines**
|
| 71 |
+
- a single-policy path (SFT -> GRPO),
|
| 72 |
+
- and a league-based path (SFT -> rounds + mini-GRPO + PFSP/PSRO updates).
|
| 73 |
|
| 74 |
Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
|
| 75 |
|
| 76 |
+
**System figures (open as links or view inline in [Colab, diagrams, and repository notebooks](#colab-diagrams-and-repository-notebooks)):** the [**environment architecture** diagram (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) shows how the environment, server, and training stack connect; the [**end-to-end training flow** (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) summarizes SFT, GRPO, and league training at a glance. Placing the links here matches how engineers skim a project: first the shape of the system, then the pipeline.
|
| 77 |
+
|
| 78 |
---
|
| 79 |
|
| 80 |
## Observations, actions, rewards
|
|
|
|
| 141 |
|
| 142 |
### Step 3: Add stabilization in single-policy GRPO
|
| 143 |
|
| 144 |
+
In the single-policy training path, we introduced shaping aligned with this issue:
|
| 145 |
|
| 146 |
- group-level diversity penalty when one tool dominates a batch,
|
| 147 |
- additional nudge against overusing `execute_instruction` when SFT bias is high,
|
|
|
|
| 153 |
|
| 154 |
### Step 4: Move to league training for broader robustness
|
| 155 |
|
| 156 |
+
Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We then moved to a league-based training loop:
|
| 157 |
|
| 158 |
- run multiple league rounds,
|
| 159 |
- pick Red archetypes using PFSP / PSRO / mix,
|
|
|
|
| 178 |
|
| 179 |
### Step 5: Turn logs into evidence, not just numbers
|
| 180 |
|
| 181 |
+
We deliberately kept artifact generation rich (training curves, per-step logs, and combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.
|
| 182 |
+
|
| 183 |
+
---
|
| 184 |
+
|
| 185 |
+
## Colab, diagrams, and repository notebooks
|
| 186 |
+
|
| 187 |
+
The README documents the same training recipes with **public Colab** links and **static curve images** (repeated below under [Results and evidence](#results-and-evidence)). In the repo, the `notebook/` directory holds local copies aligned with each recipe.
|
| 188 |
+
|
| 189 |
+
### Environment diagrams (from the README)
|
| 190 |
+
|
| 191 |
+
These SVGs are the high-level system view and training pipeline, as in the [README `Environment Architecture` and `Training Flow` sections](README.md#environment-architecture). You can open each asset directly: [**architecture (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) · [**training flow (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg).
|
| 192 |
+
|
| 193 |
+
**Architecture**
|
| 194 |
+
|
| 195 |
+
[Open architecture diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg)
|
| 196 |
+
|
| 197 |
+
<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg" width="800" alt="CyberSelfPlay environment architecture" />
|
| 198 |
+
|
| 199 |
+
**Training flow**
|
| 200 |
+
|
| 201 |
+
[Open training flow diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg)
|
| 202 |
+
|
| 203 |
+
<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg" width="800" alt="Training flow from SFT to GRPO and league" />
|
| 204 |
+
|
| 205 |
+
### Colab notebooks and what each path does
|
| 206 |
+
|
| 207 |
+
| Method | Open in Colab | Local notebook in `notebook/` | In short |
|
| 208 |
+
|--------|----------------|-------------------------------|----------|
|
| 209 |
+
| **SFT → GRPO (Vanilla)** | [Open in Colab](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | `SFT_→_GRPO_(Vanilla).ipynb` | Supervised fine-tuning on trajectory-style data, then **vanilla GRPO** with the environment reward only: the baseline for single-policy learning. |
|
| 210 |
+
| **SFT → GRPO (Anti-Collapse)** | [Open in Colab](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | `SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb` | Same SFT + GRPO stack with **diversity / anti-collapse** regularization so the policy does not collapse to a tiny set of tool actions. |
|
| 211 |
+
| **League (PFSP)** | [Open in Colab](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | `League(PFSP).ipynb` | **League** training with **Prioritized Fictitious Self-Play**: opponents are sampled with weights tied to matchups, so the defender faces a shifting mixture of Red styles. |
|
| 212 |
+
| **League (PSRO)** | [Open in Colab](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | `League_(PSRO) (1).ipynb` | League loop using **PSRO-style** meta-updates on a population of policies (response oracles) rather than only PFSP sampling. |
|
| 213 |
+
| **League (PFSP + PSRO)** | [Open in Colab](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | `League_(PFSP_+_PSRO).ipynb` | **Combined** path: PFSP for opponent (policy) choice plus PSRO-style weighting so sampling and meta-game updates run together. |
|
| 214 |
+
|
| 215 |
+
The notebooks mirror the table in the [README `Training Approaches` section](README.md#-training-approaches-in-this-project); Colab is the shareable run surface, and the `notebook/` files are the offline copies in this repository.
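
To make the PFSP row more concrete, here is a hedged sketch of opponent sampling. It assumes one common PFSP weighting, $(1-w)^p$ where $w$ is Blue's win rate against a Red archetype, so that harder opponents are drawn more often. The archetype names, the `power` parameter, and both function names are hypothetical, and the notebooks may weight matchups differently.

```python
import random


def pfsp_weights(win_rates: dict, power: float = 2.0) -> dict:
    """Map Blue's win rate against each Red archetype to a sampling probability.

    Uses the (1 - win_rate) ** power weighting: opponents Blue rarely beats
    receive more probability mass. Falls back to uniform if every weight is zero.
    """
    if not win_rates:
        raise ValueError("win_rates must not be empty")
    raw = {name: (1.0 - w) ** power for name, w in win_rates.items()}
    total = sum(raw.values())
    if total == 0.0:
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def sample_opponent(win_rates: dict, rng: random.Random, power: float = 2.0) -> str:
    """Draw one Red archetype according to the PFSP probabilities."""
    weights = pfsp_weights(win_rates, power)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical archetypes with Blue's current win rate against each.
blue_win_rates = {"stealthy": 0.9, "aggressive": 0.5, "opportunistic": 0.2}
print(pfsp_weights(blue_win_rates))
print(sample_opponent(blue_win_rates, random.Random(0)))
```

Raising `power` concentrates sampling on the hardest matchups, while `power = 0` reduces to uniform sampling over the population.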
|
| 216 |
|
| 217 |
---
|
| 218 |
|
| 219 |
## Results and evidence
|
| 220 |
|
| 221 |
+
### Figures from training runs (same assets as the README)
|
| 222 |
+
|
| 223 |
+
Below are the **SFT / GRPO / league** curve figures linked from the README’s training table, plus the **SFT training loss** plot referenced for this write-up. Together they are the main visual evidence for convergence and per-method behavior.
|
| 224 |
+
|
| 225 |
+
**SFT training loss (cross-entropy on expert trajectories).** The run shows a clean optimization trajectory: loss starts around **3.2–3.3**, stays almost flat for the first few steps, then falls steeply from roughly step **5** through **25**. After that the curve flattens: from about step **30** onward training loss sits near **0.1** (steps on the x-axis go up to about **37**), which indicates that the SFT stage has found a low-NLL fit on the demonstration data.
|
| 226 |
+
|
| 227 |
+
<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193812/image_8_s3tzys.png" width="700" alt="SFT training loss vs steps" />
|
| 228 |
+
|
| 229 |
+
**SFT → GRPO (Vanilla).**
|
| 230 |
+
|
| 231 |
+
<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="700" alt="SFT to GRPO Vanilla metrics" />
|
| 232 |
+
|
| 233 |
+
**SFT → GRPO (Anti-Collapse).**
|
| 234 |
+
|
| 235 |
+
<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="700" alt="SFT to GRPO with anti-collapse regularization" />
|
| 236 |
+
|
| 237 |
+
**League (PFSP).**
|
| 238 |
+
|
| 239 |
+
<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="700" alt="League PFSP training curves" />
|
| 240 |
+
|
| 241 |
+
**League (PSRO).**
|
| 242 |
+
|
| 243 |
+
<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="700" alt="League PSRO training curves" />
|
| 244 |
+
|
| 245 |
+
**League (PFSP + PSRO).**
|
| 246 |
+
|
| 247 |
+
<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="700" alt="League PFSP and PSRO combined" />
|
| 248 |
+
|
| 249 |
+
### Interpretation in one pass
|
| 250 |
+
|
| 251 |
Across runs, we observe the expected pattern:
|
| 252 |
|
| 253 |
- Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
|
|
|
|
| 262 |
|
| 263 |
By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
|
| 264 |
|
| 265 |
+
Primary artifacts produced during training:
|
| 266 |
|
| 267 |
+
- consolidated training curves
|
| 268 |
+
- full optimization history logs
|
| 269 |
+
- per-step reward traces
|
| 270 |
+
- per-step behavior snapshots
|
| 271 |
+
- league-specific multi-round trend and meta-state reports
|
|
|
|
| 272 |
|
| 273 |
These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
|
| 274 |
|
|
|
|
| 278 |
|
| 279 |
CyberSelfPlay matters because it evaluates what real defenders need:
|
| 280 |
|
| 281 |
+
- multi-step, instruction-conditioned recovery,
|
| 282 |
- adversarial interaction under uncertainty,
|
| 283 |
- measurable progress beyond one-step task completion.
|
| 284 |
|
| 285 |
+
For **industry** practitioners, it is closer to incident response realities and to how blue teams think about time-to-detect, containment, and recovery.
|
| 286 |
+
For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior, with a **novel** mix of instruction following, tools, and adversarial pressure in one environment.
|
| 287 |
|
| 288 |
For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
|
| 289 |
For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
|
|
|
|
| 292 |
|
| 293 |
## Why this submission can stand out
|
| 294 |
|
| 295 |
+
- It tackles a setting grounded in **real-world** operations that combines multi-step behavior, partial observability, adversarial play, and mission objectives in one benchmark, which is an unusual and **impact**-relevant target for the field.
|
| 296 |
+
- It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure, with clear **industry**-minded artifacts (curves, logs, league history).
|
| 297 |
+
- It includes mathematical grounding, system-level structure, a **novel** diagram-led presentation of the stack (see the [architecture](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) and [training flow](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) links in [Core components](#core-components) and [diagrams](#environment-diagrams-from-the-readme)), and artifact-level evidence in one coherent package.
|
| 298 |
+
- The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible, which is what teams need to trust deployment-related claims.
|
| 299 |
|
| 300 |
---
|
| 301 |
|
| 302 |
+
## Where to go next
|
| 303 |
+
|
| 304 |
+
- **Project README** (formal POSG, rewards, training math, and full method table): [README.md](README.md)
|
| 305 |
+
- **Hugging Face Space (live environment):** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
|
| 306 |
|
| 307 |
+
The README and this blog point to each other so you can move between the specification-style overview and the narrative plus figures here.
|
README.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
-
---
|
| 2 |
-
title: CyberSelfPlay (
|
| 3 |
emoji: 🛡️
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: red
|
|
@@ -10,17 +10,22 @@ pinned: true
|
|
| 10 |
|
| 11 |
# CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
|
| 12 |
|
| 13 |
-
|
|
|
|
|
|
|
| 14 |
|
| 15 |
## Environment on Hugging Face Space
|
| 16 |
|
| 17 |
-
- **Live Space:**
|
|
|
|
| 18 |
|
| 19 |
---
|
| 20 |
|
| 21 |
## Problem and Capability Gap
|
| 22 |
|
| 23 |
-
Most agent benchmarks are short
|
|
|
|
|
|
|
| 24 |
|
| 25 |
---
|
| 26 |
|
|
@@ -108,7 +113,7 @@ r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_
|
|
| 108 |
\end{aligned}
|
| 109 |
$$
|
| 110 |
|
| 111 |
-
|
| 112 |
|
| 113 |
---
|
| 114 |
|
|
@@ -137,9 +142,9 @@ We experiment across **SFT + GRPO baselines**, **reward smoothing**, **diversity
|
|
| 137 |
| **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
|
| 138 |
| **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
|
| 139 |
| **🔹 League (Multi-Policy RL)** ||||
|
| 140 |
-
| **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/
|
| 141 |
-
| **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/
|
| 142 |
-
| **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/
|
| 143 |
|
| 144 |
---
|
| 145 |
|
|
@@ -239,7 +244,7 @@ Score each completion with reward $R^{(j)}$, compute group-relative advantages,
|
|
| 239 |
|
| 240 |
---
|
| 241 |
|
| 242 |
-
##
|
| 243 |
|
| 244 |
| scenario | turns | instructions | checkpoint stride |
|
| 245 |
| --- | ---: | ---: | ---: |
|
|
@@ -255,20 +260,20 @@ Instruction progress and violation signals are tracked in environment metadata.
|
|
| 255 |
|
| 256 |
Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
|
| 257 |
|
| 258 |
-
Common result artifacts produced by
|
| 259 |
|
| 260 |
-
-
|
| 261 |
-
-
|
| 262 |
-
-
|
| 263 |
-
-
|
| 264 |
-
- per-step
|
| 265 |
-
-
|
| 266 |
|
| 267 |
---
|
| 268 |
|
| 269 |
## Why It Matters
|
| 270 |
|
| 271 |
-
- **Security operations relevance:** models
|
| 272 |
- **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
|
| 273 |
- **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.
|
| 274 |
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: CyberSelfPlay (Cyber POSG)
|
| 3 |
emoji: 🛡️
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: red
|
|
|
|
| 10 |
|
| 11 |
# CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
|
| 12 |
|
| 13 |
+
**Important links:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
|
| 14 |
+
|
| 15 |
+
CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
|
| 16 |
|
| 17 |
## Environment on Hugging Face Space
|
| 18 |
|
| 19 |
+
- **Live Space:** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
|
| 20 |
+
- **Narrative, Colab context, and results figures:** [Blogs](Blogs.md)
|
| 21 |
|
| 22 |
---
|
| 23 |
|
| 24 |
## Problem and Capability Gap
|
| 25 |
|
| 26 |
+
Most agent benchmarks are short-horizon and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
|
| 27 |
+
|
| 28 |
+
**Connection to long-horizon and self-play themes:** the setting stresses **(super) long-horizon planning and instruction following**: episodes have many steps and many playbook instructions, and security rewards are often sparse or delayed, so the agent must track state, recover from missteps, and keep coherent plans across long runs. It also supports **self-improvement through interaction**: the training stack uses **SFT → GRPO** and **league (PFSP / PSRO / mix)** to keep pressure adaptive. Opponents and rounds change, so the LLM policy is tuned not on a static task set but on an evolving, self-play-style curriculum over the same family of tasks.
|
| 29 |
|
| 30 |
---
|
| 31 |
|
|
|
|
| 113 |
\end{aligned}
|
| 114 |
$$
|
| 115 |
|
| 116 |
+
The reward rubric is implemented directly in the environment’s scoring logic.
|
| 117 |
|
| 118 |
---
|
| 119 |
|
|
|
|
| 142 |
| **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
|
| 143 |
| **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
|
| 144 |
| **🔹 League (Multi-Policy RL)** ||||
|
| 145 |
+
| **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="350"/> |
|
| 146 |
+
| **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="350"/> |
|
| 147 |
+
| **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="350"/> |
|
| 148 |
|
| 149 |
---
|
| 150 |
|
|
|
|
| 244 |
|
| 245 |
---
|
| 246 |
|
| 247 |
+
## Scenario Scale
|
| 248 |
|
| 249 |
| scenario | turns | instructions | checkpoint stride |
|
| 250 |
| --- | ---: | ---: | ---: |
|
|
|
|
| 260 |
|
| 261 |
Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
|
| 262 |
|
| 263 |
+
Common result artifacts produced by training include:
|
| 264 |
|
| 265 |
+
- consolidated training curves,
|
| 266 |
+
- step-by-step optimization history,
|
| 267 |
+
- metrics logs,
|
| 268 |
+
- per-sample reward traces,
|
| 269 |
+
- per-step visualization snapshots,
|
| 270 |
+
- and, for league experiments, combined multi-round trend and meta-state reports.
|
| 271 |
|
| 272 |
---
|
| 273 |
|
| 274 |
## Why It Matters
|
| 275 |
|
| 276 |
+
- **Security operations relevance:** models multi-step defense decisions closer to real incident response.
|
| 277 |
- **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
|
| 278 |
- **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.
|
| 279 |
|
cyber_selfplay_env/simulator.py
CHANGED
|
@@ -159,7 +159,7 @@ class CyberSimulator:
|
|
| 159 |
pending = [x for x in mission["instructions"] if not x["done"]]
|
| 160 |
if pending:
|
| 161 |
current = pending[0]
|
| 162 |
-
# Requires matching tool in params to model
|
| 163 |
requested_tool = ""
|
| 164 |
if params and isinstance(params.get("required_tool"), str):
|
| 165 |
requested_tool = params["required_tool"]
|
|
|
|
| 159 |
pending = [x for x in mission["instructions"] if not x["done"]]
|
| 160 |
if pending:
|
| 161 |
current = pending[0]
|
| 162 |
+
# Requires matching tool in params to model multi-step instruction following.
|
| 163 |
requested_tool = ""
|
| 164 |
if params and isinstance(params.get("required_tool"), str):
|
| 165 |
requested_tool = params["required_tool"]
|
notebook/League(PFSP).ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/League_(PFSP_+_PSRO).ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/League_(PSRO) (1).ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
notebook/SFT_→_GRPO_(Vanilla).ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
openenv.yaml
CHANGED
|
@@ -1,16 +1,139 @@
|
|
| 1 |
env:
|
| 2 |
name: "CyberSelfPlay"
|
| 3 |
-
author: "
|
| 4 |
-
description: "Long-horizon
|
| 5 |
version: "0.1.0"
|
| 6 |
|
| 7 |
server:
|
| 8 |
host: "0.0.0.0"
|
| 9 |
port: 7870
|
| 10 |
workers: 1
|
| 11 |
module: "server.app:app"
|
| 12 |
|
| 13 |
features:
|
| 14 |
multi_reward: true
|
| 15 |
prevent_hacking: true
|
| 16 |
curriculum_scheduler: true
|
|
|
|
| 1 |
env:
|
| 2 |
name: "CyberSelfPlay"
|
| 3 |
+
author: "Team Neuron"
|
| 4 |
+
description: "Long-horizon Red-vs-Blue cyber defense POSG with partial observability, stochastic transitions, mission-instruction progress signals, and league-style opponent pressure for robust policy learning."
|
| 5 |
version: "0.1.0"
|
| 6 |
+
homepage: "https://huggingface.co/spaces/HarshitShri026"
|
| 7 |
+
domain: "cyber-defense"
|
| 8 |
+
tags:
|
| 9 |
+
- "openenv"
|
| 10 |
+
- "cybersecurity"
|
| 11 |
+
- "red-vs-blue"
|
| 12 |
+
- "multi-agent"
|
| 13 |
+
- "multi-step"
|
| 14 |
+
- "partially-observable"
|
| 15 |
+
- "instruction-following"
|
| 16 |
+
- "reinforcement-learning"
|
| 17 |
+
- "long-horizon"
|
| 18 |
+
- "self-play"
|
| 19 |
+
- "adaptive-curriculum"
|
| 20 |
+
# Aligns with program themes: (1) long-horizon planning & instruction following;
|
| 21 |
+
# (2) self-improvement via self-play and adaptive opponent pressure.
|
| 22 |
+
program_themes:
|
| 23 |
+
long_horizon_planning_and_instruction_following: >
|
| 24 |
+
Episodes and scenarios scale to many steps and many playbook instructions, with
|
| 25 |
+
sparse and delayed security and mission rewards. Agents must decompose response
|
| 26 |
+
goals, track partial state and instruction progress, and maintain coherent
|
| 27 |
+
behavior across long trajectories (beyond one-shot or shallow next-step reasoning).
|
| 28 |
+
self_improvement_and_adaptive_curricula: >
|
| 29 |
+
Red versus Blue interaction provides explicit self-play over a defined family of
|
| 30 |
+
cyber-defense tasks. SFT, GRPO, and league training (PFSP, PSRO, and mixed
|
| 31 |
+
meta-scheduling) vary opponent mix and round pressure, yielding adaptive-curriculum
|
| 32 |
+
style learning and recursive policy improvement on the same environment interface.
|
| 33 |
+
task_type: "sequential_decision_making"
|
| 34 |
+
horizon:
|
| 35 |
+
min_steps: 60
|
| 36 |
+
max_steps: 180
|
| 37 |
+
scenarios:
|
| 38 |
+
- name: "small"
|
| 39 |
+
turns: 60
|
| 40 |
+
instructions: 40
|
| 41 |
+
checkpoint_stride: 8
|
| 42 |
+
- name: "medium"
|
| 43 |
+
turns: 100
|
| 44 |
+
instructions: 120
|
| 45 |
+
checkpoint_stride: 12
|
| 46 |
+
- name: "large"
|
| 47 |
+
turns: 180
|
| 48 |
+
instructions: 300
|
| 49 |
+
checkpoint_stride: 20
|
| 50 |
+
agents:
|
| 51 |
+
red:
|
| 52 |
+
role: "attacker"
|
| 53 |
+
objective: "maximize foothold/privilege/lateral movement/exfiltration while avoiding detection"
|
| 54 |
+
blue:
|
| 55 |
+
role: "defender"
|
| 56 |
+
objective: "detect/contain/recover while completing ordered mission instructions"
|
| 57 |
+
observation_space:
|
| 58 |
+
red: "partial observability over attack-relevant state and outcomes"
|
| 59 |
+
blue: "partial observability over defense state, mission context, and progress metadata"
|
| 60 |
+
action_space:
|
| 61 |
+
red: "structured cyber actions for adversarial operations"
|
| 62 |
+
blue: "structured CyberAction JSON tool calls"
|
| 63 |
+
reward_model:
|
| 64 |
+
type: "multi-component"
|
| 65 |
+
notes:
|
| 66 |
+
- "dense + delayed terms"
|
| 67 |
+
- "instruction progress/checkpoint/violation shaping"
|
| 68 |
+
- "near-zero-sum coupling with collateral cost term"
|
| 69 |
+
references:
|
| 70 |
+
project_overview: "Main project overview and environment description"
|
| 71 |
+
technical_blog: "Narrative write-up with math, training journey, and results"
|
| 72 |
+
environment_components: "Simulator, rubrics, metrics, scenarios, and tool interfaces"
|
| 73 |
+
training_process: "For full training process details, refer to README.md"
|
| 74 |
+
notebooks:
|
| 75 |
+
- "notebook/SFT_→_GRPO_(Vanilla).ipynb"
|
| 76 |
+
- "notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb"
|
| 77 |
+
- "notebook/League(PFSP).ipynb"
|
| 78 |
+
- 'notebook/League_(PSRO) (1).ipynb'
|
| 79 |
+
- "notebook/League_(PFSP_+_PSRO).ipynb"
|
| 80 |
+
training_paths:
|
| 81 |
+
- "Single-policy SFT to GRPO refinement"
|
| 82 |
+
- "League-based SFT to round-wise GRPO with PFSP/PSRO scheduling"
|
| 83 |
|
| 84 |
server:
|
| 85 |
host: "0.0.0.0"
|
| 86 |
port: 7870
|
| 87 |
workers: 1
|
| 88 |
module: "server.app:app"
|
| 89 |
+
routes_hint:
|
| 90 |
+
- "/health"
|
| 91 |
+
- "/info"
|
| 92 |
+
- "/artifacts"
|
| 93 |
+
api_style: "OpenEnv-compatible FastAPI service"
|
| 94 |
|
| 95 |
features:
|
| 96 |
multi_reward: true
|
| 97 |
prevent_hacking: true
|
| 98 |
curriculum_scheduler: true
|
| 99 |
+
partial_observability: true
|
| 100 |
+
stochastic_dynamics: true
|
| 101 |
+
multi_agent: true
|
| 102 |
+
instruction_tracking: true
|
| 103 |
+
adversarial_interaction: true
|
| 104 |
+
league_training_support: true
|
| 105 |
+
pfsp_support: true
|
| 106 |
+
psro_support: true
|
| 107 |
+
|
| 108 |
+
training:
|
| 109 |
+
primary_pipelines:
|
| 110 |
+
- name: "sft_grpo"
|
| 111 |
+
implementation: "single-policy training path"
|
| 112 |
+
summary: "SFT warm start followed by single-policy GRPO refinement"
|
| 113 |
+
- name: "sft_league_grpo"
|
| 114 |
+
implementation: "league-based training path"
|
| 115 |
+
summary: "SFT + league rounds with PFSP/PSRO/mix opponent scheduling and mini-GRPO updates"
|
| 116 |
+
artifacts:
|
| 117 |
+
common:
|
| 118 |
+
- "training curves"
|
| 119 |
+
- "optimization history logs"
|
| 120 |
+
- "metrics logs"
|
| 121 |
+
- "per-sample reward traces"
|
| 122 |
+
- "per-step visualizations"
|
| 123 |
+
league:
|
| 124 |
+
- "combined multi-round trend curves"
|
| 125 |
+
- "league state trajectory logs"
|
| 126 |
+
- "combined round history logs"
|
| 127 |
+
|
| 128 |
+
evaluation:
|
| 129 |
+
built_in_metrics:
|
| 130 |
+
- "instruction_progress_rate"
|
| 131 |
+
- "instruction_violation_rate"
|
| 132 |
+
- "mttd"
|
| 133 |
+
- "mttr"
|
| 134 |
+
- "exfiltration_pressure"
|
| 135 |
+
- "checkpoint_progress"
|
| 136 |
+
success_characterization:
|
| 137 |
+
- "improved environment-aligned Blue reward after SFT->GRPO"
|
| 138 |
+
- "stable action diversity under anti-collapse shaping"
|
| 139 |
+
- "robustness gains under league opponent variation"
|
pyproject.toml
CHANGED
|
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
|
|
| 5 |
[project]
|
| 6 |
name = "openenv-cyber-selfplay"
|
| 7 |
version = "0.1.0"
|
| 8 |
-
description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2:
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
| 11 |
license = { text = "MIT" }
|
|
|
|
| 5 |
[project]
|
| 6 |
name = "openenv-cyber-selfplay"
|
| 7 |
version = "0.1.0"
|
| 8 |
+
description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2: multi-step reasoning)."
|
| 9 |
readme = "README.md"
|
| 10 |
requires-python = ">=3.10"
|
| 11 |
license = { text = "MIT" }
|