HarshitShri026 committed on
Commit
06332ca
·
1 Parent(s): 756be0d

Update Blog and Readme

Blogs.md CHANGED
@@ -1,8 +1,12 @@
1
- # CyberSelfPlay: Building a Long-Horizon Cyber Defense Environment
2
 
3
  ## Why this environment exists
4
 
5
- Most agent benchmarks are short-horizon and mostly single-agent. Real cyber defense is neither:
6
 
7
  - decisions unfold over many steps,
8
  - observations are partial and noisy,
@@ -11,7 +15,13 @@ Most agent benchmarks are short-horizon and mostly single-agent. Real cyber defe
11
 
12
  CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
13
 
14
- What makes this direction different is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (long horizon + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic.
15
 
16
  ---
17
 
@@ -42,7 +52,7 @@ $$
42
  r_B=-r_R-\lambda C_{\mathrm{collateral}}.
43
  $$
44
 
45
- In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders face in real systems.
46
 
47
  ---
48
 
@@ -55,14 +65,16 @@ At a high level, the system has:
55
  - reward rubrics,
56
  - metrics and progress tracking,
57
  - scenario definitions and tool interfaces.
58
- 2. **API server** (`server/app.py`)
59
  - OpenEnv endpoints for interaction.
60
- 3. **Training scripts** (`train/`)
61
- - `kaggle_grpo.py` (single-policy SFT -> GRPO),
62
- - `kaggle_grpo_league.py` (SFT -> league rounds + mini-GRPO + PFSP/PSRO updates).
63
 
64
  Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
65
 
 
 
66
  ---
67
 
68
  ## Observations, actions, rewards
@@ -129,7 +141,7 @@ In practice, this looked like “safe but repetitive” behavior: valid JSON, bu
129
 
130
  ### Step 3: Add stabilization in single-policy GRPO
131
 
132
- In `kaggle_grpo.py`, we introduced shaping aligned with this issue:
133
 
134
  - group-level diversity penalty when one tool dominates a batch,
135
  - additional nudge against overusing `execute_instruction` when SFT bias is high,
@@ -141,7 +153,7 @@ This step is important because it addresses a common failure mode in small-model
141
 
142
  ### Step 4: Move to league training for broader robustness
143
 
144
- Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We moved to `kaggle_grpo_league.py`:
145
 
146
  - run multiple league rounds,
147
  - pick Red archetypes using PFSP / PSRO / mix,
@@ -166,12 +178,76 @@ This is where behavior starts to look more “field-like”: the defender is not
166
 
167
  ### Step 5: Turn logs into evidence, not just numbers
168
 
169
- We deliberately kept artifact generation rich (`training_curves.png`, per-step JSONL logs, combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.
170
 
171
  ---
172
 
173
  ## Results and evidence
174
 
175
  Across runs, we observe the expected pattern:
176
 
177
  - Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
@@ -186,14 +262,13 @@ A useful way to read these results is:
186
 
187
  By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
188
 
189
- Primary artifacts produced by the training scripts:
190
 
191
- - `training_curves.png`
192
- - `log_history.json`
193
- - `train_metrics.log`
194
- - `per_step_rewards.jsonl`
195
- - per-step curves under `curves/`
196
- - league-specific: `training_curves_all_rounds.png`, `league_state.jsonl`, `log_history_combined.json`
197
 
198
  These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
199
 
@@ -203,12 +278,12 @@ These files are the evidence trail for reward trends, variance, action diversity
203
 
204
  CyberSelfPlay matters because it evaluates what real defenders need:
205
 
206
- - long-horizon, instruction-conditioned recovery,
207
  - adversarial interaction under uncertainty,
208
  - measurable progress beyond one-step task completion.
209
 
210
- For practitioners, it is closer to incident response realities.
211
- For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior.
212
 
213
  For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
214
  For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
@@ -217,13 +292,16 @@ For evaluation-focused work, it provides a bridge between toy tasks and operatio
217
 
218
  ## Why this submission can stand out
219
 
220
- - It tackles a hard setting that combines long horizon, partial observability, adversarial play, and mission objectives in one benchmark.
221
- - It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure.
222
- - It includes mathematical grounding, system-level structure, and artifact-level evidence in one coherent package.
223
- - The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible.
224
 
225
  ---
226
 
227
- ## Environment link
 
 
 
228
 
229
- - Hugging Face Space: `https://huggingface.co/spaces/HarshitShri026`
 
1
+ # CyberSelfPlay: Building a Cyber Defense Environment
2
+
3
+ **Important links:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
4
+
5
+ **Documentation:** the math-led overview, full training table, and repository layout are in the [project README](README.md), which is also linked from [Where to go next](#where-to-go-next).
6
 
7
  ## Why this environment exists
8
 
9
+ In **real-world** security operations, impact is not a single model score. It is whether a team can run long incident timelines under uncertainty while adversaries adapt. **Industry** and government playbooks for detection, containment, and recovery read like multi-step missions, not one-shot classifiers. Yet most agent benchmarks are short-horizon and mostly single-agent. Cyber defense in practice is neither:
10
 
11
  - decisions unfold over many steps,
12
  - observations are partial and noisy,
 
15
 
16
  CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
17
 
18
+ What makes this direction **novel** is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (multi-step behavior + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic, and where **industry**-relevant **impact** is actually decided.
19
+
20
+ ### How this lines up with long-horizon and self-improvement themes
21
+
22
+ **Theme: (super) long-horizon planning and instruction following.** Missions are **long-running** by design: scenarios scale to **many** instructions and checkpoints, with **sparse and delayed** rewards from security and mission rubrics. The agent must **decompose** response goals, **track** state and playbook progress under partial visibility, and **recover** from early mistakes over **extended trajectories**, which is closer to durable planning than to one-shot responses.
23
+
24
+ **Theme: self-improvement and adaptive curricula.** The **Red vs. Blue** loop is explicit **self-play** over a **defined** scenario family. **League** training (PFSP, PSRO, and mixed) plus round-based **GRPO** changes the **opponent mix** and pressure across training, so improvement comes not from fitting a static list of tasks but from **recursive capability growth** driven by an **adaptive curriculum** and interaction feedback on the same environment.
25
 
26
  ---
27
 
 
52
  r_B=-r_R-\lambda C_{\mathrm{collateral}}.
53
  $$
54
 
55
+ In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders see in **real-world** response and the trade-offs that show up in **industry** debriefs.
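As a minimal sketch of the coupling above, the Blue reward can be computed from the Red reward and a collateral cost. The function name, the argument names, and the value of `lam` (standing in for $\lambda$) are illustrative assumptions, not the environment's actual rubric code.

```python
def blue_reward(red_reward: float, collateral_cost: float, lam: float = 0.5) -> float:
    """Near-zero-sum coupling: r_B = -r_R - lambda * C_collateral.

    `lam` is a hypothetical weight chosen for illustration only.
    """
    return -red_reward - lam * collateral_cost


# Red gains 2.0 while Blue's response causes 1.0 unit of collateral cost.
print(blue_reward(red_reward=2.0, collateral_cost=1.0, lam=0.5))  # -2.5
```

With `lam = 0`, the game is exactly zero-sum. Any positive `lam` makes a disruptive defense cost Blue reward even when it fully stops Red.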
56
 
57
  ---
58
 
 
65
  - reward rubrics,
66
  - metrics and progress tracking,
67
  - scenario definitions and tool interfaces.
68
+ 2. **API server**
69
  - OpenEnv endpoints for interaction.
70
+ 3. **Training pipelines**
71
+ - a single-policy path (SFT -> GRPO),
72
+ - and a league-based path (SFT -> rounds + mini-GRPO + PFSP/PSRO updates).
73
 
74
  Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
75
 
76
+ **System figures (open as links or view inline in [Colab, diagrams, and repository notebooks](#colab-diagrams-and-repository-notebooks)):** the [**environment architecture** diagram (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) shows how the environment, server, and training stack connect; the [**end-to-end training flow** (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) summarizes SFT, GRPO, and league training at a glance. The links sit here so that a reader sees the shape of the system first and the pipeline second.
77
+
78
  ---
79
 
80
  ## Observations, actions, rewards
 
141
 
142
  ### Step 3: Add stabilization in single-policy GRPO
143
 
144
+ In the single-policy training path, we introduced shaping aligned with this issue:
145
 
146
  - group-level diversity penalty when one tool dominates a batch,
147
  - additional nudge against overusing `execute_instruction` when SFT bias is high,
 
153
 
154
  ### Step 4: Move to league training for broader robustness
155
 
156
+ Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We then moved to a league-based training loop:
157
 
158
  - run multiple league rounds,
159
  - pick Red archetypes using PFSP / PSRO / mix,
 
178
 
179
  ### Step 5: Turn logs into evidence, not just numbers
180
 
181
+ We deliberately kept artifact generation rich (training curves, per-step logs, and combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.
182
+
183
+ ---
184
+
185
+ ## Colab, diagrams, and repository notebooks
186
+
187
+ The README documents the same training recipes with **public Colab** links and **static curve images** (repeated below under [Results and evidence](#results-and-evidence)). In the repo, the `notebook/` directory holds local copies aligned with each recipe.
188
+
189
+ ### Environment diagrams (from the README)
190
+
191
+ These SVGs are the high-level system view and training pipeline, as in the [README `Environment Architecture` and `Training Flow` sections](README.md#environment-architecture). You can open each asset directly: [**architecture (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) · [**training flow (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg).
192
+
193
+ **Architecture**
194
+
195
+ [Open architecture diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg)
196
+
197
+ <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg" width="800" alt="CyberSelfPlay environment architecture" />
198
+
199
+ **Training flow**
200
+
201
+ [Open training flow diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg)
202
+
203
+ <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg" width="800" alt="Training flow from SFT to GRPO and league" />
204
+
205
+ ### Colab notebooks and what each path does
206
+
207
+ | Method | Open in Colab | Local notebook in `notebook/` | In short |
208
+ |--------|----------------|-------------------------------|----------|
209
+ | **SFT → GRPO (Vanilla)** | [Open in Colab](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | `SFT_→_GRPO_(Vanilla).ipynb` | Supervised fine-tuning on trajectory-style data, then **vanilla GRPO** with the environment reward only: the baseline for single-policy learning. |
210
+ | **SFT → GRPO (Anti-Collapse)** | [Open in Colab](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | `SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb` | Same SFT + GRPO stack with **diversity / anti-collapse** regularization so the policy does not collapse to a tiny set of tool actions. |
211
+ | **League (PFSP)** | [Open in Colab](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | `League(PFSP).ipynb` | **League** training with **Prioritized Fictitious Self-Play**: opponents are sampled with weights tied to matchups, so the defender faces a shifting mixture of Red styles. |
212
+ | **League (PSRO)** | [Open in Colab](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | `League_(PSRO) (1).ipynb` | League loop using **PSRO-style** meta-updates on a population of policies (response oracles) rather than only PFSP sampling. |
213
+ | **League (PFSP + PSRO)** | [Open in Colab](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | `League_(PFSP_+_PSRO).ipynb` | **Combined** path: PFSP for opponent (policy) choice plus PSRO-style weighting so sampling and meta-game updates run together. |
214
+
215
+ The notebooks mirror the table in the [README `Training Approaches` section](README.md#-training-approaches-in-this-project); Colab is the shareable run surface, and the `notebook/` files are the offline copies in this repository.
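The PFSP row in the table describes opponents sampled "with weights tied to matchups". A minimal sketch of that idea follows, assuming the common hard-opponent weighting $(1 - p)^k$, where $p$ is Blue's win rate against a given Red archetype. The archetype names, the exponent, and the win rates are hypothetical, and the notebooks may use a different weighting function.

```python
import random
from typing import Dict


def pfsp_weights(win_rates: Dict[str, float], power: float = 2.0) -> Dict[str, float]:
    """Turn Blue's per-opponent win rates into normalized sampling weights.

    Opponents that Blue beats less often get more weight: w ~ (1 - p) ** power.
    """
    raw = {name: (1.0 - p) ** power for name, p in win_rates.items()}
    total = sum(raw.values())
    if total == 0.0:
        # Blue beats every opponent every time: fall back to uniform sampling.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def sample_opponent(win_rates: Dict[str, float], rng: random.Random, power: float = 2.0) -> str:
    """Draw one Red archetype according to the PFSP weights."""
    weights = pfsp_weights(win_rates, power)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical archetypes and win rates, for illustration only.
blue_win_rates = {"stealthy": 0.9, "aggressive": 0.5, "opportunistic": 0.2}
print(pfsp_weights(blue_win_rates))
print(sample_opponent(blue_win_rates, random.Random(0)))
```

Under this weighting, the archetype Blue struggles against most is drawn most often, which is what shifts the mixture of Red styles as Blue improves.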
216
 
217
  ---
218
 
219
  ## Results and evidence
220
 
221
+ ### Figures from training runs (same assets as the README)
222
+
223
+ Below are the **SFT / GRPO / league** curve figures linked from the README’s training table, plus the **SFT training loss** plot referenced for this write-up. Together they are the main visual evidence for convergence and per-method behavior.
224
+
225
+ **SFT training loss (cross-entropy on expert trajectories).** The run shows a clean optimization trajectory: loss starts around **3.2–3.3**, stays almost flat for the first few steps, then falls steeply from roughly step **5** through **25**. After that the curve flattens: from about step **30** onward training loss sits near **0.1** (steps on the x-axis go up to about **37**), which indicates that the SFT stage has found a low-NLL fit on the demonstration data.
226
+
227
+ <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193812/image_8_s3tzys.png" width="700" alt="SFT training loss vs steps" />
228
+
229
+ **SFT → GRPO (Vanilla).**
230
+
231
+ <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="700" alt="SFT to GRPO Vanilla metrics" />
232
+
233
+ **SFT → GRPO (Anti-Collapse).**
234
+
235
+ <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="700" alt="SFT to GRPO with anti-collapse regularization" />
236
+
237
+ **League (PFSP).**
238
+
239
+ <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="700" alt="League PFSP training curves" />
240
+
241
+ **League (PSRO).**
242
+
243
+ <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="700" alt="League PSRO training curves" />
244
+
245
+ **League (PFSP + PSRO).**
246
+
247
+ <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="700" alt="League PFSP and PSRO combined" />
248
+
249
+ ### Interpretation in one pass
250
+
251
  Across runs, we observe the expected pattern:
252
 
253
  - Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
 
262
 
263
  By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
264
 
265
+ Primary artifacts produced during training:
266
 
267
+ - consolidated training curves
268
+ - full optimization history logs
269
+ - per-step reward traces
270
+ - per-step behavior snapshots
271
+ - league-specific multi-round trend and meta-state reports
 
272
 
273
  These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
274
 
 
278
 
279
  CyberSelfPlay matters because it evaluates what real defenders need:
280
 
281
+ - multi-step, instruction-conditioned recovery,
282
  - adversarial interaction under uncertainty,
283
  - measurable progress beyond one-step task completion.
284
 
285
+ For **industry** practitioners, it is closer to incident response realities and to how blue teams think about time-to-detect, containment, and recovery.
286
+ For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior, with a **novel** mix of instruction following, tools, and adversarial pressure in one environment.
287
 
288
  For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
289
  For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
 
292
 
293
  ## Why this submission can stand out
294
 
295
+ - It tackles a setting tilted toward **real-world** conditions, combining multi-step behavior, partial observability, adversarial play, and mission objectives in one benchmark, which is an unusual and **impact**-relevant target for the field.
296
+ - It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure, with clear **industry**-minded artifacts (curves, logs, league history).
297
+ - It includes mathematical grounding, system-level structure, and a **novel** diagram-led presentation of the stack (see the [architecture](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) and [training flow](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) links in [Core components](#core-components) and [diagrams](#environment-diagrams-from-the-readme)), plus artifact-level evidence, in one coherent package.
298
+ - The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible, which is what teams need to trust deployment-related claims.
299
 
300
  ---
301
 
302
+ ## Where to go next
303
+
304
+ - **Project README** (formal POSG, rewards, training math, and full method table): [README.md](README.md)
305
+ - **Hugging Face Space (live environment):** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
306
 
307
+ The README and this blog point to each other so you can move between the specification-style overview and the narrative plus figures here.
README.md CHANGED
@@ -1,5 +1,5 @@
1
- ---
2
- title: CyberSelfPlay (Long-Horizon Cyber POSG)
3
  emoji: 🛡️
4
  colorFrom: blue
5
  colorTo: red
@@ -10,17 +10,22 @@ pinned: true
10
 
11
  # CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
12
 
13
- CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for long-horizon cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
 
 
14
 
15
  ## Environment on Hugging Face Space
16
 
17
- - **Live Space:** `https://huggingface.co/spaces/HarshitShri026`
 
18
 
19
  ---
20
 
21
  ## Problem and Capability Gap
22
 
23
- Most agent benchmarks are short-horizon and single-agent. Cyber defense in practice is long-horizon, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
 
 
24
 
25
  ---
26
 
@@ -108,7 +113,7 @@ r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_
108
  \end{aligned}
109
  $$
110
 
111
- Concrete rubric implementation is in `cyber_selfplay_env/rubrics.py`.
112
 
113
  ---
114
 
@@ -137,9 +142,9 @@ We experiment across **SFT + GRPO baselines**, **reward smoothing**, **diversity
137
  | **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
138
  | **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
139
  | **🔹 League (Multi-Policy RL)** ||||
140
- | **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/1mDk9pzeRudjmXhU0VBVJymqF5An8bHhk?usp=sharing) | Win-rate curves |
141
- | **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188537/League_PSRO_wd3esy.png" width="350"/> |
142
- | **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/1OaOQYmoq2ni2FjCUukBkpt3BpT55uhX9?usp=sharing) | Meta + Reward curves |
143
 
144
  ---
145
 
@@ -239,7 +244,7 @@ Score each completion with reward $R^{(j)}$, compute group-relative advantages,
239
 
240
  ---
241
 
242
- ## Long-Horizon Scenario Scale
243
 
244
  | scenario | turns | instructions | checkpoint stride |
245
  | --- | ---: | ---: | ---: |
@@ -255,20 +260,20 @@ Instruction progress and violation signals are tracked in environment metadata.
255
 
256
  Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
257
 
258
- Common result artifacts produced by the training scripts include:
259
 
260
- - `training_curves.png`
261
- - `log_history.json`
262
- - `train_metrics.log`
263
- - `per_step_rewards.jsonl`
264
- - per-step curve images under `curves/`
265
- - league-specific outputs such as `training_curves_all_rounds.png`, `league_state.jsonl`, and `log_history_combined.json`
266
 
267
  ---
268
 
269
  ## Why It Matters
270
 
271
- - **Security operations relevance:** models long-horizon defense decisions closer to real incident response.
272
  - **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
273
  - **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.
274
 
 
1
+ ---
2
+ title: CyberSelfPlay (Cyber POSG)
3
  emoji: 🛡️
4
  colorFrom: blue
5
  colorTo: red
 
10
 
11
  # CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
12
 
13
+ **Important links:** [League (PFSP + PSRO) Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
14
+
15
+ CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
16
 
17
  ## Environment on Hugging Face Space
18
 
19
+ - **Live Space:** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
20
+ - **Narrative, Colab context, and results figures:** [Blogs](Blogs.md)
21
 
22
  ---
23
 
24
  ## Problem and Capability Gap
25
 
26
+ Most agent benchmarks are short and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
27
+
28
+ **Connection to long-horizon and self-play themes:** the setting stresses **(super) long-horizon planning and instruction following**: episodes with many steps, many playbook instructions, and security rewards that are often sparse or delayed, so the agent must track state, recover from missteps, and keep coherent plans across long runs. It also supports **self-improvement through interaction**: the training stack uses **SFT → GRPO** and **league (PFSP / PSRO / mix)** to keep pressure adaptive. Opponents and rounds change, so the LLM policy is not tuned on a static task set but on an evolving, self-play-style curriculum over the same family of tasks.
29
 
30
  ---
31
 
 
113
  \end{aligned}
114
  $$
115
 
116
+ The reward rubric is implemented directly in the environment’s scoring logic.
117
 
118
  ---
119
 
 
142
  | **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
143
  | **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
144
  | **🔹 League (Multi-Policy RL)** ||||
145
+ | **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="350"/> |
146
+ | **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="350"/> |
147
+ | **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="350"/> |
148
 
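The Anti-Collapse row above adds a diversity penalty to avoid mode collapse. One way to sketch such a penalty is to measure the share of the most-used tool across a group of completions and penalize only the excess over a threshold. The threshold, the scale, and every tool name except `execute_instruction` are hypothetical, and the notebook's actual shaping term may differ.

```python
from collections import Counter
from typing import Sequence


def diversity_penalty(tool_names: Sequence[str], threshold: float = 0.6, scale: float = 1.0) -> float:
    """Penalty that grows once a single tool exceeds `threshold` of the group.

    Returns 0.0 for an empty group or when no tool dominates.
    """
    if not tool_names:
        return 0.0
    top_count = Counter(tool_names).most_common(1)[0][1]
    top_share = top_count / len(tool_names)
    return scale * max(0.0, top_share - threshold)


# Hypothetical group of 10 completions where one tool accounts for 70% of the calls.
group = ["execute_instruction"] * 7 + ["isolate_host", "scan_logs", "patch_service"]
print(round(diversity_penalty(group), 3))  # 0.1
```

In a GRPO setup the penalty would be subtracted from the reward of each completion in the group, so a policy that emits valid but repetitive tool calls stops being rewarded for them.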
149
  ---
150
 
 
244
 
245
  ---
246
 
247
+ ## Scenario Scale
248
 
249
  | scenario | turns | instructions | checkpoint stride |
250
  | --- | ---: | ---: | ---: |
 
260
 
261
  Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
262
 
263
+ Common result artifacts produced by training include:
264
 
265
+ - consolidated training curves,
266
+ - step-by-step optimization history,
267
+ - metrics logs,
268
+ - per-sample reward traces,
269
+ - per-step visualization snapshots,
270
+ - and, for league experiments, combined multi-round trend and meta-state reports.
271
 
272
  ---
273
 
274
  ## Why It Matters
275
 
276
+ - **Security operations relevance:** models multi-step defense decisions closer to real incident response.
277
  - **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
278
  - **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.
279
 
cyber_selfplay_env/simulator.py CHANGED
@@ -159,7 +159,7 @@ class CyberSimulator:
  pending = [x for x in mission["instructions"] if not x["done"]]
  if pending:
      current = pending[0]
-     # Requires matching tool in params to model long-horizon instruction following.
+     # Requires matching tool in params to model multi-step instruction following.
      requested_tool = ""
      if params and isinstance(params.get("required_tool"), str):
          requested_tool = params["required_tool"]
notebook/League(PFSP).ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/League_(PFSP_+_PSRO).ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/League_(PSRO) (1).ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebook/SFT_→_GRPO_(Vanilla).ipynb ADDED
The diff for this file is too large to render. See raw diff
 
openenv.yaml CHANGED
@@ -1,16 +1,139 @@
1
  env:
2
  name: "CyberSelfPlay"
3
- author: "Hackathon Team"
4
- description: "Long-horizon red-vs-blue cyber POSG with adaptive self-play curriculum and delayed instruction-following rewards."
5
  version: "0.1.0"
6
 
7
  server:
8
  host: "0.0.0.0"
9
  port: 7870
10
  workers: 1
11
  module: "server.app:app"
12
 
13
  features:
14
  multi_reward: true
15
  prevent_hacking: true
16
  curriculum_scheduler: true
1
  env:
2
  name: "CyberSelfPlay"
3
+ author: "Team Neuron"
4
+ description: "Long-horizon Red-vs-Blue cyber defense POSG with partial observability, stochastic transitions, mission-instruction progress signals, and league-style opponent pressure for robust policy learning."
5
  version: "0.1.0"
6
+ homepage: "https://huggingface.co/spaces/HarshitShri026"
7
+ domain: "cyber-defense"
8
+ tags:
9
+ - "openenv"
10
+ - "cybersecurity"
11
+ - "red-vs-blue"
12
+ - "multi-agent"
13
+ - "multi-step"
14
+ - "partially-observable"
15
+ - "instruction-following"
16
+ - "reinforcement-learning"
17
+ - "long-horizon"
18
+ - "self-play"
19
+ - "adaptive-curriculum"
20
+ # Aligns with program themes: (1) long-horizon planning & instruction following;
21
+ # (2) self-improvement via self-play and adaptive opponent pressure.
22
+ program_themes:
23
+ long_horizon_planning_and_instruction_following: >
24
+ Episodes and scenarios scale to many steps and many playbook instructions, with
25
+ sparse and delayed security and mission rewards. Agents must decompose response
26
+ goals, track partial state and instruction progress, and maintain coherent
27
+ behavior across long trajectories (beyond one-shot or shallow next-step reasoning).
28
+ self_improvement_and_adaptive_curricula: >
29
+ Red versus Blue interaction provides explicit self-play over a defined family of
30
+ cyber-defense tasks. SFT, GRPO, and league training (PFSP, PSRO, and mixed
31
+ meta-scheduling) vary opponent mix and round pressure, yielding adaptive-curriculum
32
+ style learning and recursive policy improvement on the same environment interface.
33
+ task_type: "sequential_decision_making"
34
+ horizon:
35
+ min_steps: 60
36
+ max_steps: 180
37
+ scenarios:
38
+ - name: "small"
39
+ turns: 60
40
+ instructions: 40
41
+ checkpoint_stride: 8
42
+ - name: "medium"
43
+ turns: 100
44
+ instructions: 120
45
+ checkpoint_stride: 12
46
+ - name: "large"
47
+ turns: 180
48
+ instructions: 300
49
+ checkpoint_stride: 20
50
+ agents:
51
+ red:
52
+ role: "attacker"
53
+ objective: "maximize foothold/privilege/lateral movement/exfiltration while avoiding detection"
54
+ blue:
55
+ role: "defender"
56
+ objective: "detect/contain/recover while completing ordered mission instructions"
57
+ observation_space:
58
+ red: "partial observability over attack-relevant state and outcomes"
59
+ blue: "partial observability over defense state, mission context, and progress metadata"
60
+ action_space:
61
+ red: "structured cyber actions for adversarial operations"
62
+ blue: "structured CyberAction JSON tool calls"
63
+ reward_model:
64
+ type: "multi-component"
65
+ notes:
66
+ - "dense + delayed terms"
67
+ - "instruction progress/checkpoint/violation shaping"
68
+ - "near-zero-sum coupling with collateral cost term"
69
+ references:
70
+ project_overview: "Main project overview and environment description"
71
+ technical_blog: "Narrative write-up with math, training journey, and results"
72
+ environment_components: "Simulator, rubrics, metrics, scenarios, and tool interfaces"
73
+ training_process: "For full training process details, refer to README.md"
74
+ notebooks:
75
+ - "notebook/SFT_→_GRPO_(Vanilla).ipynb"
76
+ - "notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb"
77
+ - "notebook/League(PFSP).ipynb"
78
+ - 'notebook/League_(PSRO) (1).ipynb'
79
+ - "notebook/League_(PFSP_+_PSRO).ipynb"
80
+ training_paths:
81
+ - "Single-policy SFT to GRPO refinement"
82
+ - "League-based SFT to round-wise GRPO with PFSP/PSRO scheduling"
83
 
84
  server:
85
  host: "0.0.0.0"
86
  port: 7870
87
  workers: 1
88
  module: "server.app:app"
89
+ routes_hint:
90
+ - "/health"
91
+ - "/info"
92
+ - "/artifacts"
93
+ api_style: "OpenEnv-compatible FastAPI service"
94
 
95
  features:
96
  multi_reward: true
97
  prevent_hacking: true
98
  curriculum_scheduler: true
99
+ partial_observability: true
100
+ stochastic_dynamics: true
101
+ multi_agent: true
102
+ instruction_tracking: true
103
+ adversarial_interaction: true
104
+ league_training_support: true
105
+ pfsp_support: true
106
+ psro_support: true
107
+
108
+ training:
109
+ primary_pipelines:
110
+ - name: "sft_grpo"
111
+ implementation: "single-policy training path"
112
+ summary: "SFT warm start followed by single-policy GRPO refinement"
113
+ - name: "sft_league_grpo"
114
+ implementation: "league-based training path"
115
+ summary: "SFT + league rounds with PFSP/PSRO/mix opponent scheduling and mini-GRPO updates"
116
+ artifacts:
117
+ common:
118
+ - "training curves"
119
+ - "optimization history logs"
120
+ - "metrics logs"
121
+ - "per-sample reward traces"
122
+ - "per-step visualizations"
123
+ league:
124
+ - "combined multi-round trend curves"
125
+ - "league state trajectory logs"
126
+ - "combined round history logs"
127
+
128
+ evaluation:
129
+ built_in_metrics:
130
+ - "instruction_progress_rate"
131
+ - "instruction_violation_rate"
132
+ - "mttd"
133
+ - "mttr"
134
+ - "exfiltration_pressure"
135
+ - "checkpoint_progress"
136
+ success_characterization:
137
+ - "improved environment-aligned Blue reward after SFT->GRPO"
138
+ - "stable action diversity under anti-collapse shaping"
139
+ - "robustness gains under league opponent variation"
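
The `horizon` and `scenarios` entries added to `openenv.yaml` above can be cross-checked with a few lines of Python. This is only a minimal sketch: the values are copied by hand from the diff rather than parsed from the file, and the helper names `horizon_matches` and `instructions_per_turn` are hypothetical, not part of the repository.

```python
# Values copied from the openenv.yaml diff above (not loaded from the file).
HORIZON = {"min_steps": 60, "max_steps": 180}
SCENARIOS = [
    {"name": "small", "turns": 60, "instructions": 40, "checkpoint_stride": 8},
    {"name": "medium", "turns": 100, "instructions": 120, "checkpoint_stride": 12},
    {"name": "large", "turns": 180, "instructions": 300, "checkpoint_stride": 20},
]


def horizon_matches(horizon, scenarios):
    """Return True if the declared horizon spans exactly the scenario turn counts."""
    turns = [s["turns"] for s in scenarios]
    return min(turns) == horizon["min_steps"] and max(turns) == horizon["max_steps"]


def instructions_per_turn(scenario):
    """Average number of playbook instructions per turn in a scenario."""
    return scenario["instructions"] / scenario["turns"]


if __name__ == "__main__":
    print("horizon consistent:", horizon_matches(HORIZON, SCENARIOS))
    for s in SCENARIOS:
        print(s["name"], round(instructions_per_turn(s), 2))
```

With these values the declared horizon matches the smallest and largest scenarios, and the instruction density rises from below one per turn in `small` to above one per turn in `medium` and `large`. If every instruction is meant to be completed within the episode (an assumption the config does not state), Blue would need to advance more than one instruction per turn on average in the two larger scenarios.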
pyproject.toml CHANGED
@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
5
  [project]
6
  name = "openenv-cyber-selfplay"
7
  version = "0.1.0"
8
- description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2: long-horizon reasoning)."
9
  readme = "README.md"
10
  requires-python = ">=3.10"
11
  license = { text = "MIT" }
 
5
  [project]
6
  name = "openenv-cyber-selfplay"
7
  version = "0.1.0"
8
+ description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2: multi-step reasoning)."
9
  readme = "README.md"
10
  requires-python = ">=3.10"
11
  license = { text = "MIT" }