File size: 22,482 Bytes
06332ca
 
8baba8f
922c569
3d9ad24
06332ca
35dad3a
f19a15d
 
 
06332ca
f19a15d
 
 
 
 
 
 
 
06332ca
 
 
 
 
 
 
f19a15d
 
 
 
 
8baba8f
f19a15d
8baba8f
 
 
 
 
f19a15d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
06332ca
f19a15d
 
 
 
 
 
 
 
 
 
 
 
06332ca
f19a15d
06332ca
8baba8f
 
f19a15d
 
 
06332ca
 
f19a15d
 
 
 
 
 
8baba8f
 
 
 
 
 
 
 
 
 
f19a15d
 
 
8baba8f
f19a15d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
06332ca
f19a15d
 
 
 
 
 
 
 
 
 
 
8baba8f
f19a15d
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
06332ca
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8baba8f
 
 
06332ca
 
f19a15d
 
 
 
 
06332ca
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f19a15d
 
 
 
 
 
 
 
 
 
 
 
 
 
06332ca
f19a15d
06332ca
 
 
 
 
f19a15d
 
 
 
 
 
 
 
 
06332ca
f19a15d
 
 
06332ca
 
f19a15d
 
 
 
 
 
 
 
06332ca
 
 
 
f19a15d
 
 
06332ca
 
 
35dad3a
f19a15d
06332ca
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
# CyberSelfPlay: Building a Cyber Defense Environment

**Training Script Link:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)

**Play a gamebased on this environment:** [Game](https://openenv-ui.vercel.app/)

**Documentation:** the math-led overview, full training table, and repository layout are in the [project README](README.md) (and this file links back to it in [Where to go next](#where-to-go-next)). For the deployed OpenEnv HTTP API, use the live Space app [https://harshitshri026-cyberselfplay-env.hf.space](https://harshitshri026-cyberselfplay-env.hf.space/) with interactive docs at [Swagger (`/docs`)](https://harshitshri026-cyberselfplay-env.hf.space/docs) and [ReDoc (`/redoc`)](https://harshitshri026-cyberselfplay-env.hf.space/redoc).

## Why this environment exists

In **real-world** security operations, impact is not a single model score. It is whether a team can run long incident timelines under uncertainty while adversaries adapt. **Industry** and government playbooks for detection, containment, and recovery read like multi-step missions, not one-shot classifiers. Yet most agent benchmarks are short-horizon and mostly single-agent. Cyber defense in practice is neither:

- decisions unfold over many steps,
- observations are partial and noisy,
- an attacker adapts while the defender is acting,
- success is not one move, but sustained containment and recovery.

CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.

What makes this direction **novel** is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (multi-step behavior + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic, and where **industry**-relevant **impact** is actually decided.

### How this lines up with long-horizon and self-improvement themes

**Theme: (super) long-horizon planning and instruction following.** Missions are **long-running** by design: scenarios scale to **many** instructions and checkpoints, with **sparse and delayed** rewards from security and mission rubrics. The agent must **decompose** response goals, **track** state and playbook progress under partial visibility, and **recover** from early mistakes over **extended trajectories**—closer to durable planning than one-shot next responses.

**Theme: self-improvement and adaptive curricula.** The **Red vs. Blue** loop is explicit **self-play** over a **defined** scenario family. **League** work (PFSP, PSRO, and mixed) plus round-based **GRPO** changes the **opponent mix** and pressure across training, so improvement is not fitting a static list of tasks but **recursive capability growth** driven by an **adaptive curriculum** and interaction feedback on the same environment.

---

## What the environment is

CyberSelfPlay is an OpenEnv-compatible two-player environment. **Red** (attacker) and **Blue** (defender) share one hidden world state; each step, one side submits a move, the simulator applies it, and the next observation is for that actor. Blue is the policy you typically train; Red can be scripted, pooled, or treated as a league opponent archetype. Episodes are bounded by a scenario-specific turn budget, with termination on exfil success, or on horizon with a win condition tied to mission instruction completion.

**Blue** chooses from a fixed defender tool set: SIEM and triage, isolation, account controls, secret rotation, patching, hardening, backups, forensics, IOC publishing, and playbook tools such as `execute_instruction` and `checkpoint_plan`—each call names a `tool_name`, an optional `target` (host or asset), and a `params` object for tool-specific arguments, plus an optional `rationale` string. **Red** uses a separate attacker tool set: recon, enumeration, exploit and credential access, pivoting, persistence, exfil pipeline stages, cover-up, and recovery-plan sabotage. Invalid tools return a non-terminal observation with an error in metadata rather than crashing the session.

- Blue receives **partial** public state, noisy telemetry, incident summary fields, and rich **metadata** (reward components, simulator events, POSG metrics, curriculum/scenario tags).
- **CyberAction** is the OpenEnv `Action` type: JSON-shaped, validated per side before the simulator runs.
- Rewards encode both security outcomes (detect/contain/recover) and mission outcomes (instruction progress/checkpoints/violations), with components exposed for logging and for RL shaping.

Formally, the environment is modeled as a partially observable stochastic game (POSG):

$$
\mathcal{G}=\langle \mathcal{S},\mathcal{A}_R,\mathcal{A}_B,\mathcal{O}_R,\mathcal{O}_B,T,Z_R,Z_B,r_R,r_B,\gamma \rangle
$$

with objective

$$
J_i(\pi_i,\pi_{-i})=\mathbb{E}\left[\sum_{t=0}^{H}\gamma^t r_i\left(s_t,a_t^R,a_t^B\right)\right],\quad i\in\{R,B\}.
$$

Near-zero-sum coupling is represented as:

$$
r_B=-r_R-\lambda C_{\mathrm{collateral}}.
$$

In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders see in **real-world** response and the trade-offs that show up in **industry** debriefs.

---

## Core components

At a high level, the system has:

1. **Environment core** (`cyber_selfplay_env/`)
   - hidden-state simulator and transitions,
   - reward rubrics,
   - metrics and progress tracking,
   - scenario definitions and tool interfaces.
2. **API server**
   - OpenEnv endpoints for interaction.
3. **Training pipelines**
   - a single-policy path: **SFT** on demonstration or filtered rollouts, then **GRPO** on environment reward;
   - and a league-based path that uses the **same** SFT and GRPO machinery: an initial SFT stage, then for each **league round** a **mini-GRPO** segment together with **PFSP / PSRO** (or mixed) opponent meta-updates over Red archetypes or policy pools.

Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.

**System figures (open as links or view inline in [Colab, diagrams, and repository notebooks](#colab-diagrams-and-repository-notebooks)):** the [**environment architecture** diagram (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) shows how the environment, server, and training stack connect; the [**end-to-end training flow** (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) summarizes SFT, GRPO, and league training at a glance. Placing the links here matches how engineers skim a project: first the shape of the system, then the pipeline.

---

## Observations, actions, rewards

### Observations

Observations are **player-local**. After each `step`, the `CyberObservation` for the active actor includes:

- **`public_state`**: a partial view of the world. For Blue, this typically includes the simulation time step, a short history of **detections**, **business impact**, a coarse incident count, and **instruction progress** (counts of completed, violated, and total playbook instructions). Red sees a different slice (e.g., limited target hints and detection pressure)—so neither side sees the full state vector.
- **`telemetry`**: a short, possibly noisy list of events (e.g., recent detections for Blue; weak indirect signals for Red), reinforcing partial observability.
- **`incident_summary`**: high-level fields such as termination, winner, exfil status, and time index—usable as a quick situational readout.
- **`reward`**: scalar reward from the **previous** action (after `reset`, typically zero).
- **`done`**: whether the episode has ended.
- **`metadata`**: on successful steps, this carries **decomposed reward components**, raw **events** from the simulator, **POSG metrics** snapshots (e.g., exfil and instruction rates), and **curriculum** information (scenario name, rolling win rates, episode counts). On invalid actions, metadata may contain an **error** string instead.

Reset returns an initial Blue observation with scenario/curriculum hints and a note that play alternates between Red and Blue.

### Actions

The policy does not mutate the simulator with free text: it must emit a **`CyberAction`**: `actor` (`"red"` or `"blue"`), `tool_name` (must be in that side’s allowed set), `target` (e.g., host id), `params` (a dict for tools that need extra fields, such as which capability an `execute_instruction` step requires), and optional `rationale`. Red and Blue each have distinct tool vocabularies; the environment validates the tool name before calling the **CyberSimulator** `step` logic.

### Reward design

The reward law combines dense and delayed signals. The scripts use environment reward as the dominant term, with additional shaping in GRPO training.

Red side:

$$
\begin{aligned}
r_R &= w_1 \mathbb{1}_{\mathrm{foothold}} + w_2 \mathbb{1}_{\mathrm{priv}} + w_3 \mathbb{1}_{\mathrm{lateral}} + w_4 \mathbb{1}_{\mathrm{exfil}} \\
&\quad - w_5 \mathbb{1}_{\mathrm{detect}} + w_6 \mathbb{1}_{\mathrm{plan\_sabotage}} - \eta_R
\end{aligned}
$$

Blue side:

$$
\begin{aligned}
r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_3 \mathbb{1}_{\mathrm{recover}} - v_4 \mathbb{1}_{\mathrm{exfil}} \\
&\quad + v_5 \mathbb{1}_{\mathrm{instr\_progress}} + v_6 \mathbb{1}_{\mathrm{checkpoint}} - v_7 \mathbb{1}_{\mathrm{instr\_violation}} \\
&\quad + v_8 \rho_{\mathrm{inst}} - \eta_B
\end{aligned}
$$

Why this matters: these equations make it hard to “game” the benchmark with shallow tricks. The agent is pushed toward useful defense behavior across time, not just short-term score spikes.

---

## Training story: what we tried, what failed, what changed

### Step 1: Start with SFT -> vanilla GRPO

We started with the direct path:

- Supervised Fine-Tuning (SFT) on heuristic rollouts,
- then Group Relative Policy Optimization (GRPO) using environment reward.

Core GRPO intuition:

$$
\{y^{(1)},\ldots,y^{(G)}\}\sim \pi_\theta(\cdot\mid x)
$$

Sample a group of completions per prompt, score each completion, compute group-relative advantages, and update policy parameters.

At this point, we had a baseline that could parse actions and follow structure, but behavior quality still depended heavily on exploration quality.

### Step 2: We hit mode collapse pressure

During early iterations, action diversity degraded: one tool could dominate generated actions. This reduces effective exploration and hurts credit assignment.

In practice, this looked like “safe but repetitive” behavior: valid JSON, but less tactical variety. The agent was syntactically correct more often than strategically useful.

### Step 3: Add stabilization in single-policy GRPO

In the single-policy training path, we introduced shaping aligned with this issue:

- group-level diversity penalty when one tool dominates a batch,
- additional nudge against overusing `execute_instruction` when SFT bias is high,
- continued logging of unique-tools-per-step as a diversity meter.

This improved robustness of training dynamics while keeping environment reward primary.

This step is important because it addresses a common failure mode in small-model RL: if action variety collapses too early, learning plateaus quickly.

### Step 4: Move to league training for broader robustness

Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We then moved to a league-based training loop that still builds on **SFT** and **GRPO**—not only meta-game structure. A typical run starts with the same **SFT** style initialization used in the non-league recipes, then repeats for **multiple league rounds**: sample or weight **Red** opponents (PFSP, PSRO, or a **mix**), build rollouts and prompts in line with the sampled profile, and run a **mini-GRPO** phase per round on environment reward (with the same family of Unsloth/LoRA/TRL tooling as the vanilla pipeline). **PFSP** and **PSRO** (or their combination) update who gets sampled or how policy weights evolve across rounds, while the inner learning signal remains **SFT-initialized policy improvement via GRPO** under changing opponent pressure. Round-level updates use **replicator**-style or win-rate-weighted rules so the training mixture stays adaptive.

PFSP weighting:

$$
p_j \propto f(w_j),\qquad f(w)=w(1-w)
$$

PSRO-style replicator update:

$$
p_i' \propto p_i\left(1+\eta(u_i-\bar{u})\right),\qquad \bar{u}=\sum_i p_i u_i
$$

This made training pressure adaptive across rounds rather than fixed to one opponent profile.

This is where behavior starts to look more “field-like”: the defender is not tuned to one attacker template, but pushed by a moving mixture of attacker styles.

### Step 5: Turn logs into evidence, not just numbers

We deliberately kept artifact generation rich (training curves, per-step logs, and combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.

---

## Colab, diagrams, and repository notebooks

The README documents the same training recipes with **public Colab** links and **static curve images** (repeated below under [Results and evidence](#results-and-evidence)). In the repo, the `notebook/` directory holds local copies aligned with each recipe.

### Environment diagrams (from the README)

These SVGs are the high-level system view and training pipeline, as in the [README `Environment Architecture` and `Training Flow` sections](README.md#environment-architecture). You can open each asset directly: [**architecture (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) · [**training flow (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg).

**Architecture**

[Open architecture diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg)

<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg" width="800" alt="CyberSelfPlay environment architecture" />

**Training flow**

[Open training flow diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg)

<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg" width="800" alt="Training flow from SFT to GRPO and league" />

### Colab notebooks and what each path does

| Method | Open in Colab | Local notebook in `notebook/` | In short |
|--------|----------------|-------------------------------|----------|
| **SFT → GRPO (Vanilla)** | [Open in Colab](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | `SFT_→_GRPO_(Vanilla).ipynb` | Supervised fine-tuning on trajectory-style data, then **vanilla GRPO** with the environment reward only: the baseline for single-policy learning. |
| **SFT → GRPO (Anti-Collapse)** | [Open in Colab](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | `SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb` | Same SFT + GRPO stack with **diversity / anti-collapse** regularization so the policy does not collapse to a tiny set of tool actions. |
| **League (PFSP)** | [Open in Colab](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | `League(PFSP).ipynb` | **SFT** warm-start and per-round **mini-GRPO** on environment reward, with **Prioritized Fictitious Self-Play** so Red-style opponents are sampled with matchup-driven weights. |
| **League (PSRO)** | [Open in Colab](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | `League_(PSRO) (1).ipynb` | **SFT** and **GRPO** as above, with **PSRO**-style meta-updates on the opponent or policy pool in addition to league sampling. |
| **League (PFSP + PSRO)** | [Open in Colab](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | `League_(PFSP_+_PSRO).ipynb` | **SFT**, **GRPO**, and **PFSP** + **PSRO** together: opponent (policy) choice and replicator-style meta-updates in one loop. |

The notebooks mirror the table in the [README `Training Approaches` section](README.md#-training-approaches-in-this-project); Colab is the shareable run surface, and the `notebook/` files are the offline copies in this repository.

---

## Results and evidence

### Figures from training runs (same assets as the README)

Below are the **SFT / GRPO / league** curve figures linked from the README’s training table, plus the **SFT training loss** plot referenced for this write-up. Together they are the main visual evidence for convergence and per-method behavior.

**SFT training loss (cross-entropy on expert trajectories).** The run shows a clean optimization trajectory: loss starts around **3.2–3.3**, stays almost flat for the first few steps, then falls steeply from roughly step **5** through **25**. After that the curve flattens: from about step **30** onward training loss sits near **0.1** (steps on the x-axis go up to about **37**), which indicates that the SFT stage has found a low-NLL fit on the demonstration data.

<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193812/image_8_s3tzys.png" width="700" alt="SFT training loss vs steps" />

**SFT → GRPO (Vanilla).**

<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="700" alt="SFT to GRPO Vanilla metrics" />

**SFT → GRPO (Anti-Collapse).**

<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="700" alt="SFT to GRPO with anti-collapse regularization" />

**League (PFSP).**

<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="700" alt="League PFSP training curves" />

**League (PSRO).**

<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="700" alt="League PSRO training curves" />

**League (PFSP + PSRO).**

<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="700" alt="League PFSP and PSRO combined" />

### Interpretation in one pass

Across runs, we observe the expected pattern:

- Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
- Diversity shaping reduces collapse and stabilizes learning in single-policy training.
- League mode (PFSP/PSRO/mix) produces richer multi-round dynamics and better robustness across opponent types.

A useful way to read these results is:

1. **Can the model act in valid structured form?** (SFT gives this base)
2. **Can it improve through interaction feedback?** (GRPO gives this climb)
3. **Can it hold up under varied opponents?** (league rounds test this directly)

By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.

Primary artifacts produced during training:

- consolidated training curves
- full optimization history logs
- per-step reward traces
- per-step behavior snapshots
- league-specific multi-round trend and meta-state reports

These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.

---

## Why it matters

CyberSelfPlay matters because it evaluates what real defenders need:

- multi-step, instruction-conditioned recovery,
- adversarial interaction under uncertainty,
- measurable progress beyond one-step task completion.

For **industry** practitioners, it is closer to incident response realities and to how blue teams think about time-to-detect, containment, and recovery.  
For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior, with a **novel** mix of instruction following, tools, and adversarial pressure in one environment.

For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.  
For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.

---

## Why this submission can stand out

- It tackles a **real-world**-tilted setting that combines multi-step behavior, partial observability, adversarial play, and mission objectives in one benchmark, which is an unusual and **impact**-relevant target for the field.
- It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure, with clear **industry**-minded artifacts (curves, logs, league history).
- It includes mathematical grounding, system-level structure, and diagram-level **novelty** in how the stack is presented (see [architecture](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) and [training flow](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) links in [Core components](#core-components) and [diagrams](#environment-diagrams-from-the-readme)), plus artifact-level evidence in one coherent package.
- The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible, which is what teams need to trust deployment-related claims.

---

## Where to go next

- **Project README** (formal POSG, rewards, training math, and full method table): [README.md](README.md)
- **Hugging Face Space (live environment):** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026) — running API: [harshitshri026-cyberselfplay-env.hf.space](https://harshitshri026-cyberselfplay-env.hf.space/), [Swagger `/docs`](https://harshitshri026-cyberselfplay-env.hf.space/docs), [ReDoc `/redoc`](https://harshitshri026-cyberselfplay-env.hf.space/redoc)

The README and this blog point to each other so you can move between the specification-style overview and the narrative plus figures here.