Spaces: HarshitShri026/cyberselfplay-env


HarshitShri026 committed on Apr 26

Commit

06332ca

1 Parent(s): 756be0d

Update Blog and Readme

Files changed (10)

Blogs.md +105 -27
README.md +23 -18
cyber_selfplay_env/simulator.py +1 -1
notebook/League(PFSP).ipynb +0 -0
notebook/League_(PFSP_+_PSRO).ipynb +0 -0
notebook/League_(PSRO) (1).ipynb +0 -0
notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb +0 -0
notebook/SFT_→_GRPO_(Vanilla).ipynb +0 -0
openenv.yaml +125 -2
pyproject.toml +1 -1

Blogs.md CHANGED

@@ -1,8 +1,12 @@
-# CyberSelfPlay: Building a Long-Horizon Cyber Defense Environment
 ## Why this environment exists
-Most agent benchmarks are short-horizon and mostly single-agent. Real cyber defense is neither:
 - decisions unfold over many steps,
 - observations are partial and noisy,
@@ -11,7 +15,13 @@ Most agent benchmarks are short-horizon and mostly single-agent. Real cyber defe
 CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
-What makes this direction different is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (long horizon + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic.
 ---
@@ -42,7 +52,7 @@ $$
 r_B=-r_R-\lambda C_{\mathrm{collateral}}.
 $$
-In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders face in real systems.
 ---
@@ -55,14 +65,16 @@ At a high level, the system has:
    - reward rubrics,
    - metrics and progress tracking,
    - scenario definitions and tool interfaces.
-2. **API server** (`server/app.py`)
    - OpenEnv endpoints for interaction.
-3. **Training scripts** (`train/`)
-   - `kaggle_grpo.py` (single-policy SFT -> GRPO),
-   - `kaggle_grpo_league.py` (SFT -> league rounds + mini-GRPO + PFSP/PSRO updates).
 Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
 ---
 ## Observations, actions, rewards
@@ -129,7 +141,7 @@ In practice, this looked like “safe but repetitive” behavior: valid JSON, bu
 ### Step 3: Add stabilization in single-policy GRPO
-In `kaggle_grpo.py`, we introduced shaping aligned with this issue:
 - group-level diversity penalty when one tool dominates a batch,
 - additional nudge against overusing `execute_instruction` when SFT bias is high,
@@ -141,7 +153,7 @@ This step is important because it addresses a common failure mode in small-model
 ### Step 4: Move to league training for broader robustness
-Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We moved to `kaggle_grpo_league.py`:
 - run multiple league rounds,
 - pick Red archetypes using PFSP / PSRO / mix,
@@ -166,12 +178,76 @@ This is where behavior starts to look more “field-like”: the defender is not
 ### Step 5: Turn logs into evidence, not just numbers
-We deliberately kept artifact generation rich (`training_curves.png`, per-step JSONL logs, combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.
 ---
 ## Results and evidence
 Across runs, we observe the expected pattern:
 - Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
@@ -186,14 +262,13 @@ A useful way to read these results is:
 By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
-Primary artifacts produced by the training scripts:
-- `training_curves.png`
-- `log_history.json`
-- `train_metrics.log`
-- `per_step_rewards.jsonl`
-- per-step curves under `curves/`
-- league-specific: `training_curves_all_rounds.png`, `league_state.jsonl`, `log_history_combined.json`
 These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
@@ -203,12 +278,12 @@ These files are the evidence trail for reward trends, variance, action diversity
 CyberSelfPlay matters because it evaluates what real defenders need:
-- long-horizon, instruction-conditioned recovery,
 - adversarial interaction under uncertainty,
 - measurable progress beyond one-step task completion.
-For practitioners, it is closer to incident response realities.
-For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior.
 For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
 For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
@@ -217,13 +292,16 @@ For evaluation-focused work, it provides a bridge between toy tasks and operatio
 ## Why this submission can stand out
-- It tackles a hard setting that combines long horizon, partial observability, adversarial play, and mission objectives in one benchmark.
-- It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure.
-- It includes mathematical grounding, system-level structure, and artifact-level evidence in one coherent package.
-- The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible.
 ---
-## Environment link
-- Hugging Face Space: `https://huggingface.co/spaces/HarshitShri026`

+# CyberSelfPlay: Building a Cyber Defense Environment
+**Important links:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
+**Documentation:** the math-led overview, full training table, and repository layout are in the [project README](README.md) (and this file links back to it in [Where to go next](#where-to-go-next)).
 ## Why this environment exists
+In **real-world** security operations, impact is not a single model score. It is whether a team can run long incident timelines under uncertainty while adversaries adapt. **Industry** and government playbooks for detection, containment, and recovery read like multi-step missions, not one-shot classifiers. Yet most agent benchmarks are short-horizon and mostly single-agent. Cyber defense in practice is neither:
 - decisions unfold over many steps,
 - observations are partial and noisy,
 CyberSelfPlay was built to model this gap directly: a stochastic Red-vs-Blue world where Blue must execute mission playbooks while Red applies adversarial pressure.
+What makes this direction **novel** is that we do not treat “good defense” as a single yes/no check. The agent must keep making good choices for many steps in a row, under changing pressure, while mission goals are still active. That combination (multi-step behavior + partial visibility + active adversary + mission constraints) is where many current benchmarks become too easy or unrealistic, and where **industry**-relevant **impact** is actually decided.
+### How this lines up with long-horizon and self-improvement themes
+**Theme: (super) long-horizon planning and instruction following.** Missions are **long-running** by design: scenarios scale to **many** instructions and checkpoints, with **sparse and delayed** rewards from security and mission rubrics. The agent must **decompose** response goals, **track** state and playbook progress under partial visibility, and **recover** from early mistakes over **extended trajectories**—closer to durable planning than one-shot next responses.
+**Theme: self-improvement and adaptive curricula.** The **Red vs. Blue** loop is explicit **self-play** over a **defined** scenario family. **League** work (PFSP, PSRO, and mixed) plus round-based **GRPO** changes the **opponent mix** and pressure across training, so improvement is not fitting a static list of tasks but **recursive capability growth** driven by an **adaptive curriculum** and interaction feedback on the same environment.
 ---
 r_B=-r_R-\lambda C_{\mathrm{collateral}}.
 $$
+In plain terms: if Red gains ground, Blue usually loses ground, and harmful side effects are also counted. This keeps the game honest and closer to what defenders see in **real-world** response and the trade-offs that show up in **industry** debriefs.
 ---
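
The coupling `r_B = -r_R - λ·C_collateral` in this hunk can be sketched in a few lines. This is a minimal illustration that assumes scalar per-step rewards; the function name `blue_reward`, the default `lam`, and the sample numbers are hypothetical and are not taken from the repository.

```python
def blue_reward(red_reward: float, collateral_cost: float, lam: float = 0.5) -> float:
    """Near-zero-sum coupling: r_B = -r_R - lam * C_collateral."""
    return -red_reward - lam * collateral_cost


# Hypothetical step: Red gains 2.0 and Blue's response causes 1.0 of collateral cost.
r_red = 2.0
r_blue = blue_reward(r_red, collateral_cost=1.0, lam=0.5)

# The two rewards sum to -lam * C, so the game is exactly zero-sum
# only when the collateral cost is zero.
total = r_red + r_blue
```

Because the collateral term is subtracted from Blue only, a defender that stops Red by causing heavy side effects still scores poorly under this form.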
    - reward rubrics,
    - metrics and progress tracking,
    - scenario definitions and tool interfaces.
+2. **API server**
    - OpenEnv endpoints for interaction.
+3. **Training pipelines**
+   - a single-policy path (SFT -> GRPO),
+   - and a league-based path (SFT -> rounds + mini-GRPO + PFSP/PSRO updates).
 Together, these parts create a full loop: simulate attack/defense interactions, score behavior with mission-aware rewards, then improve the policy using those outcomes.
+**System figures (open as links or view inline in [Colab, diagrams, and repository notebooks](#colab-diagrams-and-repository-notebooks)):** the [**environment architecture** diagram (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) shows how the environment, server, and training stack connect; the [**end-to-end training flow** (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) summarizes SFT, GRPO, and league training at a glance. Placing the links here matches how engineers skim a project: first the shape of the system, then the pipeline.
 ---
 ## Observations, actions, rewards
 ### Step 3: Add stabilization in single-policy GRPO
+In the single-policy training path, we introduced shaping aligned with this issue:
 - group-level diversity penalty when one tool dominates a batch,
 - additional nudge against overusing `execute_instruction` when SFT bias is high,
 ### Step 4: Move to league training for broader robustness
+Single-policy GRPO improved behavior, but robustness against varied attacker styles needed stronger pressure. We then moved to a league-based training loop:
 - run multiple league rounds,
 - pick Red archetypes using PFSP / PSRO / mix,
 ### Step 5: Turn logs into evidence, not just numbers
+We deliberately kept artifact generation rich (training curves, per-step logs, and combined league histories) so claims can be traced back to concrete run outputs. That makes debugging, comparison, and review much more grounded.
+---
+## Colab, diagrams, and repository notebooks
+The README documents the same training recipes with **public Colab** links and **static curve images** (repeated below under [Results and evidence](#results-and-evidence)). In the repo, the `notebook/` directory holds local copies aligned with each recipe.
+### Environment diagrams (from the README)
+These SVGs are the high-level system view and training pipeline, as in the [README `Environment Architecture` and `Training Flow` sections](README.md#environment-architecture). You can open each asset directly: [**architecture (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) · [**training flow (SVG link)**](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg).
+**Architecture**
+[Open architecture diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg)
+<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg" width="800" alt="CyberSelfPlay environment architecture" />
+**Training flow**
+[Open training flow diagram in new tab (SVG)](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg)
+<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg" width="800" alt="Training flow from SFT to GRPO and league" />
+### Colab notebooks and what each path does
+| Method | Open in Colab | Local notebook in `notebook/` | In short |
+|--------|----------------|-------------------------------|----------|
+| **SFT → GRPO (Vanilla)** | [Open in Colab](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | `SFT_→_GRPO_(Vanilla).ipynb` | Supervised fine-tuning on trajectory-style data, then **vanilla GRPO** with the environment reward only: the baseline for single-policy learning. |
+| **SFT → GRPO (Anti-Collapse)** | [Open in Colab](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | `SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb` | Same SFT + GRPO stack with **diversity / anti-collapse** regularization so the policy does not collapse to a tiny set of tool actions. |
+| **League (PFSP)** | [Open in Colab](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | `League(PFSP).ipynb` | **League** training with **Prioritized Fictitious Self-Play**: opponents are sampled with weights tied to matchups, so the defender faces a shifting mixture of Red styles. |
+| **League (PSRO)** | [Open in Colab](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | `League_(PSRO) (1).ipynb` | League loop using **PSRO-style** meta-updates on a population of policies (response oracles) rather than only PFSP sampling. |
+| **League (PFSP + PSRO)** | [Open in Colab](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | `League_(PFSP_+_PSRO).ipynb` | **Combined** path: PFSP for opponent (policy) choice plus PSRO-style weighting so sampling and meta-game updates run together. |
+The notebooks mirror the table in the [README `Training Approaches` section](README.md#-training-approaches-in-this-project); Colab is the shareable run surface, and the `notebook/` files are the offline copies in this repository.
 ---
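
The PFSP row above describes opponents "sampled with weights tied to matchups". Below is a minimal sketch of one common weighting, assuming the hard-opponent form `(1 - p) ** power` often used in prioritized fictitious self-play. The archetype names, win rates, and exponent are hypothetical, and the notebooks may weight opponents differently.

```python
import random


def pfsp_weights(win_rates: dict[str, float], power: float = 2.0) -> dict[str, float]:
    """Map Blue's win rate against each Red archetype to a sampling probability.

    Opponents that Blue rarely beats get more weight:
    w_i is proportional to (1 - p_i) ** power.
    """
    raw = {name: (1.0 - p) ** power for name, p in win_rates.items()}
    total = sum(raw.values())
    if total == 0.0:
        # Blue beats every opponent: fall back to uniform sampling.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: w / total for name, w in raw.items()}


def sample_opponent(win_rates: dict[str, float], rng: random.Random) -> str:
    """Draw one Red archetype according to the PFSP probabilities."""
    weights = pfsp_weights(win_rates)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical win rates for Blue against three Red archetypes.
win_rates = {"stealthy": 0.2, "aggressive": 0.5, "opportunistic": 0.8}
probs = pfsp_weights(win_rates)
opponent = sample_opponent(win_rates, random.Random(0))
```

Raising `power` concentrates sampling on the hardest opponents; `power = 0` recovers uniform sampling over the pool.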
 ## Results and evidence
+### Figures from training runs (same assets as the README)
+Below are the **SFT / GRPO / league** curve figures linked from the README’s training table, plus the **SFT training loss** plot referenced for this write-up. Together they are the main visual evidence for convergence and per-method behavior.
+**SFT training loss (cross-entropy on expert trajectories).** The run shows a clean optimization trajectory: loss starts around **3.2–3.3**, stays almost flat for the first few steps, then falls steeply from roughly step **5** through **25**. After that the curve flattens: from about step **30** onward training loss sits near **0.1** (steps on the x-axis go up to about **37**), which indicates that the SFT stage has found a low-NLL fit on the demonstration data.
+<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193812/image_8_s3tzys.png" width="700" alt="SFT training loss vs steps" />
+**SFT → GRPO (Vanilla).**
+<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="700" alt="SFT to GRPO Vanilla metrics" />
+**SFT → GRPO (Anti-Collapse).**
+<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="700" alt="SFT to GRPO with anti-collapse regularization" />
+**League (PFSP).**
+<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="700" alt="League PFSP training curves" />
+**League (PSRO).**
+<img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="700" alt="League PSRO training curves" />
+**League (PFSP + PSRO).**
+<img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="700" alt="League PFSP and PSRO combined" />
+### Interpretation in one pass
 Across runs, we observe the expected pattern:
 - Blue moves from imitation-only behavior (SFT) to stronger reward-aligned behavior after GRPO.
 By the final stage, improvements are not only in average reward but also in consistency across rounds and opponent profiles.
+Primary artifacts produced during training:
+- consolidated training curves
+- full optimization history logs
+- per-step reward traces
+- per-step behavior snapshots
+- league-specific multi-round trend and meta-state reports
 These files are the evidence trail for reward trends, variance, action diversity, and round-by-round league behavior.
 CyberSelfPlay matters because it evaluates what real defenders need:
+- multi-step, instruction-conditioned recovery,
 - adversarial interaction under uncertainty,
 - measurable progress beyond one-step task completion.
+For **industry** practitioners, it is closer to incident response realities and to how blue teams think about time-to-detect, containment, and recovery.
+For researchers, it offers a reproducible testbed for strategic, multi-step agent behavior, with a **novel** mix of instruction following, tools, and adversarial pressure in one environment.
 For teams building defensive copilots or autonomous responders, this kind of environment gives a safer place to test policy behavior before production deployment.
 For evaluation-focused work, it provides a bridge between toy tasks and operationally meaningful multi-step scenarios.
 ## Why this submission can stand out
+- It tackles a **real-world**-tilted setting that combines multi-step behavior, partial observability, adversarial play, and mission objectives in one benchmark, which is an unusual and **impact**-relevant target for the field.
+- It does not stop at one training recipe; it shows a full progression from baseline to stabilized training to league pressure, with clear **industry**-minded artifacts (curves, logs, league history).
+- It includes mathematical grounding, system-level structure, and diagram-level **novelty** in how the stack is presented (see [architecture](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181551/architecture_dus774.svg) and [training flow](https://res.cloudinary.com/dgyebzm4w/image/upload/v1777181108/training-flow_a3xupo.svg) links in [Core components](#core-components) and [diagrams](#environment-diagrams-from-the-readme)), plus artifact-level evidence in one coherent package.
+- The narrative from “initial approach -> failure mode -> fix -> stronger method” is explicit and reproducible, which is what teams need to trust deployment-related claims.
 ---
+## Where to go next
+- **Project README** (formal POSG, rewards, training math, and full method table): [README.md](README.md)
+- **Hugging Face Space (live environment):** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
+The README and this blog point to each other so you can move between the specification-style overview and the narrative plus figures here.

README.md CHANGED

@@ -1,5 +1,5 @@
----
-title: CyberSelfPlay (Long-Horizon Cyber POSG)
 emoji: 🛡️
 colorFrom: blue
 colorTo: red
@@ -10,17 +10,22 @@ pinned: true
 # CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
-CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for long-horizon cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
 ## Environment on Hugging Face Space
-- **Live Space:** `https://huggingface.co/spaces/HarshitShri026`
 ---
 ## Problem and Capability Gap
-Most agent benchmarks are short-horizon and single-agent. Cyber defense in practice is long-horizon, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
 ---
@@ -108,7 +113,7 @@ r_B &= v_1 \mathbb{1}_{\mathrm{detect}} + v_2 \mathbb{1}_{\mathrm{contain}} + v_
 \end{aligned}
 $$
-Concrete rubric implementation is in `cyber_selfplay_env/rubrics.py`.
 ---
@@ -137,9 +142,9 @@ We experiment across **SFT + GRPO baselines**, **reward smoothing**, **diversity
 | **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
 | **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
 | **🔹 League (Multi-Policy RL)** ||||
-| **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/1mDk9pzeRudjmXhU0VBVJymqF5An8bHhk?usp=sharing) | Win-rate curves |
-| **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188537/League_PSRO_wd3esy.png" width="350"/> |
-| **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/1OaOQYmoq2ni2FjCUukBkpt3BpT55uhX9?usp=sharing) | Meta + Reward curves |
 ---
@@ -239,7 +244,7 @@ Score each completion with reward $R^{(j)}$, compute group-relative advantages,
 ---
-## Long-Horizon Scenario Scale
 | scenario | turns | instructions | checkpoint stride |
 | --- | ---: | ---: | ---: |
@@ -255,20 +260,20 @@ Instruction progress and violation signals are tracked in environment metadata.
 Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
-Common result artifacts produced by the training scripts include:
-- `training_curves.png`
-- `log_history.json`
-- `train_metrics.log`
-- `per_step_rewards.jsonl`
-- per-step curve images under `curves/`
-- league-specific outputs such as `training_curves_all_rounds.png`, `league_state.jsonl`, and `log_history_combined.json`
 ---
 ## Why It Matters
-- **Security operations relevance:** models long-horizon defense decisions closer to real incident response.
 - **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
 - **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.

+---
+title: CyberSelfPlay (Cyber POSG)
 emoji: 🛡️
 colorFrom: blue
 colorTo: red
 # CyberSelfPlay: Autonomous Red-vs-Blue Cyber Defense Environment
+**Important links:** [League (PFSP + PSRO) — Colab (mixed)](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing)
+CyberSelfPlay is an OpenEnv-compatible reinforcement learning environment for cyber defense. The setting is a partially observable, stochastic Red-vs-Blue contest where Blue must execute enterprise recovery playbooks while Red applies adversarial pressure.
 ## Environment on Hugging Face Space
+- **Live Space:** [CyberSelfPlay on Hugging Face](https://huggingface.co/spaces/HarshitShri026)
+- **Narrative, Colab context, and results figures:** [Blogs](Blogs.md)
 ---
 ## Problem and Capability Gap
+Most agent benchmarks are short and single-agent. Cyber defense in practice is multi-step, partially observable, adversarial, and stochastic. CyberSelfPlay targets that gap by coupling multi-step mission execution with attacker-defender interaction and structured tool actions.
+**Connection to long-horizon and self-play themes:** the setting stresses **(super) long-horizon planning and instruction following**—episodes with many steps, many playbook instructions, and security rewards that are often sparse or delayed, so the agent must track state, recover from mis-steps, and keep coherent plans across long runs. It also supports **self-improvement through interaction**: the training stack uses **SFT → GRPO** and **league (PFSP / PSRO / mix)** to keep pressure adaptive—opponents and rounds change, so the LLM policy is not tuned on a static task set but on an evolving, self-play–style curriculum over the same family of tasks.
 ---
 \end{aligned}
 $$
+The reward rubric is implemented directly in the environment’s scoring logic.
 ---
 | **SFT → GRPO (Vanilla)** | Baseline using only environment reward | [Open](https://colab.research.google.com/drive/1K5771KT0-2lyU6eNghqQEStBS4OSF7D7?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777187892/SFT_GRPO_Vanilla_i88mbr.png" width="350"/>|
 | **SFT → GRPO (Anti-Collapse)** | Adds diversity penalty to avoid mode collapse | [Open](https://colab.research.google.com/drive/1HivyWte1q-sugE04XsyMi1U_RY1oGkJ8?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777188452/SFT_GRPO_Anti-Collapse_Regularization_fq3mgo.png" width="350"/> |
 | **🔹 League (Multi-Policy RL)** ||||
+| **League (PFSP)** | Prioritized Fictitious Self-Play for opponent sampling | [Open](https://colab.research.google.com/drive/1g2QCBqdvo7QwRC7dJaV8QdO7RvTPGyY1?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777194098/League_PFSP_vunfsn.png" width="350"/> |
+| **League (PSRO)** | Policy-Space Response Oracles (game-theoretic updates) | [Open](https://colab.research.google.com/drive/1O6IoE-_UloAeDXKve2ZA1W4OajychglP?usp=sharing) | <img src="https://res.cloudinary.com/dp1ejt3eb/image/upload/v1777193765/League_PSRO_ra89hw.png" width="350"/> |
+| **League (PFSP + PSRO)** | Combines adaptive sampling + meta-policy optimization | [Open](https://colab.research.google.com/drive/192y6Xf6uYjW0Z0yffBaKjtfVJGCT4b4S?usp=sharing) | <img src="https://res.cloudinary.com/dgyebzm4w/image/upload/v1777191934/League_PFSP_PSRO_kpbenx.png" width="350"/> |
 ---
 ---
+## Scenario Scale
 | scenario | turns | instructions | checkpoint stride |
 | --- | ---: | ---: | ---: |
 Across training runs, Blue policies generally move from imitation-only behavior (SFT) to stronger environment-aligned behavior after GRPO. In league mode, round-level opponent selection (PFSP / PSRO / mix) changes pressure distribution and produces distinct multi-round learning dynamics.
+Common result artifacts produced by training include:
+- consolidated training curves,
+- step-by-step optimization history,
+- metrics logs,
+- per-sample reward traces,
+- per-step visualization snapshots,
+- and, for league experiments, combined multi-round trend and meta-state reports.
 ---
 ## Why It Matters
+- **Security operations relevance:** models multi-step defense decisions closer to real incident response.
 - **Research relevance:** provides a reproducible adversarial benchmark for instruction-following under uncertainty.
 - **Evaluation relevance:** combines environment dynamics, tool-structured actions, and measurable outcomes.
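
The README hunk context above mentions scoring each completion with reward $R^{(j)}$ and computing group-relative advantages. A minimal sketch of that step, assuming the usual GRPO standardization within one sampled group; the function name, `eps`, and the sample rewards are hypothetical, and the notebooks may normalize differently.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center each reward on the group mean and scale by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one way a policy that always picks the same tool can stall.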

cyber_selfplay_env/simulator.py CHANGED

@@ -159,7 +159,7 @@ class CyberSimulator:
             pending = [x for x in mission["instructions"] if not x["done"]]
             if pending:
                 current = pending[0]
-                # Requires matching tool in params to model long-horizon instruction following.
                 requested_tool = ""
                 if params and isinstance(params.get("required_tool"), str):
                     requested_tool = params["required_tool"]

             pending = [x for x in mission["instructions"] if not x["done"]]
             if pending:
                 current = pending[0]
+                # Requires matching tool in params to model multi-step instruction following.
                 requested_tool = ""
                 if params and isinstance(params.get("required_tool"), str):
                     requested_tool = params["required_tool"]
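
The hunk shows only how `requested_tool` is read from `params`; the comparison that follows is outside the diff. A self-contained sketch of how such a match could work is below. It assumes each instruction stores its expected tool under a `required_tool` key, and the helper name and tool names are hypothetical rather than taken from `simulator.py`.

```python
from typing import Any, Optional


def requested_tool_matches(mission: dict[str, Any], params: Optional[dict[str, Any]]) -> bool:
    """Return True when params names the tool expected by the first pending instruction."""
    pending = [x for x in mission["instructions"] if not x["done"]]
    if not pending:
        return False
    current = pending[0]

    # Same extraction as in the hunk: ignore missing or non-string values.
    requested_tool = ""
    if params and isinstance(params.get("required_tool"), str):
        requested_tool = params["required_tool"]

    # Assumption: the instruction keeps its expected tool under "required_tool".
    return requested_tool != "" and requested_tool == current.get("required_tool")


# Hypothetical mission with one completed and one pending instruction.
mission = {
    "instructions": [
        {"done": True, "required_tool": "scan_host"},
        {"done": False, "required_tool": "isolate_host"},
    ]
}
```

Only the first pending instruction is checked, so instructions must be completed in order, which matches the "ordered mission instructions" wording in `openenv.yaml`.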

notebook/League(PFSP).ipynb ADDED

The diff for this file is too large to render. See raw diff

notebook/League_(PFSP_+_PSRO).ipynb ADDED

The diff for this file is too large to render. See raw diff

notebook/League_(PSRO) (1).ipynb ADDED

The diff for this file is too large to render. See raw diff

notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb ADDED

The diff for this file is too large to render. See raw diff

notebook/SFT_→_GRPO_(Vanilla).ipynb ADDED

The diff for this file is too large to render. See raw diff

openenv.yaml CHANGED

@@ -1,16 +1,139 @@
 env:
   name: "CyberSelfPlay"
-  author: "Hackathon Team"
-  description: "Long-horizon red-vs-blue cyber POSG with adaptive self-play curriculum and delayed instruction-following rewards."
   version: "0.1.0"
 server:
   host: "0.0.0.0"
   port: 7870
   workers: 1
   module: "server.app:app"
 features:
   multi_reward: true
   prevent_hacking: true
   curriculum_scheduler: true

 env:
   name: "CyberSelfPlay"
+  author: "Team Neuron"
+  description: "Long-horizon Red-vs-Blue cyber defense POSG with partial observability, stochastic transitions, mission-instruction progress signals, and league-style opponent pressure for robust policy learning."
   version: "0.1.0"
+  homepage: "https://huggingface.co/spaces/HarshitShri026"
+  domain: "cyber-defense"
+  tags:
+    - "openenv"
+    - "cybersecurity"
+    - "red-vs-blue"
+    - "multi-agent"
+    - "multi-step"
+    - "partially-observable"
+    - "instruction-following"
+    - "reinforcement-learning"
+    - "long-horizon"
+    - "self-play"
+    - "adaptive-curriculum"
+  # Aligns with program themes: (1) long-horizon planning & instruction following;
+  # (2) self-improvement via self-play and adaptive opponent pressure.
+  program_themes:
+    long_horizon_planning_and_instruction_following: >
+      Episodes and scenarios scale to many steps and many playbook instructions, with
+      sparse and delayed security and mission rewards. Agents must decompose response
+      goals, track partial state and instruction progress, and maintain coherent
+      behavior across long trajectories (beyond one-shot or shallow next-step reasoning).
+    self_improvement_and_adaptive_curricula: >
+      Red versus Blue interaction provides explicit self-play over a defined family of
+      cyber-defense tasks. SFT, GRPO, and league training (PFSP, PSRO, and mixed
+      meta-scheduling) vary opponent mix and round pressure, yielding adaptive-curriculum
+      style learning and recursive policy improvement on the same environment interface.
+  task_type: "sequential_decision_making"
+  horizon:
+    min_steps: 60
+    max_steps: 180
+  scenarios:
+    - name: "small"
+      turns: 60
+      instructions: 40
+      checkpoint_stride: 8
+    - name: "medium"
+      turns: 100
+      instructions: 120
+      checkpoint_stride: 12
+    - name: "large"
+      turns: 180
+      instructions: 300
+      checkpoint_stride: 20
+  agents:
+    red:
+      role: "attacker"
+      objective: "maximize foothold/privilege/lateral movement/exfiltration while avoiding detection"
+    blue:
+      role: "defender"
+      objective: "detect/contain/recover while completing ordered mission instructions"
+  observation_space:
+    red: "partial observability over attack-relevant state and outcomes"
+    blue: "partial observability over defense state, mission context, and progress metadata"
+  action_space:
+    red: "structured cyber actions for adversarial operations"
+    blue: "structured CyberAction JSON tool calls"
+  reward_model:
+    type: "multi-component"
+    notes:
+      - "dense + delayed terms"
+      - "instruction progress/checkpoint/violation shaping"
+      - "near-zero-sum coupling with collateral cost term"
+  references:
+    project_overview: "Main project overview and environment description"
+    technical_blog: "Narrative write-up with math, training journey, and results"
+    environment_components: "Simulator, rubrics, metrics, scenarios, and tool interfaces"
+    training_process: "For full training process details, refer to README.md"
+    notebooks:
+      - "notebook/SFT_→_GRPO_(Vanilla).ipynb"
+      - "notebook/SFT_→_GRPO_(Anti_Collapse_Regularization).ipynb"
+      - "notebook/League(PFSP).ipynb"
+      - 'notebook/League_(PSRO) (1).ipynb'
+      - "notebook/League_(PFSP_+_PSRO).ipynb"
+    training_paths:
+      - "Single-policy SFT to GRPO refinement"
+      - "League-based SFT to round-wise GRPO with PFSP/PSRO scheduling"
 server:
   host: "0.0.0.0"
   port: 7870
   workers: 1
   module: "server.app:app"
+  routes_hint:
+    - "/health"
+    - "/info"
+    - "/artifacts"
+  api_style: "OpenEnv-compatible FastAPI service"
 features:
   multi_reward: true
   prevent_hacking: true
   curriculum_scheduler: true
+  partial_observability: true
+  stochastic_dynamics: true
+  multi_agent: true
+  instruction_tracking: true
+  adversarial_interaction: true
+  league_training_support: true
+  pfsp_support: true
+  psro_support: true
+training:
+  primary_pipelines:
+    - name: "sft_grpo"
+      implementation: "single-policy training path"
+      summary: "SFT warm start followed by single-policy GRPO refinement"
+    - name: "sft_league_grpo"
+      implementation: "league-based training path"
+      summary: "SFT + league rounds with PFSP/PSRO/mix opponent scheduling and mini-GRPO updates"
+  artifacts:
+    common:
+      - "training curves"
+      - "optimization history logs"
+      - "metrics logs"
+      - "per-sample reward traces"
+      - "per-step visualizations"
+    league:
+      - "combined multi-round trend curves"
+      - "league state trajectory logs"
+      - "combined round history logs"
+evaluation:
+  built_in_metrics:
+    - "instruction_progress_rate"
+    - "instruction_violation_rate"
+    - "mttd"
+    - "mttr"
+    - "exfiltration_pressure"
+    - "checkpoint_progress"
+  success_characterization:
+    - "improved environment-aligned Blue reward after SFT->GRPO"
+    - "stable action diversity under anti-collapse shaping"
+    - "robustness gains under league opponent variation"

pyproject.toml CHANGED

@@ -5,7 +5,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "openenv-cyber-selfplay"
 version = "0.1.0"
-description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2: long-horizon reasoning)."
 readme = "README.md"
 requires-python = ">=3.10"
 license = { text = "MIT" }

 [project]
 name = "openenv-cyber-selfplay"
 version = "0.1.0"
+description = "Cyber defense red-vs-blue self-play environment for OpenEnv (Theme 4: self-improvement, Theme 2: multi-step reasoning)."
 readme = "README.md"
 requires-python = ">=3.10"
 license = { text = "MIT" }