---
title: CyberSOC Upgraded RLVR
emoji: 🛡️
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
---
# CyberSOC: Complete Project Review
> **New — Hotseat Multiplayer Mode**: A human can now play the Red Team live from the dashboard. After every Blue action the UI pauses, enables the **🔴 Red Team Toolkit**, shows a `BLUE / RED TEAM TURN` indicator in the header, and resumes Blue auto-play only after the Red player submits their move. Backend now runs in FSP (Fictitious Self-Play) mode by default.
> **New — 4 Architectural Upgrades**: Lazy Adversary (stall punishment + 15 % autonomous pivot), Rigid Bureaucracy emergency-isolation gate (UNJUSTIFIED\_EMERGENCY penalty), Siloed Intelligence external intel feed injection, and Depressed Analyst doomsday-clock direct modifier + negligence penalty.
## 1. Project Overview — RLVR Positioning
CyberSOC (CyberSOCEnv) is an **RLVR-stage reinforcement learning environment** that sits at the final rung of the model-maturation arc: **Random Init → Pretraining → SFT/IFT → Preference FT → RLVR**. It does not pretrain, supervise, or preference-align. Instead, it consumes a base model that has already been through those stages and turns its agentic actions into a dense, verifiable, 10-dimensional reward signal that GRPO can train on.
The environment is therefore an **RL-only artifact**, and its Easy tier is designed around **Daniel's "Law of RL"**: *The base model must get non-zero reward on Easy before it can meaningfully learn Hard.*
**Built for**: The OpenEnv Hackathon (Meta Platforms)
**Framework**: OpenEnv (Meta's RL environment framework)
**License**: BSD-style (Meta Platforms, Inc.)
### Guide Alignment Summary
| Guide Section | Requirement | Our Implementation |
|---|---|---|
| §1 | Step-by-step, programmatic verification, hard-but-possible | 10 typed actions, 10-dim deterministic grader, Easy→Hard curriculum |
| §4 | Design env before trainer | Env designed first; reset/step/state as first-class artifacts |
| §6 | Keep task simple at first | 1000+ scenarios across 3 difficulty tiers enable curriculum learning |
| §7 | Multiple independent reward functions | 10 dimensions consumed as `reward_funcs=[...]` by GRPOTrainer |
| §8 | Protect against reward hacking | 8 distinct defenses mapped to guide's attack vectors |
| §10 | Right training stack | Unsloth (QLoRA) + TRL (GRPO) + OpenEnv (transport) |
| §11 | Prefer GRPO/RLVR | RLVR throughout; every reward is deterministic code (zero LLM-as-judge) |
| §12 | Keep inference fast | Graph-delta injection + sparse nodes = rollout-latency optimizations |
| §14 | Scale only after stable | All 9 components passed integration before any GRPO rollout |
---
## 2. Core Idea & Innovation
### The Problem
Traditional cybersecurity training environments use static puzzles with fixed answers. Real SOC work requires dynamic reasoning under time pressure with incomplete information.
### The Solution
CyberSOC creates a **fully dynamic, deterministic SOC simulation** with:
1. **Procedural Scenario Generation** — 1,003 unique attack scenarios (3 curated + 1,000 generated) from seed-based deterministic generation. Same seed = same scenario, enabling reproducible RL training.
2. **13 Threat Categories** — Ransomware, Phishing, Credential Theft, Lateral Movement, C2 Communication, Privilege Escalation, Data Exfiltration, Cryptomining, Supply Chain, Insider Threat, Webshell, Botnet, Malware.
3. **Adaptive Red Team** — An adversary that reacts to agent actions: if you isolate a host, the attacker may pivot laterally. If you kill a process without blocking IOCs, it may reinfect with a `_v2` variant.
4. **10-Dimensional Grading** — Not a binary pass/fail. Agents are scored across 10 weighted dimensions for nuanced RL credit assignment. **Zero LLM-as-a-judge.**
5. **Business Continuity Constraints** — Rash actions (isolating clean subnets, killing legitimate processes) cause business downtime penalties.
6. **TRL GRPO Integration** — 10 reward functions that plug directly into Hugging Face's TRL `GRPOTrainer` for RL fine-tuning.
---
## 3. Architecture
```
MetaRound2/
├── models.py  # Pydantic data models (Observation, Action, State)
├── client.py  # WebSocket client for agent interaction
├── __init__.py  # Package exports
├── inference.py  # LLM baseline inference script
├── dashboard_server.py  # Dashboard + API server launcher
├── pyproject.toml  # Python package config
├── Dockerfile  # HuggingFace Spaces deployment
├── openenv.yaml  # 1003 task manifest
├── validate_submission.sh  # Hackathon submission validator
├── server/  # Backend environment engine
│   ├── app.py  # FastAPI application entry point
│   ├── play_environment.py  # Core environment (1284 lines)
│   ├── tasks.py  # Hand-crafted task definitions (easy/medium/hard)
│   ├── task_generator.py  # Procedural generation engine (1000+ tasks)
│   ├── graders.py  # 10-dimensional grading system
│   ├── threat_graph.py  # Typed knowledge graph
│   ├── soar_playbooks.py  # 5 SOAR playbook definitions
│   ├── action_validation.py  # 3-gate action validation middleware
│   ├── tool_router.py  # Phase state machine + triage solver
│   ├── episode_sandbox.py  # Wall-clock + step-limit guard
│   ├── visualize_graph.py  # PNG graph renderer (matplotlib/networkx)
│   └── Dockerfile  # Multi-stage Docker build
├── training/  # RL training integration
│   └── reward_funcs.py  # 10 TRL GRPO reward functions
├── dashboard/  # Real-time web dashboard
│   ├── index.html  # Main HTML (6 panels)
│   ├── css/styles.css  # Dark theme CSS (25KB)
│   └── js/
│       ├── app.js  # Main dashboard logic (45KB)
│       ├── graphs.js  # D3.js threat graph + Chart.js (31KB)
│       ├── api.js  # REST API client
│       └── animations.js  # Micro-animations & effects
└── tests/  # 10 test files + integration suite
    ├── test_integration.py
    └── test_task1.py ... test_task9.py
```
---
## 4. Backend (Server)
### 4.1 Core Environment — `play_environment.py`
The heart of the project. `CyberSOCEnvironment` extends OpenEnv's `Environment` interface.
**Key features:**
- **`reset(task_id)`** — Builds the network, injects attack chains, initializes alert queue, seeds the ThreatGraph
- **`step(action)`** — Processes one agent action, computes rewards, updates state, triggers adaptive adversary
- **Concurrent sessions** — Each WebSocket connection gets its own environment instance
- **ActionMiddleware** — Pre-flight validation (phase violations, graph-groundedness) before consuming a step
**10 Agent Actions:**
| # | Action | Purpose | Reward Range |
|---|--------|---------|-------------|
| 1 | `query_host` | Map architecture, get endpoint info | -0.05 to +0.05 |
| 2 | `run_forensics` | Deep system artifact extraction | -0.02 to +0.10 |
| 3 | `kill_process` | Terminate malicious execution | -0.08 to +0.25 |
| 4 | `block_ioc` | Blacklist IOCs network-wide | -0.03 to +0.15 |
| 5 | `isolate_segment` | Quarantine subnet or host | -0.10 to +0.15 |
| 6 | `correlate_alerts` | Find shared entities across alerts | ±0.05 |
| 7 | `enrich_ioc` | Threat-intel enrichment (actor, TTPs) | ±0.05 |
| 8 | `scan_host_vulnerabilities` | Discover CVEs on a host | ±0.05 |
| 9 | `trigger_playbook` | Execute SOAR automated response | ±0.10 |
| 10 | `submit_containment_plan` | Final report — ends episode | 0.0 to 1.0 |
### 4.2 Data Models — `models.py`
All data flows through strict **Pydantic models** (429 lines):
- **Enums**: `Severity`, `ThreatType` (13 types), `HostStatus`, `SubnetRole` (6 roles)
- **Sub-models**: `Alert`, `HostInfo`, `NetworkTopology`, `ForensicsResult`, `TimelineEntry`
- **`SOCObservation`** (extends OpenEnv `Observation`): 20+ fields including `alert_queue`, `network_topology`, `host_forensics`, `threat_graph_summary`, `reward_dimensions`, `available_playbooks`
- **Actions**: Discriminated union of 10 action types via `SOCActionWrapper`
- **`SOCState`** (internal): Tracks all episode state — killed processes, blocked IOCs, isolated subnets, etc.
### 4.3 Task Definitions — `tasks.py`
Three hand-crafted benchmark scenarios:
| Task | Threats | Hosts | Max Steps | Description |
|------|---------|-------|-----------|-------------|
| **Easy** | 1 | 1 | 15 | Single ransomware on WS-042 |
| **Medium** | 3 | 4 | 25 | Phishing → credential theft → lateral movement across 3 subnets |
| **Hard** | 5 | 7 | 30 | Full APT: phishing → C2 → privesc → exfil → ransomware |
**Network**: ~75 active hosts across 6 subnets (corporate, engineering, finance, DMZ, datacenter, executive) with realistic processes, ports, and criticality scores.
### 4.4 Procedural Task Generator — `task_generator.py`
Generates **1,000+ unique deterministic scenarios** from a seed:
- A SHA-256 hash of `task_id` seeds a deterministic `random.Random` instance, which drives ALL choices
- **Template pools**: 90+ malware process names, 40 C2 domains, 36 C2 IPs, 12 ransomware extensions, 12 data types
- **3 difficulty tiers**: Easy (1 threat), Medium (2-3 threats, multi-stage chains), Hard (3-6 threats, APT campaigns)
- **Alert generation**: Templated descriptions with randomized details (timestamps, file counts, data sizes)
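The seeding idea above can be sketched as follows. This is a minimal illustration, not the code in `task_generator.py`: the helper names `rng_for_task` and `sample_scenario` and the tiny template pools are hypothetical. It uses SHA-256 (as the innovations summary states) because Python's built-in `hash()` on strings is salted per process and would not reproduce across runs.

```python
import hashlib
import random


def rng_for_task(task_id: str) -> random.Random:
    """Derive a reproducible RNG from a task id (hypothetical helper)."""
    # SHA-256 is stable across processes, unlike the built-in hash() on
    # strings, which is salted per interpreter run by default.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


def sample_scenario(task_id: str) -> dict:
    """Draw a toy scenario; the pools here are illustrative placeholders."""
    rng = rng_for_task(task_id)
    return {
        "threat": rng.choice(["ransomware", "phishing", "c2_communication"]),
        "file_count": rng.randint(10, 500),
    }


# The same task id always yields the same scenario.
print(sample_scenario("gen_0001") == sample_scenario("gen_0001"))
```

Because every random draw goes through the one task-scoped RNG, replaying a task id reproduces the scenario exactly, which is what makes rollouts comparable across GRPO groups.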
### 4.5 Grading System — `graders.py`
**10-dimensional weighted grading:**
| Dimension | Weight | What It Measures |
|-----------|--------|-----------------|
| `threat_containment` | 0.20 | Fraction of required process kills completed |
| `ioc_blocking` | 0.12 | Fraction of known IOCs blocked (penalizes blind blocking) |
| `forensic_investigation` | 0.10 | Compromised hosts examined |
| `siem_correlation` | 0.08 | Whether alerts were correlated (bonus for early correlation) |
| `threat_intel_usage` | 0.08 | IOCs enriched with threat intel |
| `vuln_root_cause` | 0.08 | CVE root causes discovered (bonus if cited in plan) |
| `business_impact` | 0.10 | Penalizes unnecessary isolation and over-isolation (>20% = -0.30) |
| `step_efficiency` | 0.07 | Rewards SOAR playbook usage, penalizes step overrun |
| `plan_coverage` | 0.10 | Threats addressed in final plan |
| `plan_evidence_quality` | 0.07 | Evidence confidence from ThreatGraph |
**Anti-gaming**: Per-occurrence penalty cap (±0.15), blind-blocking penalties, normalized evidence confidence.
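As a rough illustration of how the weights in the table combine, the sketch below computes a weighted sum over the ten dimensions. The weights are taken from the table; the function name `weighted_score` and the clamping of each dimension to [0, 1] are assumptions, and the real grader in `graders.py` also applies penalties and bonuses that are not modeled here.

```python
WEIGHTS = {
    "threat_containment": 0.20,
    "ioc_blocking": 0.12,
    "forensic_investigation": 0.10,
    "siem_correlation": 0.08,
    "threat_intel_usage": 0.08,
    "vuln_root_cause": 0.08,
    "business_impact": 0.10,
    "step_efficiency": 0.07,
    "plan_coverage": 0.10,
    "plan_evidence_quality": 0.07,
}


def weighted_score(breakdown: dict) -> float:
    """Combine per-dimension scores into one weighted total (sketch only)."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing dimensions count as 0; values are clamped to [0, 1].
        value = min(1.0, max(0.0, breakdown.get(name, 0.0)))
        total += weight * value
    return total


# The weights sum to 1, so a perfect breakdown scores 1.0 (up to rounding).
print(round(weighted_score({name: 1.0 for name in WEIGHTS}), 6))
```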
### 4.6 Threat Graph — `threat_graph.py`
A **typed knowledge graph** tracking all SOC entities:
- **5 Node Types**: `HostNode`, `ProcessNode`, `IOCNode`, `VulnerabilityNode`, `AlertNode`
- **6 Edge Types**: `runs_on`, `involves`, `communicates_with`, `pivoted_from`, `part_of_chain`, `exploits`
- **200-node cap** with LRU IOC pruning
- **Version tracking** with changelog for delta queries
- **Evidence confidence** computation for plan quality scoring
- **Context summary** generation for LLM injection
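One way to picture the node cap with LRU IOC pruning is the toy store below. It is a sketch under assumptions: the class name `BoundedGraph`, its methods, and the rule that only IOC nodes are ever evicted are illustrative and may not match `threat_graph.py`.

```python
from collections import OrderedDict


class BoundedGraph:
    """Toy node store: caps total nodes, evicting least-recently-used IOCs."""

    def __init__(self, max_nodes=200):
        self.max_nodes = max_nodes
        self.nodes = {}                # node_id -> node type
        self._ioc_lru = OrderedDict()  # IOC ids, oldest first

    def touch(self, node_id):
        """Mark an IOC as recently used so it is pruned last."""
        if node_id in self._ioc_lru:
            self._ioc_lru.move_to_end(node_id)

    def add(self, node_id, node_type):
        if node_id in self.nodes:
            self.touch(node_id)
            return
        self.nodes[node_id] = node_type
        if node_type == "ioc":
            self._ioc_lru[node_id] = None
        # Over the cap: drop the stalest IOC nodes first; other node types
        # (hosts, processes, alerts) are never evicted in this sketch.
        while len(self.nodes) > self.max_nodes and self._ioc_lru:
            stale, _ = self._ioc_lru.popitem(last=False)
            del self.nodes[stale]
```

Restricting eviction to IOC nodes keeps hosts and processes, which carry the containment requirements, in the graph even during long episodes.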
### 4.7 SOAR Playbooks — `soar_playbooks.py`
5 automated response playbooks with prerequisite validation:
| Playbook | Prerequisites | Sub-Actions |
|----------|--------------|-------------|
| `ransomware_containment` | Forensics run, process identified | kill_process, block_ioc |
| `c2_disruption` | IOC enriched, C2 IP identified | block_ioc, isolate_segment |
| `lateral_movement_lockdown` | Forensics run, lateral movement detected | kill_process, isolate_segment |
| `phishing_response` | Phishing vector confirmed | enrich_ioc, block_ioc |
| `data_exfil_stop` | Forensics run, exfil destination identified | block_ioc, kill_process |
### 4.8 Action Validation — `action_validation.py`
**3-gate middleware:**
1. **Phase whitelist** — Actions restricted by phase (triage/investigation/remediation/report)
2. **Schema validation** — Required arguments checked
3. **Graph groundedness** — Actions must reference discovered entities (can't block an IOC you haven't seen)
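The three gates can be sketched as a single function that rejects at the first failing check. Everything named here is hypothetical: the whitelist contents, the argument names (`ioc_value`, `hostname`), the `Verdict` type, and the reason strings are placeholders, since the actual tables live in `action_validation.py`.

```python
from dataclasses import dataclass

# Hypothetical phase whitelist and required arguments, for illustration only.
PHASE_WHITELIST = {
    "triage": {"query_host", "correlate_alerts"},
    "investigation": {"query_host", "run_forensics", "enrich_ioc"},
    "remediation": {"kill_process", "block_ioc", "isolate_segment"},
    "report": {"submit_containment_plan"},
}
REQUIRED_ARGS = {"block_ioc": {"ioc_value"}, "query_host": {"hostname"}}


@dataclass
class Verdict:
    ok: bool
    reason: str = ""


def validate(action: dict, phase: str, known_entities: set) -> Verdict:
    kind = action.get("type", "")
    # Gate 1: is this action allowed in the current phase?
    if kind not in PHASE_WHITELIST.get(phase, set()):
        return Verdict(False, "phase_violation")
    # Gate 2: are all required arguments present?
    required = REQUIRED_ARGS.get(kind, set())
    if required - set(action):
        return Verdict(False, "schema_error")
    # Gate 3: has every referenced entity already been discovered?
    referenced = {action[name] for name in required}
    if not referenced <= known_entities:
        return Verdict(False, "not_grounded")
    return Verdict(True)
```

Ordering the gates from cheapest to most state-dependent means a malformed action is rejected before the graph is consulted.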
### 4.9 Tool Router — `tool_router.py`
**Deterministic phase state machine:**
- Phases: `triage` → `investigation` → `remediation` → `report` → `done`
- Loop limits: max 4 investigation loops, 3 remediation loops
- Supports **pushback** — agent can justify staying in a phase with graph references
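A minimal sketch of such a phase machine follows. The phase order and loop limits come from the bullets above; how a "loop" is counted and how pushback is represented (`stay` and `justified` flags) are assumptions made for illustration, not the behavior of `tool_router.py`.

```python
PHASES = ["triage", "investigation", "remediation", "report", "done"]
# Loop limits from the README; the counting rule below is an assumption.
MAX_LOOPS = {"investigation": 4, "remediation": 3}


class PhaseMachine:
    def __init__(self):
        self.phase = "triage"
        self.loops = {name: 0 for name in MAX_LOOPS}

    def advance(self, stay=False, justified=False):
        """Move to the next phase, or stay when the agent pushes back with a
        justification and the loop budget for this phase is not used up."""
        if self.phase == "done":
            return self.phase
        limit = MAX_LOOPS.get(self.phase, 0)
        if stay and justified and self.loops.get(self.phase, 0) < limit:
            self.loops[self.phase] += 1
            return self.phase
        # Otherwise the transition is forced, which keeps episodes finite.
        self.phase = PHASES[PHASES.index(self.phase) + 1]
        return self.phase
```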
**Triage Solver**: Priority = `severity_weight × criticality_weight × (1 + blast_radius/10)`
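The priority formula can be worked through in a few lines. The formula itself is from the line above; the numeric severity weights and the function name `triage_priority` are assumed values chosen only to make the arithmetic concrete.

```python
# Assumed severity weights; the real mapping may use different numbers.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}


def triage_priority(severity: str, criticality_weight: float, blast_radius: int) -> float:
    """severity_weight x criticality_weight x (1 + blast_radius / 10)."""
    return SEVERITY_WEIGHT[severity] * criticality_weight * (1 + blast_radius / 10)


# A critical alert on a fully critical host with 10 reachable neighbors:
# 4.0 * 1.0 * (1 + 10/10) = 8.0
print(triage_priority("critical", 1.0, 10))
```

The `(1 + blast_radius/10)` term means an isolated host is scored on severity and criticality alone, while each additional ten reachable hosts adds one more multiple of that base score.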
### 4.10 Episode Sandbox — `episode_sandbox.py`
**Safety guardrails:**
- **120-second wall-clock timeout** per episode
- **20-step hard limit** per episode
- **State integrity protection** — Protected fields (`_task_def`, `_live_requirements`, `_threat_graph`) are snapshot-hashed; mutations are detected and rolled back
- **Hacking detection** — Reports any external state tampering
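The snapshot-hash idea can be sketched as below. The protected field names are the ones listed above; the `StateGuard` class, the JSON-based fingerprint, and the deep-copy backup are assumptions for illustration and may differ from what `episode_sandbox.py` does.

```python
import copy
import hashlib
import json

PROTECTED = ("_task_def", "_live_requirements", "_threat_graph")


def _fingerprint(value) -> str:
    # Canonical JSON gives a stable byte string for plain dict/list data;
    # anything JSON cannot encode falls back to its repr().
    text = json.dumps(value, sort_keys=True, default=repr)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class StateGuard:
    """Snapshot protected attributes, then detect and undo later mutations."""

    def __init__(self, env, protected=PROTECTED):
        self.env = env
        self.protected = protected
        self._backup = {n: copy.deepcopy(getattr(env, n)) for n in protected}
        self._hashes = {n: _fingerprint(v) for n, v in self._backup.items()}

    def check_and_restore(self):
        """Return the names of tampered fields after rolling them back."""
        tampered = []
        for name in self.protected:
            if _fingerprint(getattr(self.env, name)) != self._hashes[name]:
                setattr(self.env, name, copy.deepcopy(self._backup[name]))
                tampered.append(name)
        return tampered
```

Running the check after every step would let the environment both report the tampering and continue the episode from the untampered state.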
### 4.11 Adaptive Red Team
Two mechanisms in `play_environment.py`:
1. **Reinfection** (`_maybe_reinfect`): 30% chance when killing a process if IOCs in the chain are unblocked → spawns `process_v2` variant + CRITICAL alert
2. **Lateral Pivot** (`_execute_lateral_pivot`): Triggered by isolate/kill actions on hard tasks → copies malware to adjacent healthy host, adds `pivoted_from` edge, emits PIVOT alert, updates live requirements
**Escalation**: Pivot probability rises by 0.2 when the agent is slow (past step 10 with zero containments).
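The probabilities involved can be sketched as two small functions. The 30% reinfection chance is from the list above, and the per-difficulty base pivot probabilities and the +0.2 time-pressure bump are the figures given in the Red Team design section. The function names, the cap at 1.0, and applying the bump to every difficulty are assumptions.

```python
import random

BASE_PIVOT = {"easy": 0.0, "medium": 0.3, "hard": 0.8}


def pivot_probability(difficulty: str, step: int, containments: int) -> float:
    p = BASE_PIVOT[difficulty]
    # Time pressure: +0.2 once the agent is past step 10 with nothing contained.
    if step > 10 and containments == 0:
        p += 0.2
    return min(p, 1.0)


def should_reinfect(rng: random.Random, chain_iocs: set, blocked: set) -> bool:
    # Reinfection is only possible while some IOC in the chain is unblocked.
    if chain_iocs <= blocked:
        return False
    return rng.random() < 0.30
```

Passing the episode-scoped RNG in explicitly, rather than calling the global `random` module, is what keeps these stochastic reactions reproducible for a given task.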
### 4.12 Server Application — `app.py`
FastAPI app created via OpenEnv's `create_app()`:
- **POST /reset** — Reset environment with task_id
- **POST /step** — Execute an action
- **GET /state** — Get current state
- **WS /ws** — WebSocket for persistent sessions
- CORS enabled for dashboard communication
- Supports 4 concurrent environment instances
---
## 5. Frontend (Dashboard)
### 5.1 Overview
A real-time **"CyberSOC Command Center"** web dashboard with 6 panels, built with vanilla HTML/CSS/JS + D3.js + Chart.js.
### 5.2 Six Dashboard Panels
1. **Alert Queue** — Live SIEM/EDR alerts with severity badges and IOC indicators
2. **Live Threat Graph** — D3.js force-directed graph with 5 node types, drag/zoom, glow effects, pivot animation
3. **Agent Actions** — Chronological action log with reward tracking
4. **Network Topology** — Visual subnet map with compromised/isolated counts
5. **Performance Metrics** — Chart.js radar chart (10 dimensions) + cumulative reward timeline
6. **Mission Status** — Containment progress bars, business impact gauge, active threat list, episode controls, **🔴 Red Team Toolkit** (hotseat multiplayer)
### 5.3 Visual Design
- **Dark theme** with glassmorphism panels
- **Typography**: Inter (UI) + JetBrains Mono (data)
- **Color system**: Accent colors for cyan, green, amber, red, purple
- **Animations**: Count-up numbers, scale bounces, pulse glows, screen flashes
- **Red Team pivot**: Screen border flash, toast notification, traveling dot animation on pivot edges
### 5.4 Key Frontend Components
**`graphs.js` (881 lines)**:
- `ClientThreatGraph` — Client-side graph state manager synced from observations
- `ThreatGraphViz` — D3.js v7 force simulation with SVG glow filters, curved edges, node symbols (circle/diamond/triangle/square/wye), click-to-highlight, drag behavior
- `RadarChart` — Chart.js 10-axis radar for live grading dimensions
- `RewardTimeline` — Gradient-filled cumulative reward line chart
**`app.js` (45KB)** — Main orchestrator handling episode lifecycle, API calls, UI updates, phase indicator tracking
**`api.js`** — REST client with auto-detection of server origin, session management, response parsing
**`animations.js`** — Utility library for count-up, screen flash, toast notifications, scale bounce, pulse glow, dramatic final score reveal
### 5.5 Dashboard Server — `dashboard_server.py`
Wraps the FastAPI app to also serve the dashboard as static files at `/dashboard/`. Prints a styled ASCII banner on startup.
---
## 6. Inference & Training
### 6.1 Inference Script — `inference.py`
LLM baseline agent using **OpenAI-compatible API**:
- System prompt defines SOC analyst role with all 6 core actions
- Formats observations into structured text for the LLM
- Parses JSON actions from LLM responses (with fallback extraction)
- Runs episodes across easy/medium/hard tasks
- Emits structured stdout logs: `[START]`, `[STEP]`, `[END]` (hackathon requirement)
- Default model: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace Router
### 6.2 GRPO Reward Functions — `training/reward_funcs.py`
10 TRL-compatible reward functions for **Group Relative Policy Optimization**:
```python
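from trl import GRPOConfig, GRPOTrainer
from training.reward_funcs import make_soc_reward_funcs
# Build the 10 reward callables against a running environment server.
reward_fns = make_soc_reward_funcs("http://localhost:8000")
# `model` and the GRPOConfig arguments are placeholders to fill in for your run.
trainer = GRPOTrainer(model=model, reward_funcs=reward_fns, args=GRPOConfig(...))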
```
Each function:
1. Parses completion as JSON action list
2. Replays actions against live environment server
3. Returns the specific dimension's score from `grade_breakdown`
4. Non-parseable completions return 0.0
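The four steps above can be sketched as a factory that builds one reward function per dimension. This is an illustration only: `parse_actions`, `make_dimension_reward`, and the injected `replay` callable (a stand-in for the call that runs actions against the environment server and returns `grade_breakdown`) are hypothetical names, and the sketch assumes completions arrive as plain strings.

```python
import json
from typing import Callable, Optional


def parse_actions(completion: str) -> Optional[list]:
    """Return the completion as a list of action dicts, or None if malformed."""
    try:
        data = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, list) and all(isinstance(a, dict) for a in data):
        return data
    return None


def make_dimension_reward(dimension: str, replay: Callable[[list], dict]):
    """Build a reward function that scores one grading dimension."""

    def reward(completions, **kwargs):
        scores = []
        for text in completions:
            actions = parse_actions(text)
            if actions is None:
                scores.append(0.0)  # non-parseable completions score zero
                continue
            breakdown = replay(actions)  # hypothetical environment round trip
            scores.append(float(breakdown.get(dimension, 0.0)))
        return scores

    return reward
```

Injecting `replay` keeps the parsing and scoring logic testable without a live server, for example by passing a stub that returns a fixed breakdown.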
### 6.3 Per-Step Reward Dimensions
The environment computes **live partial scores** every step (`_compute_reward_dimensions`) for GRPO credit assignment without waiting for the terminal grade. These are exposed in `SOCObservation.reward_dimensions`.
---
## 7. Testing
**10 test files** covering all major components:
| File | Focus |
|------|-------|
| `test_integration.py` | Full episode flows, phase violations, adaptive pivots, 10-dim grading, sandbox limits |
| `test_task1.py` - `test_task9.py` | Individual task-specific validations |
Key integration tests:
- Easy/medium episodes complete without crashes
- All 10 action types can be exercised in a single episode
- Phase violations return negative reward (not crash)
- Adaptive pivot fires on hard tasks
- Step rewards accumulate correctly and are idempotent
- Grader returns exactly 10 dimensions
- Sandbox step limit raises `EpisodeTimeout`
---
## 8. Deployment & DevOps
### Docker
- **Root Dockerfile** — Slim Python 3.10, serves on port 7860 (HuggingFace Spaces)
- **Server Dockerfile** — Multi-stage build from `ghcr.io/meta-pytorch/openenv-base`, uses `uv` for dependency management, health check on `/health`
### Validation
`validate_submission.sh` — 3-step validator:
1. Ping HF Space `/reset` endpoint
2. Docker build succeeds
3. `openenv validate` passes
### OpenEnv Manifest
`openenv.yaml` — 1,003 task definitions with descriptions, max steps, and difficulty tags. Used by the OpenEnv framework for task discovery and benchmarking.
---
## 9. Environment Variables
| Variable | Purpose | Default |
|----------|---------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace API key | — |
---
## 10. Data Flow
```mermaid
sequenceDiagram
participant Agent as LLM Agent
participant Inf as inference.py
participant Env as CyberSOCEnvironment
participant TG as ThreatGraph
participant Gr as Grader
Inf->>Env: reset(task_id="hard")
Env->>TG: populate from task_def
Env-->>Inf: SOCObservation (alerts, topology)
loop Each Step
Inf->>Agent: format_observation → LLM prompt
Agent-->>Inf: JSON action
Inf->>Env: step(SOCActionWrapper)
Env->>Env: ActionMiddleware.validate()
Env->>Env: Handle action (query/forensics/kill/etc)
Env->>TG: Update graph nodes/edges
Env->>Env: _adversary_react() (adaptive pivot)
Env->>Env: _compute_reward_dimensions()
Env-->>Inf: SOCObservation (updated state)
end
Inf->>Env: step(submit_containment_plan)
Env->>Gr: grade_episode(actions, plan, graph, task_def, state)
Gr-->>Env: {final_score, breakdown[10], penalties, bonuses}
Env-->>Inf: SOCObservation (done=true, final_score)
```
---
## 11. Red Team Design Philosophy
The Red Team is NOT a separate LLM agent. It is a **deterministic adversarial dynamics engine** that defines the environment's state transition function.
### 7 Behavioral Mechanisms
1. **Reactive Pivoting**: Triggers on `isolate_segment` and `kill_process` (copy-not-move spread)
2. **Persistence**: Reinfection triggers when a process is killed but its root IOC remains unblocked (teaches causal reasoning)
3. **Time Pressure**: Pivot probability escalates +0.2 after step 10 if zero containments are achieved
4. **Controlled Randomness**: Uses an episode-scoped `self._rng` (seeded by `task_id`) to ensure deterministic rollouts
5. **Noisy Observations**: Benign processes mixed in host data
6. **Escalation**: Pivot probabilities scale with difficulty (`Easy: 0.0`, `Medium: 0.3`, `Hard: 0.8`)
7. **Stall Punishment** *(new)*: If Blue makes 3+ consecutive passive actions (`query_host` / `pass_turn`) without containment, Red immediately deploys ransomware; plus a 15 % chance to spread laterally even on passive Blue turns
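The stall-punishment trigger in item 7 amounts to a streak counter. In the sketch below, the passive action names and the threshold of 3 come from the text; the `StallTracker` class and the rule that any non-passive action resets the streak are assumptions.

```python
PASSIVE_ACTIONS = {"query_host", "pass_turn"}
STALL_LIMIT = 3


class StallTracker:
    def __init__(self):
        self.streak = 0

    def observe(self, action_type: str) -> bool:
        """Record one Blue action; return True when Red should punish a stall."""
        if action_type in PASSIVE_ACTIONS:
            self.streak += 1
        else:
            # Assumption: any non-passive action resets the streak.
            self.streak = 0
        return self.streak >= STALL_LIMIT
```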
### Attack Lifecycle Model (MITRE-aligned)
`Phase 1: Compromise` → `Phase 2: Lateral Movement` → `Phase 3: Persistence` → `Phase 4: Escalation` → `Phase 5: Impact`
---
## 12. Reward-Hacking Defense Map
Per guide §8, we implemented specific defenses against the known RL exploit vectors:
| Guide Attack Vector | Our Defense |
|---|---|
| Editing timers | `EpisodeSandbox` wall-clock enforcement |
| Caching results | Idempotent step rewards via `_fired_step_rewards` |
| Abusing globals | Instance-scoped RNG + episode-scoped `self._rng` |
| Mutating protected state | Sandbox hash-snapshot + rollback |
| Exploiting env bugs | 3-gate validation middleware |
| Reward-function gaming | Evidence confidence normalization |
| Cheating via blind remediation | Graph-groundedness gate |
| Blind IOC blocking | Enrichment-before-block penalty |
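As an example of one defense in the table, idempotent step rewards can be sketched with a ledger keyed by action and target. The `_fired_step_rewards` name and the 0.40 cap appear in this README; the `StepRewardLedger` class, the `(action, target)` key, and applying the cap only to positive rewards are assumptions.

```python
class StepRewardLedger:
    """Pay each (action, target) reward once, with a cap on the running total."""

    def __init__(self, cap=0.40):
        self.cap = cap
        self.total = 0.0
        self._fired_step_rewards = set()

    def claim(self, action_type: str, target: str, reward: float) -> float:
        key = (action_type, target)
        if key in self._fired_step_rewards:
            return 0.0  # repeating the same action earns nothing more
        self._fired_step_rewards.add(key)
        # Only positive rewards count against the cap in this sketch.
        if reward > 0:
            reward = min(reward, max(0.0, self.cap - self.total))
            self.total += reward
        return reward
```

With this in place, an agent that repeats a rewarded action to farm score receives zero for every repeat, which removes the incentive to cache and replay.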
---
## 13. Curriculum Learning Strategy
The 1000+ deterministic scenarios generated by `task_generator.py` are explicitly divided into three difficulty tiers to support Curriculum Learning (Guide §6).
This exists precisely to satisfy **Daniel's Law of RL**: *The base model must get non-zero reward on Easy before it can meaningfully learn Hard.*
- **Phase 1 (Warm-Start)**: `gen_0001` to `gen_0333` (Easy). Single threat, 15 max steps, 0.0 pivot probability.
- **Phase 2 (Scaling)**: `gen_0334` to `gen_0666` (Medium). Multi-stage, 25 max steps, 0.3 pivot probability.
- **Phase 3 (Stress-Test)**: `gen_0667` to `gen_1000` (Hard). APT, 30 max steps, 0.8 pivot probability.
The adaptive pivot probability is itself a curriculum signal; the environment gets harder as the agent gets better.
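The tier boundaries above can be expressed as a small lookup, which is handy when building a curriculum sampler. The ranges, step limits, and pivot probabilities are from the list; the function name `tier_for_task` and the returned dictionary shape are hypothetical.

```python
def tier_for_task(task_id: str) -> dict:
    """Map a generated task id such as 'gen_0420' to its curriculum tier."""
    index = int(task_id.removeprefix("gen_"))
    if not 1 <= index <= 1000:
        raise ValueError(f"unknown generated task: {task_id}")
    if index <= 333:
        return {"tier": "easy", "max_steps": 15, "pivot_probability": 0.0}
    if index <= 666:
        return {"tier": "medium", "max_steps": 25, "pivot_probability": 0.3}
    return {"tier": "hard", "max_steps": 30, "pivot_probability": 0.8}


print(tier_for_task("gen_0420")["tier"])  # medium
```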
---
## 14. Intended Training Stack
CyberSOCEnv is designed for the canonical stack specified in Guide §10:
1. **Unsloth**: 4-bit QLoRA loading and efficient inference
2. **TRL**: `GRPOTrainer` consuming our 10 independent callable functions via `reward_funcs=[...]`
3. **OpenEnv**: WebSocket transport and session isolation
4. **vLLM**: Serving the rollout workers for maximum throughput
A reference adapter module exists at `training/reward_funcs.py` that mirrors the Unsloth 2048 notebook structure 1:1, allowing plug-and-play GRPO training.
---
## 15. Anti-Patterns Avoided
How we avoided the 7 common mistakes listed in Guide §21:
1. **Building before designing env**: Env, types, and sandbox were built and tested completely offline before any trainer was attached.
2. **LLM-as-a-judge**: CyberSOCEnv uses zero LLM-as-judge signals. Everything is deterministic code against the ThreatGraph.
3. **Single monolithic reward**: We use a 10-dimensional verifiable rubric, fed independently into TRL.
4. **Ignoring inference latency**: We implemented Graph Delta Injection (~10x fewer tokens) and a sparse-node generation strategy (~75 active nodes) specifically to optimize GRPO rollout latency.
5. **No abuse prevention**: 3-gate middleware + EpisodeSandbox explicitly prevent out-of-band cheating.
6. **Delayed deployment**: Environment was packaged with Docker and deployed to HF Spaces early.
7. **Scaling prematurely**: All 9 components passed integration testing (`test_integration.py` through `test_task9.py`) before scaling to 1000 tasks.
---
## 16. Known Deviations & Alignment Items
While we strive to match the OpenEnv canonical scaffolding (Guide §5), there are a few intentional architectural differences:
1. **Action Dispatch**: We use a discriminated union wrapper (`SOCActionWrapper` with a `type` field) rather than a single flat action class. This matches the MCP ToolCall pattern and real SOC work better than a flat action space.
2. **Decoupled Engine**: The core logic lives in `server/play_environment.py`, completely separate from the FastAPI transport layer in `server/app.py`. This ensures we can run headless parallel environments during GRPO without HTTP overhead if needed.
---
## 17. Team Structure & Role Split
Per guide §17, responsibilities are split across three functional roles to execute the RL pipeline effectively.
### Role 1: Environment Engineer
**Mission**: Build a deterministic, unhackable, fast environment.
**Owns**: `play_environment.py`, `tasks.py`, `threat_graph.py`, `episode_sandbox.py`, `action_validation.py`, `tests/`
- **Scope**: Implements the state machine, Red Team behavior, and validates actions. Owns the core `step()` and `reset()` loops. Ensures the environment parses valid inputs and securely handles invalid ones.
- **Hackathon Focus**: Bug fixes, latency optimization (graph deltas), sandbox integrity, and procedural scenario generation.
### Role 2: Reward Engineer
**Mission**: Design the mathematical signals that shape model behavior.
**Owns**: `graders.py`, `training/reward_funcs.py`, `models.py`
- **Scope**: Creates the 10-dimensional verifiable grading logic. Plumbs the environment outputs into TRL-compatible `reward_funcs`. Tunes the penalties to prevent reward hacking (e.g., punishing blind IOC blocking).
- **Hackathon Focus**: Ensuring the model gets positive step-rewards early on to prevent it from collapsing, while preventing it from finding "lazy" exploits.
### Role 3: Training Engineer
**Mission**: Execute the GRPO curriculum and produce the final model.
**Owns**: `training/` directory, Colab notebooks, `inference.py`
- **Scope**: Sets up the actual training loops using Unsloth and TRL. Manages the hyperparameter tuning, LoRA checkpointing, and vLLM inference configuration. Runs the curriculum from Easy to Hard.
- **Hackathon Focus**: Capturing the before/after learning curves on held-out tasks to prove to the judges that the environment actually works to train a model.
---
## 18. Key Innovations Summary
| Innovation | Description |
|-----------|-------------|
| **Procedural Generation** | SHA-256 seeded RNG generates 1000+ unique deterministic scenarios |
| **ThreatGraph** | Typed knowledge graph with version tracking, evidence confidence, and LRU pruning |
| **10-Dim Grading** | Weighted multi-dimensional scoring replacing binary pass/fail |
| **Adaptive Red Team** | Attacker reacts to defender actions — lateral pivots and reinfection |
| **SOAR Playbooks** | Prerequisite-gated automated response workflows |
| **3-Gate Validation** | Phase whitelist + schema + graph-groundedness prevents invalid actions |
| **Episode Sandbox** | State integrity protection with hash-based tampering detection |
| **Live GRPO Signals** | Per-step reward dimensions for RL credit assignment |
| **Anti-Gaming** | Blind-blocking penalties, over-isolation cap, idempotent step rewards (0.40 cap) |
| **Real-time Dashboard** | D3.js threat graph with pivot animations and 10-dim radar chart |
| **Hotseat Multiplayer** | Human Red Team player via in-dashboard toolkit; FSP backend; per-turn UI lock |
| **Stall Punishment** | 3 consecutive passive Blue actions triggers ransomware deploy + 15 % autonomous pivot |
| **Emergency Gate** | `isolate_segment` in triage requires a critical alert; `UNJUSTIFIED_EMERGENCY` → -0.15 penalty |
| **External Intel Feed** | `task_def.external_intel_feed` IOCs injected at reset — immediately blockable/enrichable |
| **Doomsday Clock** | `state.business_impact × 0.30` applied as a direct score modifier; 90 % negligence crush when no threats contained |
---
## 19. Technology Stack
| Layer | Technologies |
|-------|-------------|
| **Backend** | Python 3.10+, FastAPI, Uvicorn, Pydantic v2, OpenEnv Core |
| **Frontend** | Vanilla HTML/CSS/JS, D3.js v7, Chart.js v4, Inter/JetBrains Mono fonts |
| **Inference** | OpenAI Python SDK, asyncio |
| **Training** | TRL (Hugging Face), GRPO |
| **DevOps** | Docker (multi-stage), uv package manager, pytest |
| **Deployment** | HuggingFace Spaces (Docker SDK) |
| **Visualization** | NetworkX + Matplotlib (server-side PNG), D3.js (client-side interactive) |