Spaces:

Ajay00747
/

Demo

Sleeping

App Files Files Community

Demo / README.md

Ajayyy00

Update README/Space card for hotseat multiplayer and 4 architectural upgrades

4217376 about 1 month ago

preview code

raw

history blame contribute delete

29 kB

	---
	title: CyberSOC Upgraded RLVR
	emoji: 🛡️
	colorFrom: red
	colorTo: purple
	sdk: docker
	pinned: false
	---

	# CyberSOC: Complete Project Review

	> New — Hotseat Multiplayer Mode: A human can now play the Red Team live from the dashboard. After every Blue action the UI pauses, enables the 🔴 Red Team Toolkit, shows a `BLUE / RED TEAM TURN` indicator in the header, and resumes Blue auto-play only after the Red player submits their move. Backend now runs in FSP (Fictitious Self-Play) mode by default.

	> New — 4 Architectural Upgrades: Lazy Adversary (stall punishment + 15 % autonomous pivot), Rigid Bureaucracy emergency-isolation gate (UNJUSTIFIED\_EMERGENCY penalty), Siloed Intelligence external intel feed injection, and Depressed Analyst doomsday-clock direct modifier + negligence penalty.

	## 1. Project Overview — RLVR Positioning

	CyberSOC (CyberSOCEnv) is an RLVR-stage reinforcement learning environment that sits at the final rung of the model-maturation arc: Random Init → Pretraining → SFT/IFT → Preference FT → RLVR. It does not pretrain, supervise, or preference-align — it consumes a base model that has already been through those stages and turns its agentic actions into a dense, verifiable, 10-dimensional reward signal that GRPO can train on.

	It assumes a base model that has already been SFT-aligned. The environment itself does not perform SFT; it is an RL-only artifact. This satisfies Daniel's "Law of RL": The base model must get non-zero reward on Easy before it can meaningfully learn Hard.

	Built for: The OpenEnv Hackathon (Meta Platforms)
	Framework: OpenEnv (Meta's RL environment framework)
	License: BSD-style (Meta Platforms, Inc.)

	### Guide Alignment Summary

	\| Guide Section \| Requirement \| Our Implementation \|
	\|---\|---\|---\|
	\| §1 \| Step-by-step, programmatic verification, hard-but-possible \| 10 typed actions, 10-dim deterministic grader, Easy→Hard curriculum \|
	\| §4 \| Design env before trainer \| Env designed first; reset/step/state as first-class artifacts \|
	\| §6 \| Keep task simple at first \| 1000+ scenarios across 3 difficulty tiers enable curriculum learning \|
	\| §7 \| Multiple independent reward functions \| 10 dimensions consumed as `reward_funcs=[...]` by GRPOTrainer \|
	\| §8 \| Protect against reward hacking \| 8 distinct defenses mapped to guide's attack vectors \|
	\| §10 \| Right training stack \| Unsloth (QLoRA) + TRL (GRPO) + OpenEnv (transport) \|
	\| §11 \| Prefer GRPO/RLVR \| RLVR throughout; every reward is deterministic code (zero LLM-as-judge) \|
	\| §12 \| Keep inference fast \| Graph-delta injection + sparse nodes = rollout-latency optimizations \|
	\| §14 \| Scale only after stable \| All 9 components passed integration before any GRPO rollout \|

	---

	## 2. Core Idea & Innovation

	### The Problem
	Traditional cybersecurity training environments use static puzzles with fixed answers. Real SOC work requires dynamic reasoning under time pressure with incomplete information.

	### The Solution
	CyberSOC creates a fully dynamic, deterministic SOC simulation with:

	1. Procedural Scenario Generation — 1,003 unique attack scenarios (3 curated + 1,000 generated) from seed-based deterministic generation. Same seed = same scenario, enabling reproducible RL training.
	2. 13 Threat Categories — Ransomware, Phishing, Credential Theft, Lateral Movement, C2 Communication, Privilege Escalation, Data Exfiltration, Cryptomining, Supply Chain, Insider Threat, Webshell, Botnet, Malware.
	3. Adaptive Red Team — An adversary that reacts to agent actions: if you isolate a host, the attacker may pivot laterally. If you kill a process without blocking IOCs, it may reinfect with a `_v2` variant.
	4. 10-Dimensional Grading — Not a binary pass/fail. Agents are scored across 10 weighted dimensions for nuanced RL credit assignment. Zero LLM-as-a-judge.
	5. Business Continuity Constraints — Rash actions (isolating clean subnets, killing legitimate processes) cause business downtime penalties.
	6. TRL GRPO Integration — 10 reward functions that plug directly into Hugging Face's TRL `GRPOTrainer` for RL fine-tuning.

	---

	## 3. Architecture

	```
	MetaRound2/
	├── models.py # Pydantic data models (Observation, Action, State)
	├── client.py # WebSocket client for agent interaction
	├── __init__.py # Package exports
	├── inference.py # LLM baseline inference script
	├── dashboard_server.py # Dashboard + API server launcher
	├── pyproject.toml # Python package config
	├── Dockerfile # HuggingFace Spaces deployment
	├── openenv.yaml # 1003 task manifest
	├── validate_submission.sh # Hackathon submission validator
	│
	├── server/ # Backend environment engine
	│ ├── app.py # FastAPI application entry point
	│ ├── play_environment.py # Core environment (1284 lines)
	│ ├── tasks.py # Hand-crafted task definitions (easy/medium/hard)
	│ ├── task_generator.py # Procedural generation engine (1000+ tasks)
	│ ├── graders.py # 10-dimensional grading system
	│ ├── threat_graph.py # Typed knowledge graph
	│ ├── soar_playbooks.py # 5 SOAR playbook definitions
	│ ├── action_validation.py # 3-gate action validation middleware
	│ ├── tool_router.py # Phase state machine + triage solver
	│ ├── episode_sandbox.py # Wall-clock + step-limit guard
	│ ├── visualize_graph.py # PNG graph renderer (matplotlib/networkx)
	│ └── Dockerfile # Multi-stage Docker build
	│
	├── training/ # RL training integration
	│ └── reward_funcs.py # 10 TRL GRPO reward functions
	│
	├── dashboard/ # Real-time web dashboard
	│ ├── index.html # Main HTML (6 panels)
	│ ├── css/styles.css # Dark theme CSS (25KB)
	│ └── js/
	│ ├── app.js # Main dashboard logic (45KB)
	│ ├── graphs.js # D3.js threat graph + Chart.js (31KB)
	│ ├── api.js # REST API client
	│ └── animations.js # Micro-animations & effects
	│
	└── tests/ # 10 test files + integration suite
	├── test_integration.py
	└── test_task1.py ... test_task9.py
	```

	---

	## 4. Backend (Server)

	### 4.1 Core Environment — `play_environment.py`

	The heart of the project. `CyberSOCEnvironment` extends OpenEnv's `Environment` interface.

	Key features:
	- `reset(task_id)` — Builds the network, injects attack chains, initializes alert queue, seeds the ThreatGraph
	- `step(action)` — Processes one agent action, computes rewards, updates state, triggers adaptive adversary
	- Concurrent sessions — Each WebSocket connection gets its own environment instance
	- ActionMiddleware — Pre-flight validation (phase violations, graph-groundedness) before consuming a step

	10 Agent Actions:

	\| # \| Action \| Purpose \| Reward Range \|
	\|---\|--------\|---------\|-------------\|
	\| 1 \| `query_host` \| Map architecture, get endpoint info \| -0.05 to +0.05 \|
	\| 2 \| `run_forensics` \| Deep system artifact extraction \| -0.02 to +0.10 \|
	\| 3 \| `kill_process` \| Terminate malicious execution \| -0.08 to +0.25 \|
	\| 4 \| `block_ioc` \| Blacklist IOCs network-wide \| -0.03 to +0.15 \|
	\| 5 \| `isolate_segment` \| Quarantine subnet or host \| -0.10 to +0.15 \|
	\| 6 \| `correlate_alerts` \| Find shared entities across alerts \| ±0.05 \|
	\| 7 \| `enrich_ioc` \| Threat-intel enrichment (actor, TTPs) \| ±0.05 \|
	\| 8 \| `scan_host_vulnerabilities` \| Discover CVEs on a host \| ±0.05 \|
	\| 9 \| `trigger_playbook` \| Execute SOAR automated response \| ±0.10 \|
	\| 10 \| `submit_containment_plan` \| Final report — ends episode \| 0.0 to 1.0 \|

	### 4.2 Data Models — `models.py`

	All data flows through strict Pydantic models (429 lines):

	- Enums: `Severity`, `ThreatType` (13 types), `HostStatus`, `SubnetRole` (6 roles)
	- Sub-models: `Alert`, `HostInfo`, `NetworkTopology`, `ForensicsResult`, `TimelineEntry`
	- `SOCObservation` (extends OpenEnv `Observation`): 20+ fields including `alert_queue`, `network_topology`, `host_forensics`, `threat_graph_summary`, `reward_dimensions`, `available_playbooks`
	- Actions: Discriminated union of 10 action types via `SOCActionWrapper`
	- `SOCState` (internal): Tracks all episode state — killed processes, blocked IOCs, isolated subnets, etc.

	### 4.3 Task Definitions — `tasks.py`

	Three hand-crafted benchmark scenarios:

	\| Task \| Threats \| Hosts \| Max Steps \| Description \|
	\|------\|---------\|-------\|-----------\|-------------\|
	\| Easy \| 1 \| 1 \| 15 \| Single ransomware on WS-042 \|
	\| Medium \| 3 \| 4 \| 25 \| Phishing → credential theft → lateral movement across 3 subnets \|
	\| Hard \| 5 \| 7 \| 30 \| Full APT: phishing → C2 → privesc → exfil → ransomware \|

	Network: ~75 active hosts across 6 subnets (corporate, engineering, finance, DMZ, datacenter, executive) with realistic processes, ports, and criticality scores.

	### 4.4 Procedural Task Generator — `task_generator.py`

	Generates 1,000+ unique deterministic scenarios from a seed:

	- `hash(task_id)` → deterministic `random.Random` seed → drives ALL choices
	- Template pools: 90+ malware process names, 40 C2 domains, 36 C2 IPs, 12 ransomware extensions, 12 data types
	- 3 difficulty tiers: Easy (1 threat), Medium (2-3 threats, multi-stage chains), Hard (3-6 threats, APT campaigns)
	- Alert generation: Templated descriptions with randomized details (timestamps, file counts, data sizes)

	### 4.5 Grading System — `graders.py`

	10-dimensional weighted grading:

	\| Dimension \| Weight \| What It Measures \|
	\|-----------\|--------\|-----------------\|
	\| `threat_containment` \| 0.20 \| Fraction of required process kills completed \|
	\| `ioc_blocking` \| 0.12 \| Fraction of known IOCs blocked (penalizes blind blocking) \|
	\| `forensic_investigation` \| 0.10 \| Compromised hosts examined \|
	\| `siem_correlation` \| 0.08 \| Whether alerts were correlated (bonus for early correlation) \|
	\| `threat_intel_usage` \| 0.08 \| IOCs enriched with threat intel \|
	\| `vuln_root_cause` \| 0.08 \| CVE root causes discovered (bonus if cited in plan) \|
	\| `business_impact` \| 0.10 \| Penalizes unnecessary isolation and over-isolation (>20% = -0.30) \|
	\| `step_efficiency` \| 0.07 \| Rewards SOAR playbook usage, penalizes step overrun \|
	\| `plan_coverage` \| 0.10 \| Threats addressed in final plan \|
	\| `plan_evidence_quality` \| 0.07 \| Evidence confidence from ThreatGraph \|

	Anti-gaming: Per-occurrence penalty cap (±0.15), blind-blocking penalties, normalized evidence confidence.

	### 4.6 Threat Graph — `threat_graph.py`

	A typed knowledge graph tracking all SOC entities:

	- 5 Node Types: `HostNode`, `ProcessNode`, `IOCNode`, `VulnerabilityNode`, `AlertNode`
	- 6 Edge Types: `runs_on`, `involves`, `communicates_with`, `pivoted_from`, `part_of_chain`, `exploits`
	- 200-node cap with LRU IOC pruning
	- Version tracking with changelog for delta queries
	- Evidence confidence computation for plan quality scoring
	- Context summary generation for LLM injection

	### 4.7 SOAR Playbooks — `soar_playbooks.py`

	5 automated response playbooks with prerequisite validation:

	\| Playbook \| Prerequisites \| Sub-Actions \|
	\|----------\|--------------\|-------------\|
	\| `ransomware_containment` \| Forensics run, process identified \| kill_process, block_ioc \|
	\| `c2_disruption` \| IOC enriched, C2 IP identified \| block_ioc, isolate_segment \|
	\| `lateral_movement_lockdown` \| Forensics run, lateral movement detected \| kill_process, isolate_segment \|
	\| `phishing_response` \| Phishing vector confirmed \| enrich_ioc, block_ioc \|
	\| `data_exfil_stop` \| Forensics run, exfil destination identified \| block_ioc, kill_process \|

	### 4.8 Action Validation — `action_validation.py`

	3-gate middleware:
	1. Phase whitelist — Actions restricted by phase (triage/investigation/remediation/report)
	2. Schema validation — Required arguments checked
	3. Graph groundedness — Actions must reference discovered entities (can't block an IOC you haven't seen)

	### 4.9 Tool Router — `tool_router.py`

	Deterministic phase state machine:
	- Phases: `triage` → `investigation` → `remediation` → `report` → `done`
	- Loop limits: max 4 investigation loops, 3 remediation loops
	- Supports pushback — agent can justify staying in a phase with graph references

	Triage Solver: Priority = `severity_weight × criticality_weight × (1 + blast_radius/10)`

	### 4.10 Episode Sandbox — `episode_sandbox.py`

	Safety guardrails:
	- 120-second wall-clock timeout per episode
	- 20-step hard limit per episode
	- State integrity protection — Protected fields (`_task_def`, `_live_requirements`, `_threat_graph`) are snapshot-hashed; mutations are detected and rolled back
	- Hacking detection — Reports any external state tampering

	### 4.11 Adaptive Red Team

	Two mechanisms in `play_environment.py`:

	1. Reinfection (`_maybe_reinfect`): 30% chance when killing a process if IOCs in the chain are unblocked → spawns `process_v2` variant + CRITICAL alert
	2. Lateral Pivot (`_execute_lateral_pivot`): Triggered by isolate/kill actions on hard tasks → copies malware to adjacent healthy host, adds `pivoted_from` edge, emits PIVOT alert, updates live requirements

	Escalation: Probability increases when agent is slow (step > 10 with 0 containments).

	### 4.12 Server Application — `app.py`

	FastAPI app created via OpenEnv's `create_app()`:
	- POST /reset — Reset environment with task_id
	- POST /step — Execute an action
	- GET /state — Get current state
	- WS /ws — WebSocket for persistent sessions
	- CORS enabled for dashboard communication
	- Supports 4 concurrent environment instances

	---

	## 5. Frontend (Dashboard)

	### 5.1 Overview

	A real-time "CyberSOC Command Center" web dashboard with 6 panels, built with vanilla HTML/CSS/JS + D3.js + Chart.js.

	### 5.2 Six Dashboard Panels

	1. Alert Queue — Live SIEM/EDR alerts with severity badges and IOC indicators
	2. Live Threat Graph — D3.js force-directed graph with 5 node types, drag/zoom, glow effects, pivot animation
	3. Agent Actions — Chronological action log with reward tracking
	4. Network Topology — Visual subnet map with compromised/isolated counts
	5. Performance Metrics — Chart.js radar chart (10 dimensions) + cumulative reward timeline
	6. Mission Status — Containment progress bars, business impact gauge, active threat list, episode controls, 🔴 Red Team Toolkit (hotseat multiplayer)

	### 5.3 Visual Design

	- Dark theme with glassmorphism panels
	- Typography: Inter (UI) + JetBrains Mono (data)
	- Color system: Accent colors for cyan, green, amber, red, purple
	- Animations: Count-up numbers, scale bounces, pulse glows, screen flashes
	- Red Team pivot: Screen border flash, toast notification, traveling dot animation on pivot edges

	### 5.4 Key Frontend Components

	`graphs.js` (881 lines):
	- `ClientThreatGraph` — Client-side graph state manager synced from observations
	- `ThreatGraphViz` — D3.js v7 force simulation with SVG glow filters, curved edges, node symbols (circle/diamond/triangle/square/wye), click-to-highlight, drag behavior
	- `RadarChart` — Chart.js 10-axis radar for live grading dimensions
	- `RewardTimeline` — Gradient-filled cumulative reward line chart

	`app.js` (45KB) — Main orchestrator handling episode lifecycle, API calls, UI updates, phase indicator tracking

	`api.js` — REST client with auto-detection of server origin, session management, response parsing

	`animations.js` — Utility library for count-up, screen flash, toast notifications, scale bounce, pulse glow, dramatic final score reveal

	### 5.5 Dashboard Server — `dashboard_server.py`

	Wraps the FastAPI app to also serve the dashboard as static files at `/dashboard/`. Prints a styled ASCII banner on startup.

	---

	## 6. Inference & Training

	### 6.1 Inference Script — `inference.py`

	LLM baseline agent using OpenAI-compatible API:
	- System prompt defines SOC analyst role with all 6 core actions
	- Formats observations into structured text for the LLM
	- Parses JSON actions from LLM responses (with fallback extraction)
	- Runs episodes across easy/medium/hard tasks
	- Emits structured stdout logs: `[START]`, `[STEP]`, `[END]` (hackathon requirement)
	- Default model: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace Router

	### 6.2 GRPO Reward Functions — `training/reward_funcs.py`

	10 TRL-compatible reward functions for Group Relative Policy Optimization:

	```python
	from training.reward_funcs import make_soc_reward_funcs
	reward_fns = make_soc_reward_funcs("http://localhost:8000")
	trainer = GRPOTrainer(model=model, reward_funcs=reward_fns, args=GRPOConfig(...))
	```

	Each function:
	1. Parses completion as JSON action list
	2. Replays actions against live environment server
	3. Returns the specific dimension's score from `grade_breakdown`
	4. Non-parseable completions return 0.0

	### 6.3 Per-Step Reward Dimensions

	The environment computes live partial scores every step (`_compute_reward_dimensions`) for GRPO credit assignment without waiting for the terminal grade. These are exposed in `SOCObservation.reward_dimensions`.

	---

	## 7. Testing

	11 test files covering all major components:

	\| File \| Focus \|
	\|------\|-------\|
	\| `test_integration.py` \| Full episode flows, phase violations, adaptive pivots, 10-dim grading, sandbox limits \|
	\| `test_task1.py` - `test_task9.py` \| Individual task-specific validations \|

	Key integration tests:
	- Easy/medium episodes complete without crashes
	- All 10 action types can be exercised in a single episode
	- Phase violations return negative reward (not crash)
	- Adaptive pivot fires on hard tasks
	- Step rewards accumulate correctly and are idempotent
	- Grader returns exactly 10 dimensions
	- Sandbox step limit raises `EpisodeTimeout`

	---

	## 8. Deployment & DevOps

	### Docker
	- Root Dockerfile — Slim Python 3.10, serves on port 7860 (HuggingFace Spaces)
	- Server Dockerfile — Multi-stage build from `ghcr.io/meta-pytorch/openenv-base`, uses `uv` for dependency management, health check on `/health`

	### Validation
	`validate_submission.sh` — 3-step validator:
	1. Ping HF Space `/reset` endpoint
	2. Docker build succeeds
	3. `openenv validate` passes

	### OpenEnv Manifest
	`openenv.yaml` — 1,003 task definitions with descriptions, max steps, and difficulty tags. Used by the OpenEnv framework for task discovery and benchmarking.

	---

	## 9. Environment Variables

	\| Variable \| Purpose \| Default \|
	\|----------\|---------\|---------\|
	\| `API_BASE_URL` \| LLM API endpoint \| `https://router.huggingface.co/v1` \|
	\| `MODEL_NAME` \| Model identifier \| `Qwen/Qwen2.5-72B-Instruct` \|
	\| `HF_TOKEN` \| HuggingFace API key \| — \|

	---

	## 10. Data Flow

	```mermaid
	sequenceDiagram
	participant Agent as LLM Agent
	participant Inf as inference.py
	participant Env as CyberSOCEnvironment
	participant TG as ThreatGraph
	participant Gr as Grader

	Inf->>Env: reset(task_id="hard")
	Env->>TG: populate from task_def
	Env-->>Inf: SOCObservation (alerts, topology)

	loop Each Step
	Inf->>Agent: format_observation → LLM prompt
	Agent-->>Inf: JSON action
	Inf->>Env: step(SOCActionWrapper)
	Env->>Env: ActionMiddleware.validate()
	Env->>Env: Handle action (query/forensics/kill/etc)
	Env->>TG: Update graph nodes/edges
	Env->>Env: _adversary_react() (adaptive pivot)
	Env->>Env: _compute_reward_dimensions()
	Env-->>Inf: SOCObservation (updated state)
	end

	Inf->>Env: step(submit_containment_plan)
	Env->>Gr: grade_episode(actions, plan, graph, task_def, state)
	Gr-->>Env: {final_score, breakdown[10], penalties, bonuses}
	Env-->>Inf: SOCObservation (done=true, final_score)
	```

	---

	## 11. Red Team Design Philosophy

	The Red Team is NOT a separate LLM agent. It is a deterministic adversarial dynamics engine that defines the environment's state transition function.

	### 7 Behavioral Mechanisms
	1. Reactive Pivoting: Triggers on `isolate_segment` and `kill_process` (copy-not-move spread)
	2. Persistence: Reinfection triggers when a process is killed but its root IOC remains unblocked (teaches causal reasoning)
	3. Time Pressure: Pivot probability escalates +0.2 after step 10 if zero containments are achieved
	4. Controlled Randomness: Uses an episode-scoped `self._rng` (seeded by `task_id`) to ensure deterministic rollouts
	5. Noisy Observations: Benign processes mixed in host data
	6. Escalation: Pivot probabilities scale with difficulty (`Easy: 0.0`, `Medium: 0.3`, `Hard: 0.8`)
	7. Stall Punishment (new): If Blue makes 3+ consecutive passive actions (`query_host` / `pass_turn`) without containment, Red immediately deploys ransomware; plus a 15 % chance to spread laterally even on passive Blue turns

	### Attack Lifecycle Model (MITRE-aligned)
	`Phase 1: Compromise` → `Phase 2: Lateral Movement` → `Phase 3: Persistence` → `Phase 4: Escalation` → `Phase 5: Impact`

	---

	## 12. Reward-Hacking Defense Map

	Per guide §8, we implemented specific defenses against the known RL exploit vectors:

	\| Guide Attack Vector \| Our Defense \|
	\|---\|---\|
	\| Editing timers \| `EpisodeSandbox` wall-clock enforcement \|
	\| Caching results \| Idempotent step rewards via `_fired_step_rewards` \|
	\| Abusing globals \| Instance-scoped RNG + episode-scoped `self._rng` \|
	\| Mutating protected state \| Sandbox hash-snapshot + rollback \|
	\| Exploiting env bugs \| 3-gate validation middleware \|
	\| Reward-function gaming \| Evidence confidence normalization \|
	\| Cheating via blind remediation \| Graph-groundedness gate \|
	\| Blind IOC blocking \| Enrichment-before-block penalty \|

	---

	## 13. Curriculum Learning Strategy

	The 1000+ deterministic scenarios generated by `task_generator.py` are explicitly divided into three difficulty tiers to support Curriculum Learning (Guide §6).

	This exists precisely to satisfy Daniel's Law of RL: The base model must get non-zero reward on Easy before it can meaningfully learn Hard.

	- Phase 1 (Warm-Start): `gen_0001`–`gen_0333` (Easy). Single threat, 15 max steps, 0.0 pivot probability.
	- Phase 2 (Scaling): `gen_0334`–`gen_0666` (Medium). Multi-stage, 25 max steps, 0.3 pivot probability.
	- Phase 3 (Stress-Test): `gen_0667`–`gen_1000` (Hard). APT, 30 max steps, 0.8 pivot probability.

	The adaptive pivot probability is itself a curriculum signal; the environment gets harder as the agent gets better.

	---

	## 14. Intended Training Stack

	CyberSOCEnv is designed for the canonical stack specified in Guide §10:

	1. Unsloth: 4-bit QLoRA loading and efficient inference
	2. TRL: `GRPOTrainer` consuming our 10 independent callable functions via `reward_funcs=[...]`
	3. OpenEnv: WebSocket transport and session isolation
	4. vLLM: Serving the rollout workers for maximum throughput

	A reference adapter module exists at `training/reward_funcs.py` that mirrors the Unsloth 2048 notebook structure 1:1, allowing plug-and-play GRPO training.

	---

	## 15. Anti-Patterns Avoided

	How we avoided the 7 common mistakes listed in Guide §21:

	1. Building before designing env: Env, types, and sandbox were built and tested completely offline before any trainer was attached.
	2. LLM-as-a-judge: CyberSOCEnv uses zero LLM-as-judge signals. Everything is deterministic code against the ThreatGraph.
	3. Single monolithic reward: We use a 10-dimensional verifiable rubric, fed independently into TRL.
	4. Ignoring inference latency: We implemented Graph Delta Injection (~10x fewer tokens) and a sparse-node generation strategy (~75 active nodes) specifically to optimize GRPO rollout latency.
	5. No abuse prevention: 3-gate middleware + EpisodeSandbox explicitly prevent out-of-band cheating.
	6. Delayed deployment: Environment was packaged with Docker and deployed to HF Spaces early.
	7. Scaling prematurely: All 9 components passed integration testing (`test_integration.py` through `test_task9.py`) before scaling to 1000 tasks.

	---

	## 16. Known Deviations & Alignment Items

	While we strive to match the OpenEnv canonical scaffolding (Guide §5), there are a few intentional architectural differences:

	1. Action Dispatch: We use a discriminated union wrapper (`SOCActionWrapper` with a `type` field) rather than a single flat action class. This matches the MCP ToolCall pattern and real SOC work better than a flat action space.
	2. Decoupled Engine: The core logic lives in `server/play_environment.py`, completely separate from the FastAPI transport layer in `server/app.py`. This ensures we can run headless parallel environments during GRPO without HTTP overhead if needed.

	---

	## 17. Team Structure & Role Split

	Per guide §17, responsibilities are split across three functional roles to execute the RL pipeline effectively.

	### Role 1: Environment Engineer
	Mission: Build a deterministic, unhackable, fast environment.
	Owns: `play_environment.py`, `tasks.py`, `threat_graph.py`, `episode_sandbox.py`, `action_validation.py`, `tests/`
	- Scope: Implements the state machine, Red Team behavior, and validates actions. Owns the core `step()` and `reset()` loops. Ensures the environment parses valid inputs and securely handles invalid ones.
	- Hackathon Focus: Bug fixes, latency optimization (graph deltas), sandbox integrity, and procedural scenario generation.

	### Role 2: Reward Engineer
	Mission: Design the mathematical signals that shape model behavior.
	Owns: `graders.py`, `training/reward_funcs.py`, `models.py`
	- Scope: Creates the 10-dimensional verifiable grading logic. Plumbs the environment outputs into TRL-compatible `reward_funcs`. Tunes the penalties to prevent reward hacking (e.g., punishing blind IOC blocking).
	- Hackathon Focus: Ensuring the model gets positive step-rewards early on to prevent it from collapsing, while preventing it from finding "lazy" exploits.

	### Role 3: Training Engineer
	Mission: Execute the GRPO curriculum and produce the final model.
	Owns: `training/` directory, Colab notebooks, `inference.py`
	- Scope: Sets up the actual training loops using Unsloth and TRL. Manages the hyperparameter tuning, LoRA checkpointing, and vLLM inference configuration. Runs the curriculum from Easy to Hard.
	- Hackathon Focus: Capturing the before/after learning curves on held-out tasks to prove to the judges that the environment actually works to train a model.

	---

	## 18. Key Innovations Summary

	\| Innovation \| Description \|
	\|-----------\|-------------\|
	\| Procedural Generation \| SHA-256 seeded RNG generates 1000+ unique deterministic scenarios \|
	\| ThreatGraph \| Typed knowledge graph with version tracking, evidence confidence, and LRU pruning \|
	\| 10-Dim Grading \| Weighted multi-dimensional scoring replacing binary pass/fail \|
	\| Adaptive Red Team \| Attacker reacts to defender actions — lateral pivots and reinfection \|
	\| SOAR Playbooks \| Prerequisite-gated automated response workflows \|
	\| 3-Gate Validation \| Phase whitelist + schema + graph-groundedness prevents invalid actions \|
	\| Episode Sandbox \| State integrity protection with hash-based tampering detection \|
	\| Live GRPO Signals \| Per-step reward dimensions for RL credit assignment \|
	\| Anti-Gaming \| Blind-blocking penalties, over-isolation cap, idempotent step rewards (0.40 cap) \|
	\| Real-time Dashboard \| D3.js threat graph with pivot animations and 10-dim radar chart \|
	\| Hotseat Multiplayer \| Human Red Team player via in-dashboard toolkit; FSP backend; per-turn UI lock \|
	\| Stall Punishment \| 3 consecutive passive Blue actions triggers ransomware deploy + 15 % autonomous pivot \|
	\| Emergency Gate \| `isolate_segment` in triage requires a critical alert; `UNJUSTIFIED_EMERGENCY` → -0.15 penalty \|
	\| External Intel Feed \| `task_def.external_intel_feed` IOCs injected at reset — immediately blockable/enrichable \|
	\| Doomsday Clock \| `state.business_impact × 0.30` applied as a direct score modifier; 90 % negligence crush when no threats contained \|

	---

	## 19. Technology Stack

	\| Layer \| Technologies \|
	\|-------\|-------------\|
	\| Backend \| Python 3.10+, FastAPI, Uvicorn, Pydantic v2, OpenEnv Core \|
	\| Frontend \| Vanilla HTML/CSS/JS, D3.js v7, Chart.js v4, Inter/JetBrains Mono fonts \|
	\| Inference \| OpenAI Python SDK, asyncio \|
	\| Training \| TRL (Hugging Face), GRPO \|
	\| DevOps \| Docker (multi-stage), uv package manager, pytest \|
	\| Deployment \| HuggingFace Spaces (Docker SDK) \|
	\| Visualization \| NetworkX + Matplotlib (server-side PNG), D3.js (client-side interactive) \|