title: CyberSOC Upgraded RLVR
emoji: 🛡️
colorFrom: red
colorTo: purple
sdk: docker
pinned: false
CyberSOC: Complete Project Review
New: Hotseat Multiplayer Mode. A human can now play the Red Team live from the dashboard. After every Blue action the UI pauses, enables the 🔴 Red Team Toolkit, shows a `BLUE / RED TEAM TURN` indicator in the header, and resumes Blue auto-play only after the Red player submits their move. The backend now runs in FSP (Fictitious Self-Play) mode by default.
New: 4 Architectural Upgrades. Lazy Adversary (stall punishment + 15% autonomous pivot), Rigid Bureaucracy emergency-isolation gate (UNJUSTIFIED_EMERGENCY penalty), Siloed Intelligence external intel feed injection, and Depressed Analyst doomsday-clock direct modifier + negligence penalty.
1. Project Overview: RLVR Positioning
CyberSOC (CyberSOCEnv) is an RLVR-stage reinforcement learning environment that sits at the final rung of the model-maturation arc: Random Init → Pretraining → SFT/IFT → Preference FT → RLVR. It does not pretrain, supervise, or preference-align; it consumes a base model that has already been through those stages and turns its agentic actions into a dense, verifiable, 10-dimensional reward signal that GRPO can train on.
It assumes a base model that has already been SFT-aligned. The environment itself does not perform SFT; it is an RL-only artifact. This satisfies Daniel's "Law of RL": The base model must get non-zero reward on Easy before it can meaningfully learn Hard.
- Built for: The OpenEnv Hackathon (Meta Platforms)
- Framework: OpenEnv (Meta's RL environment framework)
- License: BSD-style (Meta Platforms, Inc.)
Guide Alignment Summary
| Guide Section | Requirement | Our Implementation |
|---|---|---|
| §1 | Step-by-step, programmatic verification, hard-but-possible | 10 typed actions, 10-dim deterministic grader, Easy→Hard curriculum |
| §4 | Design env before trainer | Env designed first; reset/step/state as first-class artifacts |
| §6 | Keep task simple at first | 1000+ scenarios across 3 difficulty tiers enable curriculum learning |
| §7 | Multiple independent reward functions | 10 dimensions consumed as reward_funcs=[...] by GRPOTrainer |
| §8 | Protect against reward hacking | 8 distinct defenses mapped to guide's attack vectors |
| §10 | Right training stack | Unsloth (QLoRA) + TRL (GRPO) + OpenEnv (transport) |
| §11 | Prefer GRPO/RLVR | RLVR throughout; every reward is deterministic code (zero LLM-as-judge) |
| §12 | Keep inference fast | Graph-delta injection + sparse nodes = rollout-latency optimizations |
| §14 | Scale only after stable | All 9 components passed integration before any GRPO rollout |
2. Core Idea & Innovation
The Problem
Traditional cybersecurity training environments use static puzzles with fixed answers. Real SOC work requires dynamic reasoning under time pressure with incomplete information.
The Solution
CyberSOC creates a fully dynamic, deterministic SOC simulation with:
- Procedural Scenario Generation: 1,003 unique attack scenarios (3 curated + 1,000 generated) from seed-based deterministic generation. Same seed = same scenario, enabling reproducible RL training.
- 13 Threat Categories: Ransomware, Phishing, Credential Theft, Lateral Movement, C2 Communication, Privilege Escalation, Data Exfiltration, Cryptomining, Supply Chain, Insider Threat, Webshell, Botnet, Malware.
- Adaptive Red Team: An adversary that reacts to agent actions. If you isolate a host, the attacker may pivot laterally; if you kill a process without blocking IOCs, it may reinfect with a `_v2` variant.
- 10-Dimensional Grading: Not a binary pass/fail. Agents are scored across 10 weighted dimensions for nuanced RL credit assignment. Zero LLM-as-a-judge.
- Business Continuity Constraints: Rash actions (isolating clean subnets, killing legitimate processes) cause business downtime penalties.
- TRL GRPO Integration: 10 reward functions that plug directly into Hugging Face's TRL `GRPOTrainer` for RL fine-tuning.
3. Architecture
MetaRound2/
├── models.py                 # Pydantic data models (Observation, Action, State)
├── client.py                 # WebSocket client for agent interaction
├── __init__.py               # Package exports
├── inference.py              # LLM baseline inference script
├── dashboard_server.py       # Dashboard + API server launcher
├── pyproject.toml            # Python package config
├── Dockerfile                # HuggingFace Spaces deployment
├── openenv.yaml              # 1003 task manifest
├── validate_submission.sh    # Hackathon submission validator
│
├── server/                   # Backend environment engine
│   ├── app.py                # FastAPI application entry point
│   ├── play_environment.py   # Core environment (1284 lines)
│   ├── tasks.py              # Hand-crafted task definitions (easy/medium/hard)
│   ├── task_generator.py     # Procedural generation engine (1000+ tasks)
│   ├── graders.py            # 10-dimensional grading system
│   ├── threat_graph.py       # Typed knowledge graph
│   ├── soar_playbooks.py     # 5 SOAR playbook definitions
│   ├── action_validation.py  # 3-gate action validation middleware
│   ├── tool_router.py        # Phase state machine + triage solver
│   ├── episode_sandbox.py    # Wall-clock + step-limit guard
│   ├── visualize_graph.py    # PNG graph renderer (matplotlib/networkx)
│   └── Dockerfile            # Multi-stage Docker build
│
├── training/                 # RL training integration
│   └── reward_funcs.py       # 10 TRL GRPO reward functions
│
├── dashboard/                # Real-time web dashboard
│   ├── index.html            # Main HTML (6 panels)
│   ├── css/styles.css        # Dark theme CSS (25KB)
│   └── js/
│       ├── app.js            # Main dashboard logic (45KB)
│       ├── graphs.js         # D3.js threat graph + Chart.js (31KB)
│       ├── api.js            # REST API client
│       └── animations.js     # Micro-animations & effects
│
└── tests/                    # 10 test files + integration suite
    ├── test_integration.py
    └── test_task1.py ... test_task9.py
4. Backend (Server)
4.1 Core Environment: play_environment.py
The heart of the project. CyberSOCEnvironment extends OpenEnv's Environment interface.
Key features:
- `reset(task_id)`: Builds the network, injects attack chains, initializes the alert queue, seeds the ThreatGraph
- `step(action)`: Processes one agent action, computes rewards, updates state, triggers the adaptive adversary
- Concurrent sessions: Each WebSocket connection gets its own environment instance
- ActionMiddleware: Pre-flight validation (phase violations, graph-groundedness) before consuming a step
10 Agent Actions:
| # | Action | Purpose | Reward Range |
|---|---|---|---|
| 1 | `query_host` | Map architecture, get endpoint info | -0.05 to +0.05 |
| 2 | `run_forensics` | Deep system artifact extraction | -0.02 to +0.10 |
| 3 | `kill_process` | Terminate malicious execution | -0.08 to +0.25 |
| 4 | `block_ioc` | Blacklist IOCs network-wide | -0.03 to +0.15 |
| 5 | `isolate_segment` | Quarantine subnet or host | -0.10 to +0.15 |
| 6 | `correlate_alerts` | Find shared entities across alerts | ±0.05 |
| 7 | `enrich_ioc` | Threat-intel enrichment (actor, TTPs) | ±0.05 |
| 8 | `scan_host_vulnerabilities` | Discover CVEs on a host | ±0.05 |
| 9 | `trigger_playbook` | Execute SOAR automated response | ±0.10 |
| 10 | `submit_containment_plan` | Final report; ends the episode | 0.0 to 1.0 |
4.2 Data Models: models.py
All data flows through strict Pydantic models (429 lines):
- Enums: `Severity`, `ThreatType` (13 types), `HostStatus`, `SubnetRole` (6 roles)
- Sub-models: `Alert`, `HostInfo`, `NetworkTopology`, `ForensicsResult`, `TimelineEntry`
- `SOCObservation` (extends OpenEnv `Observation`): 20+ fields including `alert_queue`, `network_topology`, `host_forensics`, `threat_graph_summary`, `reward_dimensions`, `available_playbooks`
- Actions: Discriminated union of 10 action types via `SOCActionWrapper`
- `SOCState` (internal): Tracks all episode state, including killed processes, blocked IOCs, and isolated subnets
4.3 Task Definitions: tasks.py
Three hand-crafted benchmark scenarios:
| Task | Threats | Hosts | Max Steps | Description |
|---|---|---|---|---|
| Easy | 1 | 1 | 15 | Single ransomware on WS-042 |
| Medium | 3 | 4 | 25 | Phishing → credential theft → lateral movement across 3 subnets |
| Hard | 5 | 7 | 30 | Full APT: phishing → C2 → privesc → exfil → ransomware |
Network: ~75 active hosts across 6 subnets (corporate, engineering, finance, DMZ, datacenter, executive) with realistic processes, ports, and criticality scores.
4.4 Procedural Task Generator: task_generator.py
Generates 1,000+ unique deterministic scenarios from a seed:
- `hash(task_id)` → deterministic `random.Random` seed → drives ALL choices
- Template pools: 90+ malware process names, 40 C2 domains, 36 C2 IPs, 12 ransomware extensions, 12 data types
- 3 difficulty tiers: Easy (1 threat), Medium (2-3 threats, multi-stage chains), Hard (3-6 threats, APT campaigns)
- Alert generation: Templated descriptions with randomized details (timestamps, file counts, data sizes)
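The seed-to-scenario idea above can be sketched as follows. This is a minimal illustration, not the project's generator: `rng_for_task` and `sample_scenario` are hypothetical helpers, and the pool values are placeholders. The sketch seeds from a SHA-256 digest because Python's built-in `hash()` is salted per process for strings, so it would not reproduce across runs by default.

```python
import hashlib
import random


def rng_for_task(task_id: str) -> random.Random:
    """Derive a reproducible RNG from a task id (hypothetical helper)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
    return random.Random(seed)


def sample_scenario(task_id: str) -> dict:
    """Draw every scenario choice from the single task-scoped RNG."""
    rng = rng_for_task(task_id)
    return {
        "threat_count": rng.randint(3, 6),                        # Hard tier range
        "c2_ip": f"203.0.113.{rng.randint(1, 254)}",              # placeholder pool
        "ransom_ext": rng.choice([".locked", ".enc", ".crypt"]),  # placeholder pool
    }
```

Calling `sample_scenario("gen_0667")` twice, even in separate processes, yields the same dictionary, which is the property reproducible rollouts rely on.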
4.5 Grading System: graders.py
10-dimensional weighted grading:
| Dimension | Weight | What It Measures |
|---|---|---|
| `threat_containment` | 0.20 | Fraction of required process kills completed |
| `ioc_blocking` | 0.12 | Fraction of known IOCs blocked (penalizes blind blocking) |
| `forensic_investigation` | 0.10 | Compromised hosts examined |
| `siem_correlation` | 0.08 | Whether alerts were correlated (bonus for early correlation) |
| `threat_intel_usage` | 0.08 | IOCs enriched with threat intel |
| `vuln_root_cause` | 0.08 | CVE root causes discovered (bonus if cited in plan) |
| `business_impact` | 0.10 | Penalizes unnecessary isolation and over-isolation (>20% = -0.30) |
| `step_efficiency` | 0.07 | Rewards SOAR playbook usage, penalizes step overrun |
| `plan_coverage` | 0.10 | Threats addressed in final plan |
| `plan_evidence_quality` | 0.07 | Evidence confidence from ThreatGraph |
Anti-gaming: Per-occurrence penalty cap (±0.15), blind-blocking penalties, normalized evidence confidence.
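To make the weighting concrete, here is a small sketch that combines per-dimension scores using the weights from the table. The aggregation rule (a clamped weighted sum) is an assumption for illustration; the real grader also applies penalties and bonuses that are not modeled here, and `final_score` is a hypothetical name.

```python
# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "threat_containment": 0.20,
    "ioc_blocking": 0.12,
    "forensic_investigation": 0.10,
    "siem_correlation": 0.08,
    "threat_intel_usage": 0.08,
    "vuln_root_cause": 0.08,
    "business_impact": 0.10,
    "step_efficiency": 0.07,
    "plan_coverage": 0.10,
    "plan_evidence_quality": 0.07,
}


def final_score(breakdown: dict[str, float]) -> float:
    """Weighted sum of dimension scores, each clamped to [0, 1] (assumed rule)."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(max(breakdown.get(name, 0.0), 0.0), 1.0)
        total += weight * score
    return round(total, 4)
```

Under this assumed rule, an agent that only completes every required kill and does nothing else would score 0.20, which shows why the remaining dimensions matter for credit assignment.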
4.6 Threat Graph: threat_graph.py
A typed knowledge graph tracking all SOC entities:
- 5 Node Types: `HostNode`, `ProcessNode`, `IOCNode`, `VulnerabilityNode`, `AlertNode`
- 6 Edge Types: `runs_on`, `involves`, `communicates_with`, `pivoted_from`, `part_of_chain`, `exploits`
- 200-node cap with LRU IOC pruning
- Version tracking with changelog for delta queries
- Evidence confidence computation for plan quality scoring
- Context summary generation for LLM injection
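The node cap with LRU IOC pruning listed above could look roughly like this. It is a sketch under assumptions: `TinyGraph` and its `touch` method are hypothetical, and only IOC nodes are eligible for eviction here, which is one reading of "LRU IOC pruning".

```python
from collections import OrderedDict

MAX_NODES = 200  # cap stated in the list above


class TinyGraph:
    """Hypothetical miniature graph: node cap enforced by evicting stale IOCs."""

    def __init__(self, max_nodes: int = MAX_NODES) -> None:
        self.max_nodes = max_nodes
        self.nodes = {}                 # node_id -> node type
        self._ioc_lru = OrderedDict()   # IOC ids, least recently used first

    def touch(self, node_id: str, node_type: str) -> None:
        """Add or refresh a node, then prune old IOCs while over the cap."""
        self.nodes[node_id] = node_type
        if node_type == "ioc":
            self._ioc_lru[node_id] = None
            self._ioc_lru.move_to_end(node_id)  # mark as most recently used
        while len(self.nodes) > self.max_nodes and self._ioc_lru:
            victim, _ = self._ioc_lru.popitem(last=False)
            del self.nodes[victim]
```

Hosts, processes, and alerts are never evicted in this sketch, so investigation context survives even when many IOCs are discovered.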
4.7 SOAR Playbooks: soar_playbooks.py
5 automated response playbooks with prerequisite validation:
| Playbook | Prerequisites | Sub-Actions |
|---|---|---|
| `ransomware_containment` | Forensics run, process identified | kill_process, block_ioc |
| `c2_disruption` | IOC enriched, C2 IP identified | block_ioc, isolate_segment |
| `lateral_movement_lockdown` | Forensics run, lateral movement detected | kill_process, isolate_segment |
| `phishing_response` | Phishing vector confirmed | enrich_ioc, block_ioc |
| `data_exfil_stop` | Forensics run, exfil destination identified | block_ioc, kill_process |
4.8 Action Validation: action_validation.py
3-gate middleware:
- Phase whitelist: Actions restricted by phase (triage/investigation/remediation/report)
- Schema validation: Required arguments checked
- Graph groundedness: Actions must reference discovered entities (can't block an IOC you haven't seen)
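The three gates above can be sketched as a single function that rejects at the first failing gate. Everything concrete here is an assumption: the per-phase whitelist contents, the argument names in `REQUIRED_ARGS`, and the rejection codes are illustrative only, and the emergency-isolation gate is not modeled.

```python
# Assumed phase whitelist; the real mapping lives in action_validation.py.
PHASE_WHITELIST = {
    "triage": {"query_host", "correlate_alerts"},
    "investigation": {"query_host", "run_forensics", "enrich_ioc",
                      "scan_host_vulnerabilities", "correlate_alerts"},
    "remediation": {"kill_process", "block_ioc", "isolate_segment", "trigger_playbook"},
    "report": {"submit_containment_plan"},
}
# Hypothetical argument names, used only for this sketch.
REQUIRED_ARGS = {
    "block_ioc": ("ioc_value",),
    "kill_process": ("hostname", "process_name"),
}


def validate(action: dict, phase: str, discovered: set[str]) -> tuple[bool, str]:
    """Run the three gates in order and report the first failure."""
    kind = action.get("type", "")
    if kind not in PHASE_WHITELIST.get(phase, set()):       # gate 1: phase whitelist
        return False, "phase_violation"
    required = REQUIRED_ARGS.get(kind, ())
    if any(arg not in action for arg in required):          # gate 2: schema
        return False, "schema_error"
    if any(action[arg] not in discovered for arg in required):  # gate 3: groundedness
        return False, "not_grounded"
    return True, "ok"
```

Because validation happens before the step is consumed, a rejected action can be penalized without mutating environment state.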
4.9 Tool Router: tool_router.py
Deterministic phase state machine:
- Phases: `triage` → `investigation` → `remediation` → `report` → `done`
- Loop limits: max 4 investigation loops, 3 remediation loops
- Supports pushback: the agent can justify staying in a phase with graph references
Triage Solver: Priority = severity_weight × criticality_weight × (1 + blast_radius/10)
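As a worked instance of the triage formula, the sketch below evaluates it directly. The numeric severity weights are assumed values chosen for illustration; only the shape of the formula comes from the text.

```python
# Assumed severity weights; the real values are defined in tool_router.py.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}


def triage_priority(severity: str, criticality_weight: float, blast_radius: int) -> float:
    """Priority = severity_weight x criticality_weight x (1 + blast_radius / 10)."""
    return SEVERITY_WEIGHT[severity] * criticality_weight * (1 + blast_radius / 10)


# A critical alert on a fully critical host that can reach 10 neighbours:
# 4.0 * 1.0 * (1 + 10 / 10) = 8.0
example = triage_priority("critical", 1.0, 10)
```

The blast-radius term is multiplicative, so a host with many reachable neighbours outranks an isolated host of the same severity and criticality.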
4.10 Episode Sandbox: episode_sandbox.py
Safety guardrails:
- 120-second wall-clock timeout per episode
- 20-step hard limit per episode
- State integrity protection: Protected fields (`_task_def`, `_live_requirements`, `_threat_graph`) are snapshot-hashed; mutations are detected and rolled back
- Hacking detection: Reports any external state tampering
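The snapshot-hash-and-rollback idea can be illustrated with a small guard object. This is a sketch, not the sandbox's code: `IntegrityGuard`, `digest`, and the JSON-based hashing are assumptions, and the protected fields are modeled as plain dictionaries.

```python
import copy
import hashlib
import json
from types import SimpleNamespace

PROTECTED_FIELDS = ("_task_def", "_live_requirements", "_threat_graph")


def digest(value) -> str:
    """Stable SHA-256 over a sorted JSON rendering of the value."""
    payload = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class IntegrityGuard:
    """Hypothetical guard: snapshot protected fields, detect and undo tampering."""

    def __init__(self, env) -> None:
        self.snapshot(env)

    def snapshot(self, env) -> None:
        # Call after every trusted environment step so legitimate updates are kept.
        self._backup = {f: copy.deepcopy(getattr(env, f)) for f in PROTECTED_FIELDS}
        self._hashes = {f: digest(v) for f, v in self._backup.items()}

    def check_and_restore(self, env) -> list:
        """Return the tampered field names after rolling each one back."""
        tampered = [f for f in PROTECTED_FIELDS if digest(getattr(env, f)) != self._hashes[f]]
        for field in tampered:
            setattr(env, field, copy.deepcopy(self._backup[field]))
        return tampered


env = SimpleNamespace(_task_def={"id": "easy"},
                      _live_requirements={"kills": ["evil.exe"]},
                      _threat_graph={"nodes": []})
guard = IntegrityGuard(env)
env._task_def["id"] = "tampered"        # simulated out-of-band mutation
report = guard.check_and_restore(env)   # lists the fields that were rolled back
```

Keeping a deep copy alongside the hash is what makes rollback possible; the hash alone would only support detection.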
4.11 Adaptive Red Team
Two mechanisms in play_environment.py:
- Reinfection (`_maybe_reinfect`): 30% chance when killing a process if IOCs in the chain are unblocked; spawns a `process_v2` variant plus a CRITICAL alert
- Lateral Pivot (`_execute_lateral_pivot`): Triggered by isolate/kill actions on hard tasks; copies malware to an adjacent healthy host, adds a `pivoted_from` edge, emits a PIVOT alert, and updates live requirements
Escalation: Probability increases when agent is slow (step > 10 with 0 containments).
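The probabilities described in this section can be put together in a short sketch. The base pivot values come from section 11 and the 30% and +0.2 figures from the text; applying the escalation to every tier and capping at 1.0 are assumptions, and both function names are hypothetical.

```python
import random

BASE_PIVOT = {"easy": 0.0, "medium": 0.3, "hard": 0.8}  # per-difficulty values from section 11
REINFECT_PROBABILITY = 0.30
SLOW_AGENT_BONUS = 0.2


def pivot_probability(difficulty: str, step: int, containments: int) -> float:
    """Base probability, escalated when the agent is slow (assumed additive, capped)."""
    p = BASE_PIVOT[difficulty]
    if step > 10 and containments == 0:
        p += SLOW_AGENT_BONUS
    return min(p, 1.0)


def maybe_reinfect(rng: random.Random, chain_iocs: set, blocked_iocs: set) -> bool:
    """Reinfection is only possible while some IOC in the chain is unblocked."""
    if chain_iocs <= blocked_iocs:   # every IOC in the chain is already blocked
        return False
    return rng.random() < REINFECT_PROBABILITY
```

Passing the episode-scoped RNG in explicitly keeps the randomness reproducible: the same seed and the same action sequence give the same adversary behavior.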
4.12 Server Application: app.py
FastAPI app created via OpenEnv's create_app():
- POST /reset: Reset environment with task_id
- POST /step: Execute an action
- GET /state: Get current state
- WS /ws: WebSocket for persistent sessions
- CORS enabled for dashboard communication
- Supports 4 concurrent environment instances
5. Frontend (Dashboard)
5.1 Overview
A real-time "CyberSOC Command Center" web dashboard with 6 panels, built with vanilla HTML/CSS/JS + D3.js + Chart.js.
5.2 Six Dashboard Panels
- Alert Queue: Live SIEM/EDR alerts with severity badges and IOC indicators
- Live Threat Graph: D3.js force-directed graph with 5 node types, drag/zoom, glow effects, pivot animation
- Agent Actions: Chronological action log with reward tracking
- Network Topology: Visual subnet map with compromised/isolated counts
- Performance Metrics: Chart.js radar chart (10 dimensions) + cumulative reward timeline
- Mission Status: Containment progress bars, business impact gauge, active threat list, episode controls, 🔴 Red Team Toolkit (hotseat multiplayer)
5.3 Visual Design
- Dark theme with glassmorphism panels
- Typography: Inter (UI) + JetBrains Mono (data)
- Color system: Accent colors for cyan, green, amber, red, purple
- Animations: Count-up numbers, scale bounces, pulse glows, screen flashes
- Red Team pivot: Screen border flash, toast notification, traveling dot animation on pivot edges
5.4 Key Frontend Components
graphs.js (881 lines):
- `ClientThreatGraph`: Client-side graph state manager synced from observations
- `ThreatGraphViz`: D3.js v7 force simulation with SVG glow filters, curved edges, node symbols (circle/diamond/triangle/square/wye), click-to-highlight, drag behavior
- `RadarChart`: Chart.js 10-axis radar for live grading dimensions
- `RewardTimeline`: Gradient-filled cumulative reward line chart

app.js (45KB): Main orchestrator handling episode lifecycle, API calls, UI updates, phase indicator tracking
api.js: REST client with auto-detection of server origin, session management, response parsing
animations.js: Utility library for count-up, screen flash, toast notifications, scale bounce, pulse glow, dramatic final score reveal
5.5 Dashboard Server: dashboard_server.py
Wraps the FastAPI app to also serve the dashboard as static files at /dashboard/. Prints a styled ASCII banner on startup.
6. Inference & Training
6.1 Inference Script: inference.py
LLM baseline agent using OpenAI-compatible API:
- System prompt defines SOC analyst role with all 6 core actions
- Formats observations into structured text for the LLM
- Parses JSON actions from LLM responses (with fallback extraction)
- Runs episodes across easy/medium/hard tasks
- Emits structured stdout logs: `[START]`, `[STEP]`, `[END]` (hackathon requirement)
- Default model: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace Router
6.2 GRPO Reward Functions: training/reward_funcs.py
10 TRL-compatible reward functions for Group Relative Policy Optimization:
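from trl import GRPOConfig, GRPOTrainer
from training.reward_funcs import make_soc_reward_funcs
# Build the 10 reward functions against a running environment server
reward_fns = make_soc_reward_funcs("http://localhost:8000")
trainer = GRPOTrainer(model=model, reward_funcs=reward_fns, args=GRPOConfig(...))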
Each function:
- Parses completion as JSON action list
- Replays actions against live environment server
- Returns the specific dimension's score from `grade_breakdown`
- Non-parseable completions return 0.0
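The four behaviours listed above can be sketched as a factory that builds one reward function per dimension. This is a hedged illustration: `make_dimension_reward` is hypothetical, the `grade` callable stands in for replaying actions against the live server, and the sketch assumes plain-text completions passed by keyword with one float returned per completion.

```python
import json
from typing import Callable


def make_dimension_reward(dimension: str, grade: Callable[[list], dict]) -> Callable[..., list]:
    """Build a reward function that scores one grading dimension (hypothetical helper)."""

    def reward_func(completions, **kwargs) -> list:
        scores = []
        for completion in completions:
            try:
                actions = json.loads(completion)       # parse completion as JSON
                if not isinstance(actions, list):      # must be an action list
                    raise ValueError("expected a JSON list of actions")
                breakdown = grade(actions)             # stand-in for the server replay
                scores.append(float(breakdown.get(dimension, 0.0)))
            except (ValueError, TypeError):
                scores.append(0.0)                     # non-parseable completions score 0.0
        return scores

    reward_func.__name__ = f"reward_{dimension}"
    return reward_func
```

Building the ten functions from one factory keeps each dimension independently callable while sharing the parsing and failure handling.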
6.3 Per-Step Reward Dimensions
The environment computes live partial scores every step (_compute_reward_dimensions) for GRPO credit assignment without waiting for the terminal grade. These are exposed in SOCObservation.reward_dimensions.
7. Testing
11 test files covering all major components:
| File | Focus |
|---|---|
| `test_integration.py` | Full episode flows, phase violations, adaptive pivots, 10-dim grading, sandbox limits |
| `test_task1.py` through `test_task9.py` | Individual task-specific validations |
Key integration tests:
- Easy/medium episodes complete without crashes
- All 10 action types can be exercised in a single episode
- Phase violations return negative reward (not crash)
- Adaptive pivot fires on hard tasks
- Step rewards accumulate correctly and are idempotent
- Grader returns exactly 10 dimensions
- Sandbox step limit raises `EpisodeTimeout`
8. Deployment & DevOps
Docker
- Root Dockerfile: Slim Python 3.10, serves on port 7860 (HuggingFace Spaces)
- Server Dockerfile: Multi-stage build from `ghcr.io/meta-pytorch/openenv-base`, uses `uv` for dependency management, health check on `/health`
Validation
validate_submission.sh is a 3-step validator:
- Ping HF Space `/reset` endpoint
- Docker build succeeds
- `openenv validate` passes
OpenEnv Manifest
openenv.yaml holds 1,003 task definitions with descriptions, max steps, and difficulty tags. It is used by the OpenEnv framework for task discovery and benchmarking.
9. Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | HuggingFace API key | (none) |
10. Data Flow
sequenceDiagram
participant Agent as LLM Agent
participant Inf as inference.py
participant Env as CyberSOCEnvironment
participant TG as ThreatGraph
participant Gr as Grader
Inf->>Env: reset(task_id="hard")
Env->>TG: populate from task_def
Env-->>Inf: SOCObservation (alerts, topology)
loop Each Step
Inf->>Agent: format_observation → LLM prompt
Agent-->>Inf: JSON action
Inf->>Env: step(SOCActionWrapper)
Env->>Env: ActionMiddleware.validate()
Env->>Env: Handle action (query/forensics/kill/etc)
Env->>TG: Update graph nodes/edges
Env->>Env: _adversary_react() (adaptive pivot)
Env->>Env: _compute_reward_dimensions()
Env-->>Inf: SOCObservation (updated state)
end
Inf->>Env: step(submit_containment_plan)
Env->>Gr: grade_episode(actions, plan, graph, task_def, state)
Gr-->>Env: {final_score, breakdown[10], penalties, bonuses}
Env-->>Inf: SOCObservation (done=true, final_score)
11. Red Team Design Philosophy
The Red Team is NOT a separate LLM agent. It is a deterministic adversarial dynamics engine that defines the environment's state transition function.
7 Behavioral Mechanisms
- Reactive Pivoting: Triggers on `isolate_segment` and `kill_process` (copy-not-move spread)
- Persistence: Reinfection triggers when a process is killed but its root IOC remains unblocked (teaches causal reasoning)
- Time Pressure: Pivot probability escalates +0.2 after step 10 if zero containments are achieved
- Controlled Randomness: Uses an episode-scoped `self._rng` (seeded by `task_id`) to ensure deterministic rollouts
- Noisy Observations: Benign processes mixed in host data
- Escalation: Pivot probabilities scale with difficulty (`Easy: 0.0`, `Medium: 0.3`, `Hard: 0.8`)
- Stall Punishment (new): If Blue makes 3+ consecutive passive actions (`query_host` / `pass_turn`) without containment, Red immediately deploys ransomware; there is also a 15% chance to spread laterally even on passive Blue turns
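The stall-punishment rule in the last item can be sketched as a streak counter. The class name and the choice to reset the streak on any non-passive action are assumptions; the set of passive actions, the threshold of 3, and the 15% figure come from the text.

```python
import random

PASSIVE_ACTIONS = {"query_host", "pass_turn"}
STALL_LIMIT = 3              # consecutive passive actions before Red strikes
PASSIVE_SPREAD_CHANCE = 0.15


class StallTracker:
    """Hypothetical tracker for consecutive passive Blue turns."""

    def __init__(self) -> None:
        self.passive_streak = 0

    def observe(self, action_type: str, contained_something: bool) -> bool:
        """Return True when Red should deploy ransomware immediately."""
        if action_type in PASSIVE_ACTIONS and not contained_something:
            self.passive_streak += 1
        else:
            self.passive_streak = 0   # assumed: any active turn resets the streak
        return self.passive_streak >= STALL_LIMIT


def passive_turn_spread(rng: random.Random) -> bool:
    """15% chance of a lateral spread on a passive Blue turn."""
    return rng.random() < PASSIVE_SPREAD_CHANCE
```

Three `query_host` calls in a row trip the tracker on the third call, while a `kill_process` in between resets it.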
Attack Lifecycle Model (MITRE-aligned)
Phase 1: Compromise → Phase 2: Lateral Movement → Phase 3: Persistence → Phase 4: Escalation → Phase 5: Impact
12. Reward-Hacking Defense Map
Per guide §8, we implemented specific defenses against the known RL exploit vectors:
| Guide Attack Vector | Our Defense |
|---|---|
| Editing timers | EpisodeSandbox wall-clock enforcement |
| Caching results | Idempotent step rewards via _fired_step_rewards |
| Abusing globals | Instance-scoped RNG + episode-scoped self._rng |
| Mutating protected state | Sandbox hash-snapshot + rollback |
| Exploiting env bugs | 3-gate validation middleware |
| Reward-function gaming | Evidence confidence normalization |
| Cheating via blind remediation | Graph-groundedness gate |
| Blind IOC blocking | Enrichment-before-block penalty |
13. Curriculum Learning Strategy
The 1000+ deterministic scenarios generated by task_generator.py are explicitly divided into three difficulty tiers to support Curriculum Learning (Guide §6).
This exists precisely to satisfy Daniel's Law of RL: The base model must get non-zero reward on Easy before it can meaningfully learn Hard.
- Phase 1 (Warm-Start): `gen_0001` to `gen_0333` (Easy). Single threat, 15 max steps, 0.0 pivot probability.
- Phase 2 (Scaling): `gen_0334` to `gen_0666` (Medium). Multi-stage, 25 max steps, 0.3 pivot probability.
- Phase 3 (Stress-Test): `gen_0667` to `gen_1000` (Hard). APT, 30 max steps, 0.8 pivot probability.
The adaptive pivot probability is itself a curriculum signal; the environment gets harder as the agent gets better.
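The tier boundaries above translate into a simple lookup. The sketch below is illustrative only: `tier_for_task` and `curriculum_order` are hypothetical helpers, while the ranges, step limits, and pivot probabilities are taken from the three phases listed.

```python
# (first index, last index, tier settings) taken from the phases above.
TIERS = [
    (1, 333, {"difficulty": "easy", "max_steps": 15, "pivot_probability": 0.0}),
    (334, 666, {"difficulty": "medium", "max_steps": 25, "pivot_probability": 0.3}),
    (667, 1000, {"difficulty": "hard", "max_steps": 30, "pivot_probability": 0.8}),
]


def tier_for_task(task_id: str) -> dict:
    """Map a generated task id such as 'gen_0334' to its curriculum tier."""
    prefix, _, number = task_id.partition("_")
    if prefix != "gen" or not number.isdigit():
        raise ValueError(f"not a generated task id: {task_id!r}")
    index = int(number)
    for low, high, tier in TIERS:
        if low <= index <= high:
            return tier
    raise ValueError(f"task index out of range: {index}")


def curriculum_order() -> list:
    """Task ids in Easy, Medium, Hard order for a warm-start schedule."""
    return [f"gen_{i:04d}" for i in range(1, 1001)]
```

A trainer could iterate `curriculum_order()` and advance to the next tier only once the mean reward on the current tier is reliably non-zero.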
14. Intended Training Stack
CyberSOCEnv is designed for the canonical stack specified in Guide §10:
- Unsloth: 4-bit QLoRA loading and efficient inference
- TRL: `GRPOTrainer` consuming our 10 independent callable functions via `reward_funcs=[...]`
- OpenEnv: WebSocket transport and session isolation
- vLLM: Serving the rollout workers for maximum throughput
A reference adapter module exists at training/reward_funcs.py that mirrors the Unsloth 2048 notebook structure 1:1, allowing plug-and-play GRPO training.
15. Anti-Patterns Avoided
How we avoided the 7 common mistakes listed in Guide §21:
- Building before designing env: Env, types, and sandbox were built and tested completely offline before any trainer was attached.
- LLM-as-a-judge: CyberSOCEnv uses zero LLM-as-judge signals. Everything is deterministic code against the ThreatGraph.
- Single monolithic reward: We use a 10-dimensional verifiable rubric, fed independently into TRL.
- Ignoring inference latency: We implemented Graph Delta Injection (10x fewer tokens) and a sparse-node generation strategy (75 active nodes) specifically to optimize GRPO rollout latency.
- No abuse prevention: 3-gate middleware + EpisodeSandbox explicitly prevent out-of-band cheating.
- Delayed deployment: Environment was packaged with Docker and deployed to HF Spaces early.
- Scaling prematurely: All 9 components passed integration testing (`test_integration.py` through `test_task9.py`) before scaling to 1000 tasks.
16. Known Deviations & Alignment Items
While we strive to match the OpenEnv canonical scaffolding (Guide §5), there are a few intentional architectural differences:
- Action Dispatch: We use a discriminated union wrapper (`SOCActionWrapper` with a `type` field) rather than a single flat action class. This matches the MCP ToolCall pattern and real SOC work better than a flat action space.
- Decoupled Engine: The core logic lives in `server/play_environment.py`, completely separate from the FastAPI transport layer in `server/app.py`. This ensures we can run headless parallel environments during GRPO without HTTP overhead if needed.
17. Team Structure & Role Split
Per guide §17, responsibilities are split across three functional roles to execute the RL pipeline effectively.
Role 1: Environment Engineer
Mission: Build a deterministic, unhackable, fast environment.
Owns: play_environment.py, tasks.py, threat_graph.py, episode_sandbox.py, action_validation.py, tests/
- Scope: Implements the state machine, Red Team behavior, and validates actions. Owns the core `step()` and `reset()` loops. Ensures the environment parses valid inputs and securely handles invalid ones.
- Hackathon Focus: Bug fixes, latency optimization (graph deltas), sandbox integrity, and procedural scenario generation.
Role 2: Reward Engineer
Mission: Design the mathematical signals that shape model behavior.
Owns: graders.py, training/reward_funcs.py, models.py
- Scope: Creates the 10-dimensional verifiable grading logic. Plumbs the environment outputs into TRL-compatible `reward_funcs`. Tunes the penalties to prevent reward hacking (e.g., punishing blind IOC blocking).
- Hackathon Focus: Ensuring the model gets positive step-rewards early on to prevent it from collapsing, while preventing it from finding "lazy" exploits.
Role 3: Training Engineer
Mission: Execute the GRPO curriculum and produce the final model.
Owns: training/ directory, Colab notebooks, inference.py
- Scope: Sets up the actual training loops using Unsloth and TRL. Manages the hyperparameter tuning, LoRA checkpointing, and vLLM inference configuration. Runs the curriculum from Easy to Hard.
- Hackathon Focus: Capturing the before/after learning curves on held-out tasks to prove to the judges that the environment actually works to train a model.
18. Key Innovations Summary
| Innovation | Description |
|---|---|
| Procedural Generation | SHA-256 seeded RNG generates 1000+ unique deterministic scenarios |
| ThreatGraph | Typed knowledge graph with version tracking, evidence confidence, and LRU pruning |
| 10-Dim Grading | Weighted multi-dimensional scoring replacing binary pass/fail |
| Adaptive Red Team | Attacker reacts to defender actions with lateral pivots and reinfection |
| SOAR Playbooks | Prerequisite-gated automated response workflows |
| 3-Gate Validation | Phase whitelist + schema + graph-groundedness prevents invalid actions |
| Episode Sandbox | State integrity protection with hash-based tampering detection |
| Live GRPO Signals | Per-step reward dimensions for RL credit assignment |
| Anti-Gaming | Blind-blocking penalties, over-isolation cap, idempotent step rewards (0.40 cap) |
| Real-time Dashboard | D3.js threat graph with pivot animations and 10-dim radar chart |
| Hotseat Multiplayer | Human Red Team player via in-dashboard toolkit; FSP backend; per-turn UI lock |
| Stall Punishment | 3 consecutive passive Blue actions triggers ransomware deploy + 15 % autonomous pivot |
| Emergency Gate | isolate_segment in triage requires a critical alert; UNJUSTIFIED_EMERGENCY → -0.15 penalty |
| External Intel Feed | task_def.external_intel_feed IOCs injected at reset, immediately blockable/enrichable |
| Doomsday Clock | state.business_impact × 0.30 applied as a direct score modifier; 90% negligence crush when no threats contained |
19. Technology Stack
| Layer | Technologies |
|---|---|
| Backend | Python 3.10+, FastAPI, Uvicorn, Pydantic v2, OpenEnv Core |
| Frontend | Vanilla HTML/CSS/JS, D3.js v7, Chart.js v4, Inter/JetBrains Mono fonts |
| Inference | OpenAI Python SDK, asyncio |
| Training | TRL (Hugging Face), GRPO |
| DevOps | Docker (multi-stage), uv package manager, pytest |
| Deployment | HuggingFace Spaces (Docker SDK) |
| Visualization | NetworkX + Matplotlib (server-side PNG), D3.js (client-side interactive) |