Ajayyy00
title: CyberSOC Upgraded RLVR
emoji: 🛡️
colorFrom: red
colorTo: purple
sdk: docker
pinned: false

CyberSOC: Complete Project Review

New: Hotseat Multiplayer Mode. A human can now play the Red Team live from the dashboard. After every Blue action the UI pauses, enables the 🔴 Red Team Toolkit, shows a BLUE / RED TEAM TURN indicator in the header, and resumes Blue auto-play only after the Red player submits a move. The backend now runs in FSP (Fictitious Self-Play) mode by default.

New: 4 Architectural Upgrades. Lazy Adversary (stall punishment + 15% autonomous pivot), Rigid Bureaucracy emergency-isolation gate (UNJUSTIFIED_EMERGENCY penalty), Siloed Intelligence external intel feed injection, and Depressed Analyst doomsday-clock direct modifier + negligence penalty.

1. Project Overview: RLVR Positioning

CyberSOC (CyberSOCEnv) is an RLVR-stage reinforcement learning environment that sits at the final rung of the model-maturation arc: Random Init → Pretraining → SFT/IFT → Preference FT → RLVR. It does not pretrain, supervise, or preference-align; it consumes a base model that has already been through those stages and turns its agentic actions into a dense, verifiable, 10-dimensional reward signal that GRPO can train on.

It assumes a base model that has already been SFT-aligned. The environment itself does not perform SFT; it is an RL-only artifact. This satisfies Daniel's "Law of RL": The base model must get non-zero reward on Easy before it can meaningfully learn Hard.

Built for: The OpenEnv Hackathon (Meta Platforms)
Framework: OpenEnv (Meta's RL environment framework)
License: BSD-style (Meta Platforms, Inc.)

Guide Alignment Summary

| Guide Section | Requirement | Our Implementation |
|---|---|---|
| §1 | Step-by-step, programmatic verification, hard-but-possible | 10 typed actions, 10-dim deterministic grader, Easy→Hard curriculum |
| §4 | Design env before trainer | Env designed first; reset/step/state as first-class artifacts |
| §6 | Keep task simple at first | 1000+ scenarios across 3 difficulty tiers enable curriculum learning |
| §7 | Multiple independent reward functions | 10 dimensions consumed as reward_funcs=[...] by GRPOTrainer |
| §8 | Protect against reward hacking | 8 distinct defenses mapped to guide's attack vectors |
| §10 | Right training stack | Unsloth (QLoRA) + TRL (GRPO) + OpenEnv (transport) |
| §11 | Prefer GRPO/RLVR | RLVR throughout; every reward is deterministic code (zero LLM-as-judge) |
| §12 | Keep inference fast | Graph-delta injection + sparse nodes = rollout-latency optimizations |
| §14 | Scale only after stable | All 9 components passed integration before any GRPO rollout |

2. Core Idea & Innovation

The Problem

Traditional cybersecurity training environments use static puzzles with fixed answers. Real SOC work requires dynamic reasoning under time pressure with incomplete information.

The Solution

CyberSOC creates a fully dynamic, deterministic SOC simulation with:

  1. Procedural Scenario Generation: 1,003 unique attack scenarios (3 curated + 1,000 generated) from seed-based deterministic generation. Same seed = same scenario, enabling reproducible RL training.
  2. 13 Threat Categories: Ransomware, Phishing, Credential Theft, Lateral Movement, C2 Communication, Privilege Escalation, Data Exfiltration, Cryptomining, Supply Chain, Insider Threat, Webshell, Botnet, Malware.
  3. Adaptive Red Team: An adversary that reacts to agent actions. If you isolate a host, the attacker may pivot laterally. If you kill a process without blocking IOCs, it may reinfect with a _v2 variant.
  4. 10-Dimensional Grading: Not a binary pass/fail. Agents are scored across 10 weighted dimensions for nuanced RL credit assignment. Zero LLM-as-a-judge.
  5. Business Continuity Constraints: Rash actions (isolating clean subnets, killing legitimate processes) cause business downtime penalties.
  6. TRL GRPO Integration: 10 reward functions that plug directly into Hugging Face's TRL GRPOTrainer for RL fine-tuning.

3. Architecture

MetaRound2/
├── models.py              # Pydantic data models (Observation, Action, State)
├── client.py              # WebSocket client for agent interaction
├── __init__.py            # Package exports
├── inference.py           # LLM baseline inference script
├── dashboard_server.py    # Dashboard + API server launcher
├── pyproject.toml         # Python package config
├── Dockerfile             # HuggingFace Spaces deployment
├── openenv.yaml           # 1003 task manifest
├── validate_submission.sh # Hackathon submission validator
│
├── server/                # Backend environment engine
│   ├── app.py             # FastAPI application entry point
│   ├── play_environment.py # Core environment (1284 lines)
│   ├── tasks.py           # Hand-crafted task definitions (easy/medium/hard)
│   ├── task_generator.py  # Procedural generation engine (1000+ tasks)
│   ├── graders.py         # 10-dimensional grading system
│   ├── threat_graph.py    # Typed knowledge graph
│   ├── soar_playbooks.py  # 5 SOAR playbook definitions
│   ├── action_validation.py # 3-gate action validation middleware
│   ├── tool_router.py     # Phase state machine + triage solver
│   ├── episode_sandbox.py # Wall-clock + step-limit guard
│   ├── visualize_graph.py # PNG graph renderer (matplotlib/networkx)
│   └── Dockerfile         # Multi-stage Docker build
│
├── training/              # RL training integration
│   └── reward_funcs.py    # 10 TRL GRPO reward functions
│
├── dashboard/             # Real-time web dashboard
│   ├── index.html         # Main HTML (6 panels)
│   ├── css/styles.css     # Dark theme CSS (25KB)
│   └── js/
│       ├── app.js         # Main dashboard logic (45KB)
│       ├── graphs.js      # D3.js threat graph + Chart.js (31KB)
│       ├── api.js         # REST API client
│       └── animations.js  # Micro-animations & effects
│
└── tests/                 # 10 test files + integration suite
    ├── test_integration.py
    └── test_task1.py ... test_task9.py

4. Backend (Server)

4.1 Core Environment: play_environment.py

The heart of the project. CyberSOCEnvironment extends OpenEnv's Environment interface.

Key features:

  • reset(task_id): Builds the network, injects attack chains, initializes the alert queue, seeds the ThreatGraph
  • step(action): Processes one agent action, computes rewards, updates state, triggers the adaptive adversary
  • Concurrent sessions: Each WebSocket connection gets its own environment instance
  • ActionMiddleware: Pre-flight validation (phase violations, graph-groundedness) before consuming a step

10 Agent Actions:

| # | Action | Purpose | Reward Range |
|---|---|---|---|
| 1 | query_host | Map architecture, get endpoint info | -0.05 to +0.05 |
| 2 | run_forensics | Deep system artifact extraction | -0.02 to +0.10 |
| 3 | kill_process | Terminate malicious execution | -0.08 to +0.25 |
| 4 | block_ioc | Blacklist IOCs network-wide | -0.03 to +0.15 |
| 5 | isolate_segment | Quarantine subnet or host | -0.10 to +0.15 |
| 6 | correlate_alerts | Find shared entities across alerts | ±0.05 |
| 7 | enrich_ioc | Threat-intel enrichment (actor, TTPs) | ±0.05 |
| 8 | scan_host_vulnerabilities | Discover CVEs on a host | ±0.05 |
| 9 | trigger_playbook | Execute SOAR automated response | ±0.10 |
| 10 | submit_containment_plan | Final report; ends the episode | 0.0 to 1.0 |

4.2 Data Models: models.py

All data flows through strict Pydantic models (429 lines):

  • Enums: Severity, ThreatType (13 types), HostStatus, SubnetRole (6 roles)
  • Sub-models: Alert, HostInfo, NetworkTopology, ForensicsResult, TimelineEntry
  • SOCObservation (extends OpenEnv Observation): 20+ fields including alert_queue, network_topology, host_forensics, threat_graph_summary, reward_dimensions, available_playbooks
  • Actions: Discriminated union of 10 action types via SOCActionWrapper
  • SOCState (internal): Tracks all episode state, including killed processes, blocked IOCs, and isolated subnets
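
To make the discriminated-union idea concrete, here is a minimal sketch of dispatching on a `type` field. It uses only standard-library dataclasses as a stand-in for the Pydantic models, and the class names, field names (`hostname`, `ioc_value`), and `parse_action` helper are hypothetical; the real definitions live in models.py.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class QueryHost:
    hostname: str  # hypothetical field name
    type: str = "query_host"


@dataclass(frozen=True)
class BlockIOC:
    ioc_value: str  # hypothetical field name
    type: str = "block_ioc"


Action = Union[QueryHost, BlockIOC]

# The "type" tag selects which concrete action class to build.
_REGISTRY = {"query_host": QueryHost, "block_ioc": BlockIOC}


def parse_action(payload: dict) -> Action:
    """Build a typed action from a JSON-like dict, dispatching on payload['type']."""
    kind = payload.get("type")
    cls = _REGISTRY.get(kind)
    if cls is None:
        raise ValueError(f"unknown action type: {kind!r}")
    fields = {k: v for k, v in payload.items() if k != "type"}
    return cls(**fields)  # raises TypeError on missing or unexpected fields
```

With Pydantic v2 the same dispatch would normally be declared on the wrapper model rather than written by hand; the sketch only shows the shape of the pattern.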

4.3 Task Definitions: tasks.py

Three hand-crafted benchmark scenarios:

| Task | Threats | Hosts | Max Steps | Description |
|---|---|---|---|---|
| Easy | 1 | 1 | 15 | Single ransomware on WS-042 |
| Medium | 3 | 4 | 25 | Phishing → credential theft → lateral movement across 3 subnets |
| Hard | 5 | 7 | 30 | Full APT: phishing → C2 → privesc → exfil → ransomware |

Network: ~75 active hosts across 6 subnets (corporate, engineering, finance, DMZ, datacenter, executive) with realistic processes, ports, and criticality scores.

4.4 Procedural Task Generator: task_generator.py

Generates 1,000+ unique deterministic scenarios from a seed:

  • hash(task_id) → deterministic random.Random seed → drives ALL choices
  • Template pools: 90+ malware process names, 40 C2 domains, 36 C2 IPs, 12 ransomware extensions, 12 data types
  • 3 difficulty tiers: Easy (1 threat), Medium (2-3 threats, multi-stage chains), Hard (3-6 threats, APT campaigns)
  • Alert generation: Templated descriptions with randomized details (timestamps, file counts, data sizes)
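
The seeding step above can be sketched as follows. This is an illustration, not the generator's actual code: it assumes a SHA-256 digest of the task ID is used as the seed (Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, so it would not reproduce across runs), and the template pools and scenario fields are hypothetical placeholders.

```python
import hashlib
import random


def rng_for_task(task_id: str) -> random.Random:
    """Derive a process-independent RNG from a task ID (illustrative helper)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    return random.Random(seed)


# Hypothetical template pools; the real pools live in task_generator.py.
PROCESS_NAMES = ["svch0st", "updater32", "winlogin"]
C2_DOMAINS = ["update-cdn.example", "sync-mail.example", "static-assets.example"]


def generate_scenario(task_id: str) -> dict:
    """Every random choice is drawn from the task-scoped RNG, never the global one."""
    rng = rng_for_task(task_id)
    return {
        "task_id": task_id,
        "process": rng.choice(PROCESS_NAMES),
        "c2_domain": rng.choice(C2_DOMAINS),
        "encrypted_files": rng.randint(50, 5000),
    }
```

Calling `generate_scenario("gen_0001")` twice, even in different processes, yields the same dictionary, which is the property reproducible RL rollouts depend on.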

4.5 Grading System: graders.py

10-dimensional weighted grading:

| Dimension | Weight | What It Measures |
|---|---|---|
| threat_containment | 0.20 | Fraction of required process kills completed |
| ioc_blocking | 0.12 | Fraction of known IOCs blocked (penalizes blind blocking) |
| forensic_investigation | 0.10 | Compromised hosts examined |
| siem_correlation | 0.08 | Whether alerts were correlated (bonus for early correlation) |
| threat_intel_usage | 0.08 | IOCs enriched with threat intel |
| vuln_root_cause | 0.08 | CVE root causes discovered (bonus if cited in plan) |
| business_impact | 0.10 | Penalizes unnecessary isolation and over-isolation (>20% = -0.30) |
| step_efficiency | 0.07 | Rewards SOAR playbook usage, penalizes step overrun |
| plan_coverage | 0.10 | Threats addressed in final plan |
| plan_evidence_quality | 0.07 | Evidence confidence from ThreatGraph |

Anti-gaming: Per-occurrence penalty cap (±0.15), blind-blocking penalties, normalized evidence confidence.
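
As a rough illustration of how the ten weights combine, the sketch below computes a plain weighted sum. It assumes each dimension score lies in [0, 1] and omits the penalties and bonuses that graders.py applies; `final_score` is a hypothetical helper name.

```python
# Weights copied from the grading table; they sum to 1.0.
WEIGHTS = {
    "threat_containment": 0.20,
    "ioc_blocking": 0.12,
    "forensic_investigation": 0.10,
    "siem_correlation": 0.08,
    "threat_intel_usage": 0.08,
    "vuln_root_cause": 0.08,
    "business_impact": 0.10,
    "step_efficiency": 0.07,
    "plan_coverage": 0.10,
    "plan_evidence_quality": 0.07,
}


def final_score(breakdown: dict) -> float:
    """Weighted sum of per-dimension scores, each clamped to [0, 1]."""
    missing = WEIGHTS.keys() - breakdown.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    total = sum(w * min(max(breakdown[d], 0.0), 1.0) for d, w in WEIGHTS.items())
    return round(total, 6)
```

Because the weights sum to 1.0, a perfect breakdown scores 1.0 and an agent that only contains threats (and nothing else) tops out at 0.20.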

4.6 Threat Graph: threat_graph.py

A typed knowledge graph tracking all SOC entities:

  • 5 Node Types: HostNode, ProcessNode, IOCNode, VulnerabilityNode, AlertNode
  • 6 Edge Types: runs_on, involves, communicates_with, pivoted_from, part_of_chain, exploits
  • 200-node cap with LRU IOC pruning
  • Version tracking with changelog for delta queries
  • Evidence confidence computation for plan quality scoring
  • Context summary generation for LLM injection
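
The node cap, LRU pruning, and changelog-based deltas described above could look roughly like this. It is a simplified sketch: `MiniThreatGraph` and its methods are hypothetical names, nodes are plain dicts instead of typed classes, and only IOC nodes are eligible for eviction.

```python
from collections import OrderedDict


class MiniThreatGraph:
    """Toy typed graph with a node cap, LRU pruning of IOC nodes, and a changelog."""

    def __init__(self, max_nodes=200):
        self.max_nodes = max_nodes
        self.nodes = OrderedDict()  # node_id -> {"type": ...}, least recently used first
        self.edges = set()          # (src, relation, dst) triples
        self.version = 0
        self.changelog = []         # (version, note) pairs

    def _bump(self, note):
        self.version += 1
        self.changelog.append((self.version, note))

    def touch(self, node_id):
        """Mark a node as recently used so it is pruned last."""
        self.nodes.move_to_end(node_id)

    def add_node(self, node_id, node_type):
        if node_id in self.nodes:
            self.touch(node_id)
            return
        self.nodes[node_id] = {"type": node_type}
        self._bump(f"add {node_type} {node_id}")
        self._prune()

    def add_edge(self, src, relation, dst):
        self.edges.add((src, relation, dst))
        self._bump(f"edge {src} {relation} {dst}")

    def _prune(self):
        # Evict the least recently used IOC nodes until the cap is respected.
        while len(self.nodes) > self.max_nodes:
            victim = next((n for n, d in self.nodes.items() if d["type"] == "ioc"), None)
            if victim is None:
                break  # nothing prunable; hosts and processes are never evicted here
            del self.nodes[victim]
            self.edges = {e for e in self.edges if victim not in (e[0], e[2])}
            self._bump(f"prune ioc {victim}")

    def delta_since(self, version):
        """Return only the changes made after the given version."""
        return [note for v, note in self.changelog if v > version]
```

A delta query such as `graph.delta_since(last_seen_version)` lets the prompt carry only what changed since the previous step instead of the whole graph.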

4.7 SOAR Playbooks: soar_playbooks.py

5 automated response playbooks with prerequisite validation:

| Playbook | Prerequisites | Sub-Actions |
|---|---|---|
| ransomware_containment | Forensics run, process identified | kill_process, block_ioc |
| c2_disruption | IOC enriched, C2 IP identified | block_ioc, isolate_segment |
| lateral_movement_lockdown | Forensics run, lateral movement detected | kill_process, isolate_segment |
| phishing_response | Phishing vector confirmed | enrich_ioc, block_ioc |
| data_exfil_stop | Forensics run, exfil destination identified | block_ioc, kill_process |
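
Prerequisite gating for these playbooks can be sketched as a set-difference check. The playbook names and sub-actions come from the table; the prerequisite flag names (`forensics_run`, `c2_ip_identified`, and so on), the `PrerequisiteError` class, and the `trigger_playbook` signature are assumptions made for illustration.

```python
class PrerequisiteError(Exception):
    """Raised when a playbook is triggered before its prerequisites are met."""


# Flag names are hypothetical encodings of the prerequisites in the table.
PLAYBOOKS = {
    "ransomware_containment": {
        "requires": {"forensics_run", "process_identified"},
        "sub_actions": ["kill_process", "block_ioc"],
    },
    "c2_disruption": {
        "requires": {"ioc_enriched", "c2_ip_identified"},
        "sub_actions": ["block_ioc", "isolate_segment"],
    },
    "lateral_movement_lockdown": {
        "requires": {"forensics_run", "lateral_movement_detected"},
        "sub_actions": ["kill_process", "isolate_segment"],
    },
    "phishing_response": {
        "requires": {"phishing_vector_confirmed"},
        "sub_actions": ["enrich_ioc", "block_ioc"],
    },
    "data_exfil_stop": {
        "requires": {"forensics_run", "exfil_destination_identified"},
        "sub_actions": ["block_ioc", "kill_process"],
    },
}


def trigger_playbook(name: str, established_facts: set) -> list:
    """Return the sub-actions to run, or raise if any prerequisite is missing."""
    playbook = PLAYBOOKS.get(name)
    if playbook is None:
        raise KeyError(f"unknown playbook: {name}")
    missing = playbook["requires"] - established_facts
    if missing:
        raise PrerequisiteError(f"{name} missing: {sorted(missing)}")
    return list(playbook["sub_actions"])
```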

4.8 Action Validation: action_validation.py

3-gate middleware:

  1. Phase whitelist: Actions restricted by phase (triage/investigation/remediation/report)
  2. Schema validation: Required arguments checked
  3. Graph groundedness: Actions must reference discovered entities (can't block an IOC you haven't seen)
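
A minimal sketch of the three gates, applied in the order listed, is shown below. The phase-to-action mapping, the required-argument table, the `target` argument name, and the reason codes are all hypothetical; the actual rules are defined in action_validation.py.

```python
# Hypothetical phase whitelist; the real mapping may differ.
PHASE_WHITELIST = {
    "triage": {"query_host", "correlate_alerts", "enrich_ioc"},
    "investigation": {"query_host", "run_forensics", "enrich_ioc",
                      "scan_host_vulnerabilities", "correlate_alerts"},
    "remediation": {"kill_process", "block_ioc", "isolate_segment", "trigger_playbook"},
    "report": {"submit_containment_plan"},
}

# Hypothetical required arguments per action type.
REQUIRED_ARGS = {
    "query_host": ("target",),
    "kill_process": ("target",),
    "block_ioc": ("target",),
}

# Actions whose target must already exist in the threat graph.
GROUNDED_ACTIONS = {"kill_process", "block_ioc"}


def validate(action: dict, phase: str, discovered: set) -> tuple:
    """Run the three gates in order and return (ok, reason_code)."""
    kind = action.get("type")
    # Gate 1: phase whitelist
    if kind not in PHASE_WHITELIST.get(phase, set()):
        return False, "PHASE_VIOLATION"
    # Gate 2: schema validation
    for arg in REQUIRED_ARGS.get(kind, ()):
        if not action.get(arg):
            return False, "SCHEMA_ERROR"
    # Gate 3: graph groundedness
    if kind in GROUNDED_ACTIONS and action["target"] not in discovered:
        return False, "UNGROUNDED"
    return True, "OK"
```

Running the cheap checks first means an out-of-phase action is rejected before the graph is consulted at all.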

4.9 Tool Router: tool_router.py

Deterministic phase state machine:

  • Phases: triage → investigation → remediation → report → done
  • Loop limits: max 4 investigation loops, 3 remediation loops
  • Supports pushback: the agent can justify staying in a phase with graph references

Triage Solver: Priority = severity_weight × criticality_weight × (1 + blast_radius/10)
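
The triage formula translates directly into code. In this sketch the numeric severity weights and the alert fields are assumed values chosen for illustration; only the formula itself comes from the text.

```python
# Assumed severity weights; the real values are defined in tool_router.py.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}


def triage_priority(severity: str, criticality_weight: float, blast_radius: int) -> float:
    """Priority = severity_weight x criticality_weight x (1 + blast_radius / 10)."""
    return SEVERITY_WEIGHT[severity] * criticality_weight * (1 + blast_radius / 10)


def rank_alerts(alerts: list) -> list:
    """Return alerts ordered from highest to lowest triage priority."""
    return sorted(
        alerts,
        key=lambda a: triage_priority(a["severity"], a["criticality"], a["blast_radius"]),
        reverse=True,
    )
```

For example, a critical alert on a host with criticality 0.5 and a blast radius of 10 scores 4.0 × 0.5 × 2 = 4.0 under these assumed weights.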

4.10 Episode Sandbox: episode_sandbox.py

Safety guardrails:

  • 120-second wall-clock timeout per episode
  • 20-step hard limit per episode
  • State integrity protection: Protected fields (_task_def, _live_requirements, _threat_graph) are snapshot-hashed; mutations are detected and rolled back
  • Hacking detection: Reports any external state tampering
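
One way to realize these guardrails is sketched below: limits are checked on every step, and protected state is compared against a hash taken at reset. `MiniSandbox` and its method names are hypothetical, and the sketch assumes the protected state is JSON-serializable, which the real objects may not be.

```python
import copy
import hashlib
import json
import time


class EpisodeTimeout(Exception):
    """Raised when an episode exceeds its step or wall-clock budget."""


class MiniSandbox:
    def __init__(self, protected: dict, max_steps: int = 20, max_seconds: float = 120.0):
        self._snapshot = copy.deepcopy(protected)   # pristine copy used for rollback
        self._digest = self._hash(protected)        # fingerprint taken at reset
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.steps = 0
        self.started = time.monotonic()
        self.tamper_events = 0

    @staticmethod
    def _hash(state: dict) -> str:
        payload = json.dumps(state, sort_keys=True, default=str).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def check_limits(self) -> None:
        """Call once per step; raises when either budget is exhausted."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise EpisodeTimeout("step limit exceeded")
        if time.monotonic() - self.started > self.max_seconds:
            raise EpisodeTimeout("wall-clock limit exceeded")

    def verify(self, protected: dict) -> dict:
        """Return the state unchanged, or a restored copy if it was tampered with."""
        if self._hash(protected) != self._digest:
            self.tamper_events += 1
            return copy.deepcopy(self._snapshot)
        return protected
```

Using `time.monotonic()` rather than `time.time()` keeps the wall-clock guard immune to system clock adjustments.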

4.11 Adaptive Red Team

Two mechanisms in play_environment.py:

  1. Reinfection (_maybe_reinfect): When a process is killed while IOCs in its chain are still unblocked, there is a 30% chance that a process_v2 variant spawns and a CRITICAL alert is raised
  2. Lateral Pivot (_execute_lateral_pivot): Triggered by isolate/kill actions on hard tasks. It copies malware to an adjacent healthy host, adds a pivoted_from edge, emits a PIVOT alert, and updates the live requirements

Escalation: Probability increases when agent is slow (step > 10 with 0 containments).
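
The reinfection rule can be illustrated with a small function driven by a seeded RNG. This is a sketch under assumptions: the function signature, the argument names, and the way the variant name is formed are hypothetical, and only the 30% chance and the unblocked-IOC condition come from the description above.

```python
import random
from typing import Optional


def maybe_reinfect(rng: random.Random, process: str, chain_iocs, blocked_iocs,
                   base_prob: float = 0.30) -> Optional[str]:
    """Return the name of a respawned variant, or None if no reinfection occurs."""
    # Reinfection is only possible while at least one IOC in the chain is unblocked.
    if set(chain_iocs) <= set(blocked_iocs):
        return None
    # The draw comes from the episode-scoped RNG, so rollouts stay reproducible.
    if rng.random() < base_prob:
        return f"{process}_v2"
    return None
```

Blocking every IOC in the chain before killing the process makes the outcome certain, which is the causal lesson the mechanism is meant to teach.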

4.12 Server Application: app.py

FastAPI app created via OpenEnv's create_app():

  • POST /reset: Reset environment with task_id
  • POST /step: Execute an action
  • GET /state: Get current state
  • WS /ws: WebSocket for persistent sessions
  • CORS enabled for dashboard communication
  • Supports 4 concurrent environment instances

5. Frontend (Dashboard)

5.1 Overview

A real-time "CyberSOC Command Center" web dashboard with 6 panels, built with vanilla HTML/CSS/JS + D3.js + Chart.js.

5.2 Six Dashboard Panels

  1. Alert Queue: Live SIEM/EDR alerts with severity badges and IOC indicators
  2. Live Threat Graph: D3.js force-directed graph with 5 node types, drag/zoom, glow effects, pivot animation
  3. Agent Actions: Chronological action log with reward tracking
  4. Network Topology: Visual subnet map with compromised/isolated counts
  5. Performance Metrics: Chart.js radar chart (10 dimensions) + cumulative reward timeline
  6. Mission Status: Containment progress bars, business impact gauge, active threat list, episode controls, 🔴 Red Team Toolkit (hotseat multiplayer)

5.3 Visual Design

  • Dark theme with glassmorphism panels
  • Typography: Inter (UI) + JetBrains Mono (data)
  • Color system: Accent colors for cyan, green, amber, red, purple
  • Animations: Count-up numbers, scale bounces, pulse glows, screen flashes
  • Red Team pivot: Screen border flash, toast notification, traveling dot animation on pivot edges

5.4 Key Frontend Components

graphs.js (881 lines):

  • ClientThreatGraph: Client-side graph state manager synced from observations
  • ThreatGraphViz: D3.js v7 force simulation with SVG glow filters, curved edges, node symbols (circle/diamond/triangle/square/wye), click-to-highlight, drag behavior
  • RadarChart: Chart.js 10-axis radar for live grading dimensions
  • RewardTimeline: Gradient-filled cumulative reward line chart

app.js (45KB): Main orchestrator handling episode lifecycle, API calls, UI updates, phase indicator tracking

api.js: REST client with auto-detection of server origin, session management, response parsing

animations.js: Utility library for count-up, screen flash, toast notifications, scale bounce, pulse glow, dramatic final score reveal

5.5 Dashboard Server: dashboard_server.py

Wraps the FastAPI app to also serve the dashboard as static files at /dashboard/. Prints a styled ASCII banner on startup.


6. Inference & Training

6.1 Inference Script: inference.py

LLM baseline agent using OpenAI-compatible API:

  • System prompt defines SOC analyst role with all 6 core actions
  • Formats observations into structured text for the LLM
  • Parses JSON actions from LLM responses (with fallback extraction)
  • Runs episodes across easy/medium/hard tasks
  • Emits structured stdout logs: [START], [STEP], [END] (hackathon requirement)
  • Default model: Qwen/Qwen2.5-72B-Instruct via HuggingFace Router

6.2 GRPO Reward Functions: training/reward_funcs.py

10 TRL-compatible reward functions for Group Relative Policy Optimization:

from training.reward_funcs import make_soc_reward_funcs
reward_fns = make_soc_reward_funcs("http://localhost:8000")
trainer = GRPOTrainer(model=model, reward_funcs=reward_fns, args=GRPOConfig(...))

Each function:

  1. Parses completion as JSON action list
  2. Replays actions against live environment server
  3. Returns the specific dimension's score from grade_breakdown
  4. Non-parseable completions return 0.0
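
The four steps above can be sketched as a factory that produces one reward function per dimension. This is not the code in training/reward_funcs.py: the `replay` callable stands in for the round trip to the environment server, `make_dimension_reward` is a hypothetical name, and the sketch assumes completions arrive as plain strings and that the reward function is called with a `completions` argument plus extra keyword arguments.

```python
import json
from typing import Callable


def make_dimension_reward(dimension: str, replay: Callable[[list], dict]) -> Callable:
    """Build a reward function that scores completions on a single grading dimension.

    `replay` is a stand-in for replaying actions against the environment server;
    it takes an action list and returns a grade breakdown dict.
    """

    def reward_func(completions, **kwargs) -> list:
        scores = []
        for completion in completions:
            try:
                actions = json.loads(completion)
                if not isinstance(actions, list):
                    raise ValueError("completion is not a JSON action list")
            except (ValueError, TypeError):
                scores.append(0.0)  # non-parseable completions score zero
                continue
            breakdown = replay(actions)
            scores.append(float(breakdown.get(dimension, 0.0)))
        return scores

    reward_func.__name__ = f"reward_{dimension}"
    return reward_func
```

Injecting `replay` keeps the parsing logic testable offline; a real adapter would replace it with calls to the server's reset and step endpoints.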

6.3 Per-Step Reward Dimensions

The environment computes live partial scores every step (_compute_reward_dimensions) for GRPO credit assignment without waiting for the terminal grade. These are exposed in SOCObservation.reward_dimensions.


7. Testing

11 test files covering all major components:

| File | Focus |
|---|---|
| test_integration.py | Full episode flows, phase violations, adaptive pivots, 10-dim grading, sandbox limits |
| test_task1.py - test_task9.py | Individual task-specific validations |

Key integration tests:

  • Easy/medium episodes complete without crashes
  • All 10 action types can be exercised in a single episode
  • Phase violations return negative reward (not crash)
  • Adaptive pivot fires on hard tasks
  • Step rewards accumulate correctly and are idempotent
  • Grader returns exactly 10 dimensions
  • Sandbox step limit raises EpisodeTimeout

8. Deployment & DevOps

Docker

  • Root Dockerfile: Slim Python 3.10, serves on port 7860 (HuggingFace Spaces)
  • Server Dockerfile: Multi-stage build from ghcr.io/meta-pytorch/openenv-base, uses uv for dependency management, health check on /health

Validation

validate_submission.sh is a 3-step validator:

  1. Ping HF Space /reset endpoint
  2. Docker build succeeds
  3. openenv validate passes

OpenEnv Manifest

openenv.yaml contains 1,003 task definitions with descriptions, max steps, and difficulty tags. It is used by the OpenEnv framework for task discovery and benchmarking.


9. Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| API_BASE_URL | LLM API endpoint | https://router.huggingface.co/v1 |
| MODEL_NAME | Model identifier | Qwen/Qwen2.5-72B-Instruct |
| HF_TOKEN | HuggingFace API key | (none) |

10. Data Flow

sequenceDiagram
    participant Agent as LLM Agent
    participant Inf as inference.py
    participant Env as CyberSOCEnvironment
    participant TG as ThreatGraph
    participant Gr as Grader

    Inf->>Env: reset(task_id="hard")
    Env->>TG: populate from task_def
    Env-->>Inf: SOCObservation (alerts, topology)
    
    loop Each Step
        Inf->>Agent: format_observation → LLM prompt
        Agent-->>Inf: JSON action
        Inf->>Env: step(SOCActionWrapper)
        Env->>Env: ActionMiddleware.validate()
        Env->>Env: Handle action (query/forensics/kill/etc)
        Env->>TG: Update graph nodes/edges
        Env->>Env: _adversary_react() (adaptive pivot)
        Env->>Env: _compute_reward_dimensions()
        Env-->>Inf: SOCObservation (updated state)
    end
    
    Inf->>Env: step(submit_containment_plan)
    Env->>Gr: grade_episode(actions, plan, graph, task_def, state)
    Gr-->>Env: {final_score, breakdown[10], penalties, bonuses}
    Env-->>Inf: SOCObservation (done=true, final_score)

11. Red Team Design Philosophy

The Red Team is NOT a separate LLM agent. It is a deterministic adversarial dynamics engine that defines the environment's state transition function.

7 Behavioral Mechanisms

  1. Reactive Pivoting: Triggers on isolate_segment and kill_process (copy-not-move spread)
  2. Persistence: Reinfection triggers when a process is killed but its root IOC remains unblocked (teaches causal reasoning)
  3. Time Pressure: Pivot probability escalates +0.2 after step 10 if zero containments are achieved
  4. Controlled Randomness: Uses an episode-scoped self._rng (seeded by task_id) to ensure deterministic rollouts
  5. Noisy Observations: Benign processes mixed in host data
  6. Escalation: Pivot probabilities scale with difficulty (Easy: 0.0, Medium: 0.3, Hard: 0.8)
  7. Stall Punishment (new): If Blue makes 3 or more consecutive passive actions (query_host / pass_turn) without containment, Red immediately deploys ransomware. In addition, Red has a 15% chance to spread laterally even on passive Blue turns
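
Two of these mechanisms, time-pressure escalation and stall punishment, are simple enough to sketch. The base probabilities, the +0.2 escalation after step 10, and the 3-action threshold come from the list; the names `pivot_probability` and `StallTracker`, the cap at 1.0, and the assumption that escalation never lifts a tier whose base probability is 0.0 are illustrative choices.

```python
BASE_PIVOT_PROB = {"easy": 0.0, "medium": 0.3, "hard": 0.8}
PASSIVE_ACTIONS = {"query_host", "pass_turn"}


def pivot_probability(difficulty: str, step: int, containments: int) -> float:
    """Base probability by tier, plus 0.2 after step 10 when nothing is contained."""
    prob = BASE_PIVOT_PROB[difficulty]
    # Assumption: tiers with a base probability of 0.0 never pivot.
    if prob > 0.0 and step > 10 and containments == 0:
        prob += 0.2
    return min(prob, 1.0)


class StallTracker:
    """Counts consecutive passive Blue actions; any other action resets the streak."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streak = 0

    def observe(self, action_type: str) -> bool:
        """Record one Blue action and return True when Red should punish the stall."""
        if action_type in PASSIVE_ACTIONS:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.threshold
```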

Attack Lifecycle Model (MITRE-aligned)

Phase 1: Compromise β†’ Phase 2: Lateral Movement β†’ Phase 3: Persistence β†’ Phase 4: Escalation β†’ Phase 5: Impact


12. Reward-Hacking Defense Map

Per guide §8, we implemented specific defenses against the known RL exploit vectors:

| Guide Attack Vector | Our Defense |
|---|---|
| Editing timers | EpisodeSandbox wall-clock enforcement |
| Caching results | Idempotent step rewards via _fired_step_rewards |
| Abusing globals | Instance-scoped RNG + episode-scoped self._rng |
| Mutating protected state | Sandbox hash-snapshot + rollback |
| Exploiting env bugs | 3-gate validation middleware |
| Reward-function gaming | Evidence confidence normalization |
| Cheating via blind remediation | Graph-groundedness gate |
| Blind IOC blocking | Enrichment-before-block penalty |
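
As an example of one defense from this map, idempotent step rewards can be sketched with a set of already-fired keys. The attribute name `_fired_step_rewards` is taken from the table; the `StepRewardLedger` class, the key format, and the reading of the 0.40 cap as a limit on cumulative positive step reward are assumptions.

```python
class StepRewardLedger:
    """Pays each (action, target) reward at most once, under a cumulative cap."""

    def __init__(self, cap: float = 0.40):
        self.cap = cap                      # assumed cap on total positive step reward
        self.positive_total = 0.0
        self._fired_step_rewards = set()

    def award(self, key: tuple, amount: float) -> float:
        """Return the reward actually granted for this key."""
        if key in self._fired_step_rewards:
            return 0.0                      # repeating an action earns nothing
        self._fired_step_rewards.add(key)
        if amount <= 0:
            return amount                   # penalties pass through uncapped
        granted = min(amount, max(self.cap - self.positive_total, 0.0))
        self.positive_total += granted
        return granted
```

An agent that repeats `run_forensics` on the same host to farm reward gets paid for the first call only.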

13. Curriculum Learning Strategy

The 1000+ deterministic scenarios generated by task_generator.py are explicitly divided into three difficulty tiers to support Curriculum Learning (Guide §6).

This exists precisely to satisfy Daniel's Law of RL: The base model must get non-zero reward on Easy before it can meaningfully learn Hard.

  • Phase 1 (Warm-Start): gen_0001–gen_0333 (Easy). Single threat, 15 max steps, 0.0 pivot probability.
  • Phase 2 (Scaling): gen_0334–gen_0666 (Medium). Multi-stage, 25 max steps, 0.3 pivot probability.
  • Phase 3 (Stress-Test): gen_0667–gen_1000 (Hard). APT, 30 max steps, 0.8 pivot probability.

The pivot probability is itself a curriculum signal: it rises from 0.0 to 0.8 across the tiers, so the environment gets harder as the agent advances through them.
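
The tier boundaries above map cleanly onto a lookup by task number. The sketch below assumes task IDs follow the `gen_NNNN` pattern shown in the list; `tier_for` and the returned dictionary layout are hypothetical.

```python
import re

# (first id, last id, tier name, max steps, pivot probability), from the phase list.
TIERS = [
    (1, 333, "easy", 15, 0.0),
    (334, 666, "medium", 25, 0.3),
    (667, 1000, "hard", 30, 0.8),
]


def tier_for(task_id: str) -> dict:
    """Map a generated task ID such as 'gen_0334' to its curriculum tier settings."""
    match = re.fullmatch(r"gen_(\d{4})", task_id)
    if match is None:
        raise ValueError(f"not a generated task id: {task_id!r}")
    number = int(match.group(1))
    for first, last, name, max_steps, pivot_prob in TIERS:
        if first <= number <= last:
            return {"tier": name, "max_steps": max_steps, "pivot_prob": pivot_prob}
    raise ValueError(f"task number out of range: {number}")
```

A training loop could use this to schedule Easy tasks first and only admit Medium and Hard IDs once the mean Easy reward is reliably above zero.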


14. Intended Training Stack

CyberSOCEnv is designed for the canonical stack specified in Guide §10:

  1. Unsloth: 4-bit QLoRA loading and efficient inference
  2. TRL: GRPOTrainer consuming our 10 independent callable functions via reward_funcs=[...]
  3. OpenEnv: WebSocket transport and session isolation
  4. vLLM: Serving the rollout workers for maximum throughput

A reference adapter module exists at training/reward_funcs.py that mirrors the Unsloth 2048 notebook structure 1:1, allowing plug-and-play GRPO training.


15. Anti-Patterns Avoided

How we avoided the 7 common mistakes listed in Guide §21:

  1. Building before designing env: Env, types, and sandbox were built and tested completely offline before any trainer was attached.
  2. LLM-as-a-judge: CyberSOCEnv uses zero LLM-as-judge signals. Everything is deterministic code against the ThreatGraph.
  3. Single monolithic reward: We use a 10-dimensional verifiable rubric, fed independently into TRL.
  4. Ignoring inference latency: We implemented Graph Delta Injection (10x fewer tokens) and a sparse-node generation strategy (75 active nodes) specifically to optimize GRPO rollout latency.
  5. No abuse prevention: 3-gate middleware + EpisodeSandbox explicitly prevent out-of-band cheating.
  6. Delayed deployment: Environment was packaged with Docker and deployed to HF Spaces early.
  7. Scaling prematurely: All 9 components passed integration testing (test_integration.py through test_task9.py) before scaling to 1000 tasks.

16. Known Deviations & Alignment Items

While we strive to match the OpenEnv canonical scaffolding (Guide §5), there are a few intentional architectural differences:

  1. Action Dispatch: We use a discriminated union wrapper (SOCActionWrapper with a type field) rather than a single flat action class. This matches the MCP ToolCall pattern and real SOC work better than a flat action space.
  2. Decoupled Engine: The core logic lives in server/play_environment.py, completely separate from the FastAPI transport layer in server/app.py. This ensures we can run headless parallel environments during GRPO without HTTP overhead if needed.

17. Team Structure & Role Split

Per guide §17, responsibilities are split across three functional roles to execute the RL pipeline effectively.

Role 1: Environment Engineer

Mission: Build a deterministic, unhackable, fast environment.
Owns: play_environment.py, tasks.py, threat_graph.py, episode_sandbox.py, action_validation.py, tests/

  • Scope: Implements the state machine, Red Team behavior, and validates actions. Owns the core step() and reset() loops. Ensures the environment parses valid inputs and securely handles invalid ones.
  • Hackathon Focus: Bug fixes, latency optimization (graph deltas), sandbox integrity, and procedural scenario generation.

Role 2: Reward Engineer

Mission: Design the mathematical signals that shape model behavior.
Owns: graders.py, training/reward_funcs.py, models.py

  • Scope: Creates the 10-dimensional verifiable grading logic. Plumbs the environment outputs into TRL-compatible reward_funcs. Tunes the penalties to prevent reward hacking (e.g., punishing blind IOC blocking).
  • Hackathon Focus: Ensuring the model gets positive step-rewards early on to prevent it from collapsing, while preventing it from finding "lazy" exploits.

Role 3: Training Engineer

Mission: Execute the GRPO curriculum and produce the final model.
Owns: training/ directory, Colab notebooks, inference.py

  • Scope: Sets up the actual training loops using Unsloth and TRL. Manages the hyperparameter tuning, LoRA checkpointing, and vLLM inference configuration. Runs the curriculum from Easy to Hard.
  • Hackathon Focus: Capturing the before/after learning curves on held-out tasks to prove to the judges that the environment actually works to train a model.

18. Key Innovations Summary

| Innovation | Description |
|---|---|
| Procedural Generation | SHA-256 seeded RNG generates 1000+ unique deterministic scenarios |
| ThreatGraph | Typed knowledge graph with version tracking, evidence confidence, and LRU pruning |
| 10-Dim Grading | Weighted multi-dimensional scoring replacing binary pass/fail |
| Adaptive Red Team | Attacker reacts to defender actions with lateral pivots and reinfection |
| SOAR Playbooks | Prerequisite-gated automated response workflows |
| 3-Gate Validation | Phase whitelist + schema + graph-groundedness prevents invalid actions |
| Episode Sandbox | State integrity protection with hash-based tampering detection |
| Live GRPO Signals | Per-step reward dimensions for RL credit assignment |
| Anti-Gaming | Blind-blocking penalties, over-isolation cap, idempotent step rewards (0.40 cap) |
| Real-time Dashboard | D3.js threat graph with pivot animations and 10-dim radar chart |
| Hotseat Multiplayer | Human Red Team player via in-dashboard toolkit; FSP backend; per-turn UI lock |
| Stall Punishment | 3 consecutive passive Blue actions trigger ransomware deploy + 15% autonomous pivot |
| Emergency Gate | isolate_segment in triage requires a critical alert; UNJUSTIFIED_EMERGENCY → -0.15 penalty |
| External Intel Feed | task_def.external_intel_feed IOCs injected at reset; immediately blockable/enrichable |
| Doomsday Clock | state.business_impact × 0.30 applied as a direct score modifier; 90% negligence crush when no threats contained |

19. Technology Stack

| Layer | Technologies |
|---|---|
| Backend | Python 3.10+, FastAPI, Uvicorn, Pydantic v2, OpenEnv Core |
| Frontend | Vanilla HTML/CSS/JS, D3.js v7, Chart.js v4, Inter/JetBrains Mono fonts |
| Inference | OpenAI Python SDK, asyncio |
| Training | TRL (Hugging Face), GRPO |
| DevOps | Docker (multi-stage), uv package manager, pytest |
| Deployment | HuggingFace Spaces (Docker SDK) |
| Visualization | NetworkX + Matplotlib (server-side PNG), D3.js (client-side interactive) |