---
title: "Janus (AdaptShield): Adaptive Incident Response Under Polymorphic Adversaries"
emoji: 🛡️
colorFrom: blue
colorTo: red
sdk: docker
pinned: false
license: mit
tags:
- openenv
- security
- reinforcement-learning
- cybersecurity
short_description: Two-phase adaptive cybersecurity benchmark for LLMs
---

# Janus (AdaptShield): Adaptive Incident Response Under Polymorphic Adversaries

**AdaptShield** is the environment: a two-phase agentic cybersecurity simulator where an LLM defends a 4-node enterprise network against an adversary that shifts strategy mid-episode. **Janus** is the model we trained on it: a Qwen2.5-1.5B LoRA, supervised then refined with GRPO. On the hardest task Janus scores 0.90 on a held-out world family it never saw during training; a tool-aware heuristic baseline scores 0.18 on the same task.

The skill being tested is narrow on purpose. Not threat classification. Not generic tool calling. The benchmark targets one thing: real-time adaptation when the attacker's playbook changes mid-incident. Section [Why this matters](#why-this-matters) explains why we think that's the gap, and the [Results](#results) section is where the gap closes.

## Project Links

- **HF Space (live env):** [`SaiManish123/adaptshield`](https://huggingface.co/spaces/SaiManish123/adaptshield)
- **Colab notebook (SFT + GRPO reproducer, free T4):** [`Project_Janus(AdaptShield)_Final.ipynb`](https://drive.google.com/file/d/1uI9BaQTsn8YXOAlCtQCr_0N6ixLbqlba/view?usp=sharing)
- **Artifacts / model repo:** [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus)
- **Demo video:** [`youtu.be/upX9a5zXHBM`](https://youtu.be/upX9a5zXHBM)

---

## Why this matters

Most cyber-agent demos test threat classification or generic tool calling. Real production breaches don't look like that.
They look like this:

In April 2026 attackers compromised Context.ai, used its OAuth integration into a Vercel employee's Google Workspace, and pivoted from shadow AI through identity into Vercel's internal systems, where they enumerated and decrypted customer environment variables. The same week, a Broken Object Level Authorization flaw in Lovable.dev let any free-tier account read source code, Supabase credentials, Stripe keys, and AI chat histories from other tenants, including projects built by AI itself. Eight months earlier, the Tea dating app left a Firebase bucket open and 72,000 verification selfies and driver's licenses of women on a safety app were scraped to 4chan within hours.

Three different failure modes, one underlying problem for the defender's agent: identity hijack via shadow AI, broken authorization in vibe-coded apps, and classic cloud misconfig. The environment is shifting faster than any static training distribution can keep up with, and the real attacker does not sit still while you classify them. Real campaigns drift through the kill chain (initial access, lateral movement, exfiltration) and the defender's job is to re-classify, contain, and eradicate as the picture changes.

Static SOAR playbooks keyed to fixed indicators of compromise fail the moment the adversary rotates them; that is what an attacker TTP shift looks like in production, and it is the regime where dwell time blows out and Tier-1 triage starts dropping signal.

AdaptShield is built around that pressure. The environment forces the agent to act on partial evidence, hand judgment across two roles with an information bottleneck between them, trade security correctness against operational blast radius, and re-plan when the attacker pivots mid-incident. Each of those is a separate failure mode in production SOC tooling, and the benchmark scores all four at once.


---

## Results

Numbers below come from the production run on Hugging Face L4 Jobs, training Qwen2.5-1.5B-Instruct with a LoRA adapter. Eval is 50 deterministic seeds per task, evaluated on a held-out world family the policy never saw during training.

![AdaptShield held-out benchmark: tool-aware baseline vs SFT vs GRPO](assets/headline_results.png)

On the hard task (`polymorphic-zero-day`) the tool-aware heuristic baseline scores 0.18 and Janus holds 0.90 on the held-out family. On the easier tasks the lift is smaller because the rule baseline is already near the ceiling; the benchmark is shaped so adaptation only matters where it should.

### Benchmark comparison (full table)

| Task | No-tool baseline | Tool-aware baseline | SFT (train family) | SFT (held-out) | GRPO (train) | GRPO (held-out) |
|------|-----------------:|-------------------:|-------------------:|---------------:|-------------:|----------------:|
| `direct-triage` | 0.860 | 0.990 | 0.990 | 0.990 | 0.990 | 0.990 |
| `dual-pivot` | 0.650 | 0.640 | 0.825 | 0.825 | 0.825 | 0.825 |
| `polymorphic-zero-day` | 0.380 | 0.180 | 0.960 | 0.930 | **0.883** | **0.902** |

Two things in this table are worth flagging.

The tool-aware baseline scores 0.18 on the hard task, worse than the no-tool baseline at 0.38. That is not a bug in the baseline; it is that bolting tools onto a heuristic without learning when to trust them makes the agent over-trigger on injected false positives. You see the same pattern in production with rule-based SOAR playbooks against adaptive adversaries.

Held-out GRPO (0.902) actually edges out train-family GRPO (0.883). That is evidence the policy is generalizing across world templates rather than memorizing them. Without splitting the eval by world family this finding would not be visible. Same-seed evaluation would have credited the model for memorization it did not do.
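
To make the train-versus-held-out comparison concrete, the sketch below recomputes the gaps from the benchmark table. The numbers are copied from the table; the dictionary layout and the helper names `held_out_gap` and `lift_over_baseline` are illustrative and are not part of the repository.

```python
# Scores copied from the benchmark table: (train family, held-out family).
# The data layout and helper names are illustrative, not repo identifiers.
SCORES = {
    "direct-triage": {"sft": (0.990, 0.990), "grpo": (0.990, 0.990)},
    "dual-pivot": {"sft": (0.825, 0.825), "grpo": (0.825, 0.825)},
    "polymorphic-zero-day": {"sft": (0.960, 0.930), "grpo": (0.883, 0.902)},
}
TOOL_AWARE_BASELINE = {
    "direct-triage": 0.990,
    "dual-pivot": 0.640,
    "polymorphic-zero-day": 0.180,
}


def held_out_gap(task: str, stage: str) -> float:
    """Held-out score minus train-family score.

    A non-negative value means the score did not drop on unseen world templates.
    """
    train, held_out = SCORES[task][stage]
    return round(held_out - train, 3)


def lift_over_baseline(task: str, stage: str) -> float:
    """Held-out score minus the tool-aware heuristic baseline on the same task."""
    _, held_out = SCORES[task][stage]
    return round(held_out - TOOL_AWARE_BASELINE[task], 3)


if __name__ == "__main__":
    for task in SCORES:
        for stage in ("sft", "grpo"):
            print(task, stage, held_out_gap(task, stage), lift_over_baseline(task, stage))
```

On the hard task this gives a GRPO held-out gap of +0.019 and an SFT held-out gap of -0.030, which is the comparison a same-seed evaluation would not expose.
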
### SFT: loss and held-out reward

![SFT loss curve](https://huggingface.co/SaiManish123/Janus/resolve/main/sft_worldsplit_1_5b/loss_curve.png)

![SFT learning curve: tool-aware baseline anchor, train family vs held-out family across checkpoints](https://huggingface.co/SaiManish123/Janus/resolve/main/sft_worldsplit_1_5b/reward_curve.png?v=2)

### GRPO: refinement on the polymorphic adversary

![GRPO reward curve, polymorphic-zero-day](https://huggingface.co/SaiManish123/Janus/resolve/main/grpo_polymorphic_zero_day_1_5b/reward_curve.png)

### Training runs

Three production runs on Hugging Face Jobs produced the artifacts in this README. Stdout logs are public and the per-step / per-episode metrics files are next to the adapters.

| Run | Trainer | GPU | Steps / Episodes | Train wall-clock | Logs | Metrics |
|-----|---------|-----|------------------|------------------|------|---------|
| [`sft_worldsplit_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/sft_worldsplit_1_5b) | SFT (LoRA) | L4 ×1 | 378 steps | 9m 49s | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/sft_worldsplit_1_5b.log) | [trainer_state](https://huggingface.co/SaiManish123/Janus/blob/main/sft_worldsplit_1_5b/checkpoint-378/trainer_state.json) |
| [`grpo_worldsplit_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/grpo_worldsplit_1_5b) | GRPO, mixed curriculum | L4 ×1 | 1,628 episodes | 1h 26m | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/grpo_worldsplit_1_5b.log) | [per-episode](https://huggingface.co/SaiManish123/Janus/blob/main/grpo_worldsplit_1_5b/metrics.json) |
| [`grpo_polymorphic_zero_day_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/grpo_polymorphic_zero_day_1_5b) | GRPO, hard-task focus | L4 ×1 | 4,357 episodes | 3h 17m | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/grpo_polymorphic_zero_day_1_5b.log) | [per-episode](https://huggingface.co/SaiManish123/Janus/blob/main/grpo_polymorphic_zero_day_1_5b/metrics.json) |

The curriculum run mixes all three tasks (weights `direct-triage: 0.3 / dual-pivot: 0.4 / polymorphic-zero-day: 0.3`). The polymorphic run trains exclusively on the hard task to push hard-task performance without distraction from saturated tiers. Per-episode reward in both runs stabilizes within the first ~500 episodes and stays there for the rest of the schedule.

---

## Architecture

![AdaptShield architecture overview](assets/architecture_overview.svg)

Each episode runs against a sampled mission profile, world-family template, and latent operational mode. The Threat Analyst investigates raw enterprise evidence through SOC tools and emits a structured handoff. The Tactical Executor sees only that handoff (not the raw state) and chooses the mitigation. The split mirrors the Tier-1-to-Tier-2 escalation in a real SOC, where the responder acts on the analyst's written triage and never re-examines the raw telemetry.

A deterministic Python grader scores security correctness, business impact, dependency blast radius, and mission alignment. There is no LLM-as-judge anywhere in the loop.

## Training Pipeline

![Janus training pipeline](assets/training_pipeline.svg)

Five steps, each reproducible from the repo:

1. Generate SFT demonstrations by rolling AdaptShield episodes with a rule-based Phase 1 expert and a tool-aware Phase 2 expert.
2. Train a LoRA adapter on Qwen2.5-1.5B (or 0.5B for the Colab reproducer) with supervised fine-tuning on those demos.
3. Evaluate on both train-family and held-out-family worlds. The split is by world template, not by seed, so memorizing a template doesn't transfer across the split.
4. Refine the SFT adapter with GRPO on a curriculum weighted toward `polymorphic-zero-day`. The deterministic grader is the reward.
5. Publish adapters, curves, metrics, and benchmark tables to [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus).

A free-tier Colab notebook reproduces steps 1-4 end-to-end on a T4 in roughly 35 minutes using Qwen2.5-0.5B and reduced episode budgets. The numbers in this README come from the 1.5B run on a Hugging Face L4 Job.

---

## Environment Description

The agent defends a 4-node enterprise network (`auth_service`, `payment_service`, `database`, `api_gateway`). Each turn has two phases:

**Phase 1 (Threat Analyst).** Agent reads SIEM metrics, can call SOC tools (log search, network telemetry, threat intel lookup), and emits a structured `Phase1Action` with threat type, target node, confidence and a recommended action.

**Phase 2 (Tactical Executor).** Agent receives only the Phase 1 assessment (blind to raw state) and emits a `Phase2Action`. The analyst has to communicate clearly because the executor cannot double-check the network.

The attacker escalates through `recon → exploit → exfiltration` if the agent fails to respond correctly. On the hard task, the attacker shifts strategy mid-episode and seeds false-positive noise that looks like a real attack but isn't, which punishes reflexive isolation. This is the alert-fatigue regime that drives most production SOC false-positive budgets.

### Observation Space

```json
{
  "phase": "1 or 2",
  "network_nodes": {
    "auth_service": {"status": "...", "request_rate": 0, "error_rate": 0.0, "cpu": 0}
  },
  "active_alerts": ["raw metric alert strings (no MITRE codes)"],
  "attack_stage": "recon | exploit | exfiltration | none",
  "history": [{"turn": "1", "p1": "classified:brute_force", "p2": "rate_limit→auth_service"}],
  "phase1_assessment": {"threat_type": "...", "confidence": 0.9, "target_node": "..."},
  "metadata": {"normalized_score": 0.72}
}
```

Phase 2 observations have empty `network_nodes` and `active_alerts`. The executor only sees the analyst's handoff.
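
The information bottleneck can be pictured as a projection of the Phase 1 observation. The sketch below is a hypothetical helper (`to_phase2_view` is not a function from the repo) and assumes observations are plain dicts with the keys shown in the observation schema. The sample metric values are made up for illustration, and the real environment builds Phase 2 observations on the server side.

```python
import copy
from typing import Any, Dict


def to_phase2_view(obs: Dict[str, Any], assessment: Dict[str, Any]) -> Dict[str, Any]:
    """Build an illustrative Phase 2 observation from a Phase 1 one.

    Assumption: observations are plain dicts shaped like the documented schema.
    """
    view = copy.deepcopy(obs)
    view["phase"] = 2                 # exact type of `phase` is an assumption
    view["network_nodes"] = {}        # executor is blind to raw node metrics
    view["active_alerts"] = []        # and to the raw alert strings
    view["phase1_assessment"] = dict(assessment)  # the only evidence it receives
    return view


# Made-up sample values, used only to exercise the helper.
phase1_obs = {
    "phase": 1,
    "network_nodes": {
        "auth_service": {"status": "degraded", "request_rate": 900, "error_rate": 0.4, "cpu": 85}
    },
    "active_alerts": ["auth_service error_rate spike"],
    "attack_stage": "recon",
    "history": [],
    "phase1_assessment": None,
    "metadata": {"normalized_score": 0.5},
}
handoff = {"threat_type": "brute_force", "confidence": 0.9, "target_node": "auth_service"}
phase2_obs = to_phase2_view(phase1_obs, handoff)
```

Because the helper works on a deep copy, the Phase 1 observation is left untouched, and everything the executor can act on has to be carried in `phase1_assessment`.
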
### Action Space

**Phase 1 (`Phase1Action`):**

```json
{"threat_type": "brute_force", "confidence": 0.9, "target_node": "auth_service", "recommended_action": "rate_limit", "reasoning": "..."}
```

**Phase 2 (`Phase2Action`):**

```json
{"action": "rate_limit", "target_node": "auth_service", "reasoning": "..."}
```

Valid actions: `rate_limit`, `isolate`, `honeypot`, `patch`, `monitor`.

### Tasks

| Task | Difficulty | Description | Rule baseline |
|------|-----------|-------------|--------------:|
| `direct-triage` | Easy | Single fixed strategy | ~0.87 |
| `dual-pivot` | Medium | Two alternating strategies | ~0.76 |
| `polymorphic-zero-day` | Hard | All four + mid-episode shift + noise | ~0.52 |

### Reward Function

| Outcome | Reward |
|---------|-------:|
| Phase 1 threat type correct | +0.15 |
| Phase 1 target node correct | +0.10 |
| Phase 2 optimal action + correct target | +0.39 |
| Phase 2 heavy-handed but effective | +0.18 |
| Phase 2 wrong action | -0.25 |
| False positive on benign event | -0.39 |
| Catastrophic: database exfiltrated | -0.49, `done=True` |

Scores are clipped to the open interval `(0.01, 0.99)`. The grader never emits exactly 0 or 1, which keeps GRPO advantages well-defined.

### Operational Impact Layer

AdaptShield also scores business impact, so the agent is rewarded for stopping the attack without ignoring operational blast radius. Each service has a criticality weight and a dependency fan-out:

| Service | Criticality | Downstream dependency risk |
|---------|------------:|----------------------------|
| `auth_service` | 0.70 | `payment_service` |
| `payment_service` | 0.90 | `api_gateway` |
| `database` | 1.00 | `payment_service`, `api_gateway` |
| `api_gateway` | 0.80 | `auth_service`, `payment_service`, `database` |

Actions have bounded disruption costs (`monitor` = none, `isolate` = highest).
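
The criticality and dependency table can be written down as plain data. The sketch below is illustrative only: the dictionary names and the `direct_fan_out` helper are hypothetical, and it simply lists which services the table marks as directly downstream of a node, together with their criticality. It does not reproduce the grader's actual impact formula, which this README does not spell out.

```python
from typing import Dict, List, Tuple

# Values copied from the table; the names are illustrative, not repo identifiers.
CRITICALITY: Dict[str, float] = {
    "auth_service": 0.70,
    "payment_service": 0.90,
    "database": 1.00,
    "api_gateway": 0.80,
}
DOWNSTREAM: Dict[str, List[str]] = {
    "auth_service": ["payment_service"],
    "payment_service": ["api_gateway"],
    "database": ["payment_service", "api_gateway"],
    "api_gateway": ["auth_service", "payment_service", "database"],
}


def direct_fan_out(service: str) -> List[Tuple[str, float]]:
    """Services listed as directly at risk if `service` is disrupted, with criticality."""
    return [(dep, CRITICALITY[dep]) for dep in DOWNSTREAM[service]]


if __name__ == "__main__":
    for name in CRITICALITY:
        print(name, CRITICALITY[name], direct_fan_out(name))
```

Read this way, disrupting `api_gateway` touches every other service in the table, which hints at why a blanket isolation policy carries a high operational cost.
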
The grader emits `business_impact`, `availability_impact`, `security_risk`, `dependency_blast_radius`, and `operational_penalty` inside `score_breakdown`. The reward adjustment is capped at `±0.05` per turn, which keeps the training signal stable while leaving the replay detailed enough to explain whether the agent stopped the attack cleanly or caused unnecessary business disruption getting there. This is the MTTR-versus-availability tradeoff every SOC actually navigates: containment that bricks `auth_service` to stop a credential-stuffing campaign also takes legitimate users offline, so "isolate everything" is not a winning playbook.

### Mission-Aware Objectives

Each task carries a mission profile, visible in observation metadata and appended to the system prompt:

| Task | Mission | Primary Asset | SLA Priority | Risk Tolerance |
|------|---------|---------------|--------------|----------------|
| `direct-triage` | `login_stability` | `auth_service` | availability | medium |
| `dual-pivot` | `checkout_continuity` | `payment_service` | availability | medium |
| `polymorphic-zero-day` | `breach_containment` | `database` | containment | low |

The grader emits `mission_alignment` and `mission_adjustment`, capped at `±0.04` per turn. This makes the agent optimize for the operational mission, not just the threat label. Availability-priority missions discourage unnecessary isolation of the primary asset; containment missions reward decisive correct containment of the crown-jewel database.

### Design choices that aren't obvious

A few decisions in the environment that look like details but matter for what the benchmark actually measures:

- **Information bottleneck between phases.** Phase 2's observation has empty `network_nodes` and `active_alerts`. The executor only sees Phase 1's structured handoff. If Phase 1 cannot communicate clearly, Phase 2 fails, and you see it in the score, not in a separate metric.
This is what makes the env actually test cross-role coordination rather than just two independent policies stitched together.
- **Train/eval split by world family, not by seed.** The world templates used for training are disjoint from the ones used for held-out evaluation. A model that overfits to a specific service-name pattern or a specific alert distribution will pass train evals and fail held-out. Same-seed evaluation would have hidden this.
- **Open scoring interval `(0.01, 0.99)`.** The grader never emits exactly 0 or 1. This keeps GRPO advantage estimates well-defined. Saturating rewards collapse the variance the algorithm needs.
- **Bounded auxiliary signals.** Operational impact is capped at `±0.05` per turn and mission alignment at `±0.04`. They steer the policy without dominating the security signal, so the training curve does not get hijacked by a single side-objective.
- **Deterministic Python grader, no LLM-as-judge.** Rewards come from strategy matching against a fixed ground-truth attacker, not from a judge model. The benchmark cannot be gamed by a more eloquent policy.
- **Phase-1 alerts are raw metric strings, not pre-tagged MITRE ATT&CK techniques.** The agent has to do the classification itself, not match a label to a label. This is what makes the heuristic baseline collapse on the hard task: rule-based classification keyed on fixed indicators of compromise does not survive the injected false-positive noise that real polymorphic adversaries use to drown Tier-1 triage.

---

## Reproduce it

### Free-tier Colab (recommended for judges)

Open the Colab notebook linked above and run top-to-bottom. It will:

- install the exact pinned dependency stack used in the HF Job
- generate SFT demos from the environment
- train an SFT LoRA on Qwen2.5-0.5B (T4-friendly)
- run GRPO refinement on top of that SFT adapter
- print the benchmark table and inline the production training curves from `SaiManish123/Janus` so you can compare scaled-down vs.
full runs

End-to-end runtime on a Colab T4 is roughly 35 minutes.

### Local setup

```bash
pip install openenv-core
git clone https://github.com/SaiManish123/adaptshield
cd adaptshield
python -m adaptshield.server.app
```

### Run inference against the live environment

```bash
export HF_TOKEN=your_token
export ADAPTSHIELD_TASK=direct-triage  # or dual-pivot / polymorphic-zero-day
export ENV_BASE_URL=http://localhost:7860
python inference.py  # run from the repo root
```

`inference.py` honors the evaluator contract: `[START]`, `[STEP]`, `[END]` stdout markers and credentials read only from environment variables.

### Smoke test

```bash
python smoke_test.py
```

Spins the env up in-process and walks one episode of each task with a deterministic policy. Should finish in <10 seconds.

### Regression tests

```bash
adaptshield/.venv/bin/python -m unittest tests.test_regression -v
```

### Baseline scores

With `ADAPTSHIELD_SEED=42`, the deterministic rule baseline produces:

| Task | Score | Steps | Status |
|------|------:|------:|--------|
| `direct-triage` | 0.870 | 10 | PASS |
| `dual-pivot` | 0.760 | 12 | PASS |
| `polymorphic-zero-day` | 0.520 | 16 | PASS |

Difficulty staircase: **PASS**.
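
One plausible reading of the difficulty staircase is that the rule baseline's score should drop strictly from the easy task to the hard one. The check below encodes that reading as an assumption. The repository layout lists `eval_tasks.py` as holding the staircase definition, and the real criterion there may differ; `is_staircase` is a hypothetical name.

```python
from typing import List, Tuple

# Scores from the baseline table (ADAPTSHIELD_SEED=42), ordered easy -> hard.
BASELINE_SCORES: List[Tuple[str, float]] = [
    ("direct-triage", 0.870),
    ("dual-pivot", 0.760),
    ("polymorphic-zero-day", 0.520),
]


def is_staircase(scores: List[Tuple[str, float]]) -> bool:
    """True if each task scores strictly lower than the easier task before it.

    Assumed criterion for illustration; not taken from the repo.
    """
    values = [score for _, score in scores]
    return all(easier > harder for easier, harder in zip(values, values[1:]))


if __name__ == "__main__":
    verdict = "PASS" if is_staircase(BASELINE_SCORES) else "FAIL"
    print("Difficulty staircase:", verdict)
```
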

---

## Repository layout

```
adaptshield/
├── server/                   # FastAPI server (OpenEnv-compatible)
├── client.py                 # OpenEnv client (no server-internal imports)
├── models.py                 # Phase1Action / Phase2Action schemas
├── soc_tools.py              # SIEM, log search, threat intel SOC tools
├── eval_tasks.py             # task definitions + difficulty staircase
├── baseline.py               # deterministic rule baseline
├── tool_baseline.py          # tool-aware heuristic baseline
├── generate_sft_data.py      # rolls episodes → SFT JSONL
├── train_sft.py              # LoRA SFT trainer (Unsloth + TRL)
├── train.py                  # GRPO trainer (Unsloth + TRL)
├── plot_training.py          # reward / loss curve plotting
├── build_benchmark_table.py  # eval matrix builder
├── inference.py              # judge-facing entry point
├── smoke_test.py             # one-shot in-process smoke test
├── tests/test_regression.py  # determinism + reward regression tests
├── openenv.yaml              # OpenEnv manifest
└── Dockerfile                # HF Space container
```

## Engineering notes

`AdaptShieldEnvironment` extends OpenEnv's `Environment` base class and follows the Gym-style API (`reset`, `step`, `state`). The client in `client.py` talks to the server only through HTTP, with no shared imports and no leaking of server internals. None of the SOC tools are named `reset`, `step`, `state`, or `close`, so they do not collide with the reserved MCP tool names.

Grading is deterministic Python; the reward signal and the benchmark scores both come from strategy matching against a fixed ground-truth attacker, never from an LLM judge.

All adapters, curves, metrics, and benchmark tables for the 1.5B run are public on [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus).

## License

MIT.