| --- |
| title: "Janus (AdaptShield): Adaptive Incident Response Under Polymorphic Adversaries" |
| emoji: 🛡️ |
| colorFrom: blue |
| colorTo: red |
| sdk: docker |
| pinned: false |
| license: mit |
| tags: |
| - openenv |
| - security |
| - reinforcement-learning |
| - cybersecurity |
| short_description: Two-phase adaptive cybersecurity benchmark for LLMs |
| --- |
| |
| # Janus (AdaptShield): Adaptive Incident Response Under Polymorphic Adversaries |
|
|
| **AdaptShield** is the environment: a two-phase agentic cybersecurity |
| simulator where an LLM defends a 4-node enterprise network against an |
| adversary that shifts strategy mid-episode. **Janus** is the model we |
| trained on it: a Qwen2.5-1.5B LoRA, supervised then refined with GRPO. |
| On the hardest task Janus scores 0.90 on a held-out world family it |
| never saw during training; a tool-aware heuristic baseline scores 0.18 |
| on the same task. |
|
|
| The skill being tested is narrow on purpose. Not threat classification. |
| Not generic tool calling. The benchmark targets one thing: real-time |
| adaptation when the attacker's playbook changes mid-incident. The |
| [Why this matters](#why-this-matters) section explains why we think |
| that's the gap, and the [Results](#results) section shows it closing. |
|
|
| ## Project Links |
|
|
| - **HF Space (live env):** [`SaiManish123/adaptshield`](https://huggingface.co/spaces/SaiManish123/adaptshield) |
| - **Colab notebook (SFT + GRPO reproducer, free T4):** [`Project_Janus(AdaptShield)_Final.ipynb`](https://drive.google.com/file/d/1uI9BaQTsn8YXOAlCtQCr_0N6ixLbqlba/view?usp=sharing) |
| - **Artifacts / model repo:** [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus) |
| - **Demo video:** [`youtu.be/upX9a5zXHBM`](https://youtu.be/upX9a5zXHBM) |
|
|
| --- |
|
|
| ## Why this matters |
|
|
| Most cyber-agent demos test threat classification or generic tool |
| calling. Real production breaches don't look like that. They look like |
| this: |
|
|
| In April 2026 attackers compromised Context.ai, used its OAuth |
| integration into a Vercel employee's Google Workspace, and pivoted from |
| shadow AI through identity into Vercel's internal systems, where they |
| enumerated and decrypted customer environment variables. The same week, |
| a Broken Object Level Authorization flaw in Lovable.dev let any |
| free-tier account read source code, Supabase credentials, Stripe keys, |
| and AI chat histories from other tenants, including projects built by |
| AI itself. Eight months earlier, the Tea dating app left a Firebase |
| bucket open and 72,000 verification selfies and driver's licenses of |
| women on a safety app were scraped to 4chan within hours. |
|
|
| Three different failure modes (identity hijack via shadow AI, broken |
| authorization in vibe-coded apps, and classic cloud misconfig) share |
| one underlying problem for the defender's agent: the environment is |
| shifting faster than any static training distribution can keep up |
| with, and the real attacker does not sit still while you classify them. |
|
|
| Real campaigns drift through the kill chain (initial access, lateral |
| movement, exfiltration) and the defender's job is to re-classify, |
| contain, and eradicate as the picture changes. Static SOAR playbooks |
| keyed to fixed indicators of compromise fail the moment the adversary |
| rotates them; that is what an attacker TTP shift looks like in |
| production, and it is the regime where dwell time blows out and Tier-1 |
| triage starts dropping signal. |
|
|
| AdaptShield is built around that pressure. The environment forces the |
| agent to act on partial evidence, hand judgment across two roles with |
| an information bottleneck between them, trade security correctness |
| against operational blast radius, and re-plan when the attacker pivots |
| mid-incident. Each of those is a separate failure mode in production |
| SOC tooling, and the benchmark scores all four at once. |
|
|
| --- |
|
|
| ## Results |
|
|
| Numbers below come from the production run on Hugging Face L4 Jobs, |
| training Qwen2.5-1.5B-Instruct with a LoRA adapter. Eval is 50 |
| deterministic seeds per task, evaluated on a held-out world family |
| the policy never saw during training. |
|
|
|  |
|
|
| On the hard task (`polymorphic-zero-day`) the tool-aware heuristic |
| baseline scores 0.18 and Janus holds 0.90 on the held-out family. On |
| the easier tasks the lift is smaller because the rule baseline is |
| already near the ceiling; the benchmark is shaped so adaptation only |
| matters where it should. |
|
|
| ### Benchmark comparison (full table) |
|
|
| | Task | No-tool baseline | Tool-aware baseline | SFT (train family) | SFT (held-out) | GRPO (train) | GRPO (held-out) | |
| |------|-----------------:|-------------------:|-------------------:|---------------:|-------------:|----------------:| |
| | `direct-triage` | 0.860 | 0.990 | 0.990 | 0.990 | 0.990 | 0.990 | |
| | `dual-pivot` | 0.650 | 0.640 | 0.825 | 0.825 | 0.825 | 0.825 | |
| | `polymorphic-zero-day` | 0.380 | 0.180 | 0.960 | 0.930 | **0.883** | **0.902** | |
|
|
| Two things in this table are worth flagging. |
|
|
| The tool-aware baseline scores 0.18 on the hard task, worse than the |
| no-tool baseline at 0.38. That is not a bug in the baseline. Bolting |
| tools onto a heuristic without learning when to trust them makes the |
| agent over-trigger on injected false positives. You see the same |
| pattern in production with rule-based SOAR playbooks against adaptive |
| adversaries. |
|
|
| Held-out GRPO (0.902) actually edges out train-family GRPO (0.883). That |
| is consistent with the policy generalizing across world templates |
| rather than memorizing them. Without splitting the eval by world family |
| this finding would not be visible: a same-seed evaluation cannot tell |
| generalization apart from memorization. |
|
|
| ### SFT: loss and held-out reward |
|
|
|  |
|
|
|  |
|
|
| ### GRPO: refinement on the polymorphic adversary |
|
|
|  |
|
|
| ### Training runs |
|
|
| Three production runs on Hugging Face Jobs produced the artifacts in this |
| README. Stdout logs are public and the per-step / per-episode metrics |
| files are next to the adapters. |
|
|
| | Run | Trainer | GPU | Steps / Episodes | Train wall-clock | Logs | Metrics | |
| |-----|---------|-----|------------------|------------------|------|---------| |
| | [`sft_worldsplit_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/sft_worldsplit_1_5b) | SFT (LoRA) | L4 ×1 | 378 steps | 9m 49s | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/sft_worldsplit_1_5b.log) | [trainer_state](https://huggingface.co/SaiManish123/Janus/blob/main/sft_worldsplit_1_5b/checkpoint-378/trainer_state.json) | |
| | [`grpo_worldsplit_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/grpo_worldsplit_1_5b) | GRPO, mixed curriculum | L4 ×1 | 1,628 episodes | 1h 26m | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/grpo_worldsplit_1_5b.log) | [per-episode](https://huggingface.co/SaiManish123/Janus/blob/main/grpo_worldsplit_1_5b/metrics.json) | |
| | [`grpo_polymorphic_zero_day_1_5b`](https://huggingface.co/SaiManish123/Janus/tree/main/grpo_polymorphic_zero_day_1_5b) | GRPO, hard-task focus | L4 ×1 | 4,357 episodes | 3h 17m | [stdout](https://huggingface.co/SaiManish123/Janus/blob/main/logs/grpo_polymorphic_zero_day_1_5b.log) | [per-episode](https://huggingface.co/SaiManish123/Janus/blob/main/grpo_polymorphic_zero_day_1_5b/metrics.json) | |
|
|
| The curriculum run mixes all three tasks (weights `direct-triage: 0.3 / |
| dual-pivot: 0.4 / polymorphic-zero-day: 0.3`). The polymorphic run |
| trains exclusively on the hard task to push hard-task performance |
| without distraction from saturated tiers. Per-episode reward in both |
| runs stabilizes within the first ~500 episodes and stays there for the |
| rest of the schedule. |
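
The curriculum weights above can be read as a per-episode sampling distribution over tasks. The sketch below shows one way to draw tasks with those weights. The helper name `sample_task` is hypothetical, and whether `train.py` samples per episode or per batch is an assumption, not something this README states.

```python
import random
from collections import Counter

# Task weights quoted in the README for the mixed-curriculum GRPO run.
CURRICULUM_WEIGHTS = {
    "direct-triage": 0.3,
    "dual-pivot": 0.4,
    "polymorphic-zero-day": 0.3,
}


def sample_task(rng: random.Random) -> str:
    """Draw one task name with probability proportional to its weight.

    ASSUMPTION: hypothetical helper; the real trainer may schedule tasks
    differently (for example per batch rather than per episode).
    """
    names = list(CURRICULUM_WEIGHTS)
    weights = [CURRICULUM_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw sequence is repeatable
    draws = 10_000
    counts = Counter(sample_task(rng) for _ in range(draws))
    for name in CURRICULUM_WEIGHTS:
        print(f"{name}: {counts[name] / draws:.3f}")
```

Passing an explicit `random.Random` instance keeps the schedule reproducible without touching global random state.
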
|
|
| --- |
|
|
| ## Architecture |
|
|
|  |
|
|
| Each episode runs against a sampled mission profile, world-family |
| template, and latent operational mode. The Threat Analyst investigates |
| raw enterprise evidence through SOC tools and emits a structured |
| handoff. The Tactical Executor sees only that handoff (not the raw |
| state) and chooses the mitigation. The split mirrors the |
| Tier-1-to-Tier-2 escalation in a real SOC, where the responder acts on |
| the analyst's written triage and never re-examines the raw telemetry. |
| A deterministic Python grader scores security correctness, business |
| impact, dependency blast radius, and mission alignment. There is no |
| LLM-as-judge anywhere in the loop. |
|
|
| ## Training Pipeline |
|
|
|  |
|
|
| Five steps, each reproducible from the repo: |
|
|
| 1. Generate SFT demonstrations by rolling AdaptShield episodes with a |
| rule-based Phase 1 expert and a tool-aware Phase 2 expert. |
| 2. Train a LoRA adapter on Qwen2.5-1.5B (or 0.5B for the Colab |
| reproducer) with supervised fine-tuning on those demos. |
| 3. Evaluate on both train-family and held-out-family worlds. The split |
| is by world template, not by seed, so memorizing a template doesn't |
| transfer across the split. |
| 4. Refine the SFT adapter with GRPO on a curriculum weighted toward |
| `polymorphic-zero-day`. The deterministic grader is the reward. |
| 5. Publish adapters, curves, metrics, and benchmark tables to |
| [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus). |
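
Step 3 is the part that is easy to get wrong, so here is a minimal sketch of a split by world template rather than by seed. The template names and both helper functions are hypothetical placeholders; the real world families and their loader live in the repo and may look different.

```python
# ASSUMPTION: placeholder template names, not the real AdaptShield world families.
WORLD_TEMPLATES = ["family_a", "family_b", "family_c", "family_d", "family_e"]


def split_by_family(templates, held_out):
    """Partition templates into disjoint (train, held_out) lists."""
    held = set(held_out)
    unknown = held - set(templates)
    if unknown:
        raise ValueError(f"unknown held-out templates: {sorted(unknown)}")
    train = [t for t in templates if t not in held]
    evals = [t for t in templates if t in held]
    return train, evals


def make_eval_episodes(families, seeds):
    """Cross every family with every seed to get deterministic episode specs."""
    return [(family, seed) for family in families for seed in seeds]


if __name__ == "__main__":
    train, held = split_by_family(WORLD_TEMPLATES, ["family_e"])
    episodes = make_eval_episodes(held, range(50))  # 50 seeds, as in the eval setup
    print(train, held, len(episodes))
```

Because the partition is made on template names before any seed is drawn, no choice of seed can leak a held-out template into training.
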
|
|
| A free-tier Colab notebook reproduces steps 1-4 end-to-end on a T4 in |
| roughly 35 minutes using Qwen2.5-0.5B and reduced episode budgets. The |
| numbers in this README come from the 1.5B run on a Hugging Face L4 Job. |
|
|
| --- |
|
|
| ## Environment Description |
|
|
| The agent defends a 4-node enterprise network (`auth_service`, |
| `payment_service`, `database`, `api_gateway`). Each turn has two phases: |
|
|
| **Phase 1 (Threat Analyst).** The agent reads SIEM metrics, can call SOC |
| tools (log search, network telemetry, threat intel lookup), and emits a |
| structured `Phase1Action` with threat type, target node, confidence, and |
| a recommended action. |
|
|
| **Phase 2 (Tactical Executor).** The agent receives only the Phase 1 |
| assessment (blind to raw state) and emits a `Phase2Action`. The analyst |
| has to communicate clearly because the executor cannot double-check the |
| network. |
|
|
| The attacker escalates through `recon → exploit → exfiltration` if the |
| agent fails to respond correctly. On the hard task, the attacker shifts |
| strategy mid-episode and seeds false-positive noise that looks like a |
| real attack but isn't, which punishes reflexive isolation. This is the |
| alert-fatigue regime that drives most production SOC false-positive |
| budgets. |
|
|
| ### Observation Space |
|
|
| ```json |
| { |
| "phase": "1 or 2", |
| "network_nodes": { |
| "auth_service": {"status": "...", "request_rate": 0, "error_rate": 0.0, "cpu": 0} |
| }, |
| "active_alerts": ["raw metric alert strings (no MITRE codes)"], |
| "attack_stage": "recon | exploit | exfiltration | none", |
| "history": [{"turn": "1", "p1": "classified:brute_force", "p2": "rate_limit→auth_service"}], |
| "phase1_assessment": {"threat_type": "...", "confidence": 0.9, "target_node": "..."}, |
| "metadata": {"normalized_score": 0.72} |
| } |
| ``` |
|
|
| Phase 2 observations have empty `network_nodes` and `active_alerts`. |
| The executor only sees the analyst's handoff. |
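
To make the information bottleneck concrete, the sketch below derives a Phase 2 observation from a Phase 1 observation by blanking the two raw-state fields named above and attaching the handoff. The function name `to_phase2_observation` is hypothetical, storing `phase` as an integer is an assumption, and the README does not say whether any other fields are redacted, so this version leaves them untouched.

```python
import copy


def to_phase2_observation(phase1_obs: dict, assessment: dict) -> dict:
    """Return a Phase 2 view: raw state blanked, analyst handoff attached.

    ASSUMPTION: hypothetical helper, not the server's actual code path.
    Only `network_nodes` and `active_alerts` are blanked because those are
    the two fields the README names.
    """
    obs = copy.deepcopy(phase1_obs)  # never mutate the analyst's view
    obs["phase"] = 2                 # ASSUMPTION: phase stored as an int
    obs["network_nodes"] = {}
    obs["active_alerts"] = []
    obs["phase1_assessment"] = dict(assessment)
    return obs


if __name__ == "__main__":
    phase1 = {
        "phase": 1,
        "network_nodes": {"auth_service": {"status": "degraded", "error_rate": 0.4}},
        "active_alerts": ["auth_service error_rate above baseline"],
        "history": [],
    }
    handoff = {"threat_type": "brute_force", "confidence": 0.9,
               "target_node": "auth_service"}
    print(to_phase2_observation(phase1, handoff))
```
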
|
|
| ### Action Space |
|
|
| **Phase 1 (`Phase1Action`):** |
| ```json |
| {"threat_type": "brute_force", "confidence": 0.9, "target_node": "auth_service", "recommended_action": "rate_limit", "reasoning": "..."} |
| ``` |
|
|
| **Phase 2 (`Phase2Action`):** |
| ```json |
| {"action": "rate_limit", "target_node": "auth_service", "reasoning": "..."} |
| ``` |
|
|
| Valid actions: `rate_limit`, `isolate`, `honeypot`, `patch`, `monitor`. |
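
Since a policy emits these actions as JSON text, it helps to validate them before they reach the environment. The sketch below parses and checks a Phase 2 action against the valid action list and the four node names. `Phase2ActionSketch` and `parse_phase2` are hypothetical names used for illustration; the real schemas are in `models.py` and may be defined differently.

```python
import json
from dataclasses import dataclass

VALID_ACTIONS = {"rate_limit", "isolate", "honeypot", "patch", "monitor"}
VALID_NODES = {"auth_service", "payment_service", "database", "api_gateway"}


@dataclass(frozen=True)
class Phase2ActionSketch:
    """ASSUMPTION: illustrative stand-in for the real `Phase2Action` schema."""
    action: str
    target_node: str
    reasoning: str = ""

    def __post_init__(self):
        # Reject anything outside the documented action and node sets.
        if self.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        if self.target_node not in VALID_NODES:
            raise ValueError(f"unknown target node: {self.target_node!r}")


def parse_phase2(raw: str) -> Phase2ActionSketch:
    """Parse model output into a validated action; raises on malformed input."""
    data = json.loads(raw)
    return Phase2ActionSketch(
        action=data["action"],
        target_node=data["target_node"],
        reasoning=data.get("reasoning", ""),
    )


if __name__ == "__main__":
    raw = '{"action": "rate_limit", "target_node": "auth_service", "reasoning": "..."}'
    print(parse_phase2(raw))
```

Failing loudly on an unknown action is a design choice for this sketch; how the real environment treats malformed actions is not specified here.
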
|
|
| ### Tasks |
|
|
| | Task | Difficulty | Description | Rule baseline | |
| |------|-----------|-------------|--------------:| |
| | `direct-triage` | Easy | Single fixed strategy | ~0.87 | |
| | `dual-pivot` | Medium | Two alternating strategies | ~0.76 | |
| | `polymorphic-zero-day` | Hard | All four strategies + mid-episode shift + noise | ~0.52 | |
|
|
| ### Reward Function |
|
|
| | Outcome | Reward | |
| |---------|-------:| |
| | Phase 1 threat type correct | +0.15 | |
| | Phase 1 target node correct | +0.10 | |
| | Phase 2 optimal action + correct target | +0.39 | |
| | Phase 2 heavy-handed but effective | +0.18 | |
| | Phase 2 wrong action | -0.25 | |
| | False positive on benign event | -0.39 | |
| | Catastrophic: database exfiltrated | -0.49, `done=True` | |
|
|
| Scores are clipped to the open interval `(0.01, 0.99)`. The grader |
| never emits exactly 0 or 1, which keeps GRPO advantages well-defined. |
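
As a rough illustration of how the table and the clipping rule fit together, the sketch below encodes the per-outcome rewards as a lookup and applies the `0.01` / `0.99` bounds. The outcome keys are hypothetical labels, and summing outcomes within a turn is an assumption; the README does not specify how the grader aggregates turn rewards into the final score.

```python
# Per-outcome rewards copied from the table above.
# ASSUMPTION: the keys are hypothetical labels, not the grader's identifiers.
OUTCOME_REWARDS = {
    "p1_threat_type_correct": 0.15,
    "p1_target_node_correct": 0.10,
    "p2_optimal_action": 0.39,
    "p2_heavy_handed_effective": 0.18,
    "p2_wrong_action": -0.25,
    "false_positive_on_benign": -0.39,
    "database_exfiltrated": -0.49,
}

SCORE_MIN, SCORE_MAX = 0.01, 0.99


def turn_reward(outcomes):
    """Sum the table rewards for one turn and flag the catastrophic case.

    ASSUMPTION: outcomes within a turn are additive.
    """
    reward = sum(OUTCOME_REWARDS[name] for name in outcomes)
    done = "database_exfiltrated" in outcomes  # episode ends on exfiltration
    return reward, done


def clip_score(score: float) -> float:
    """Keep a score inside the documented bounds so it is never exactly 0 or 1."""
    return min(SCORE_MAX, max(SCORE_MIN, score))


if __name__ == "__main__":
    reward, done = turn_reward(
        ["p1_threat_type_correct", "p1_target_node_correct", "p2_optimal_action"]
    )
    print(round(reward, 2), done, clip_score(1.2), clip_score(-0.3))
```
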
|
|
| ### Operational Impact Layer |
|
|
| AdaptShield also scores business impact, so the agent is rewarded for |
| stopping the attack without ignoring operational blast radius. Each |
| service has a criticality weight and a dependency fan-out: |
|
|
| | Service | Criticality | Downstream dependency risk | |
| |---------|------------:|----------------------------| |
| | `auth_service` | 0.70 | `payment_service` | |
| | `payment_service` | 0.90 | `api_gateway` | |
| | `database` | 1.00 | `payment_service`, `api_gateway` | |
| | `api_gateway` | 0.80 | `auth_service`, `payment_service`, `database` | |
|
|
| Actions have bounded disruption costs (`monitor` = none, `isolate` = |
| highest). The grader emits `business_impact`, `availability_impact`, |
| `security_risk`, `dependency_blast_radius`, and `operational_penalty` |
| inside `score_breakdown`. The reward adjustment is capped at `±0.05` per |
| turn, which keeps the training signal stable while leaving the replay |
| detailed enough to explain whether the agent stopped the attack cleanly |
| or caused unnecessary business disruption getting there. This is the |
| MTTR-versus-availability tradeoff every SOC actually navigates: |
| containment that bricks `auth_service` to stop a credential-stuffing |
| campaign also takes legitimate users offline, so "isolate everything" |
| is not a winning playbook. |
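
The sketch below shows the shape of a bounded operational penalty using the criticality and dependency tables above. Everything beyond those two tables is an assumption made for illustration: the per-action disruption costs (the README only fixes `monitor` as none and `isolate` as highest), the blast-radius formula, and the `0.1` scale factor are invented here and are not the grader's actual arithmetic. The sketch also models penalties only, while the real adjustment is signed.

```python
# Criticality and downstream dependencies copied from the table above.
CRITICALITY = {
    "auth_service": 0.70,
    "payment_service": 0.90,
    "database": 1.00,
    "api_gateway": 0.80,
}
DOWNSTREAM = {
    "auth_service": ["payment_service"],
    "payment_service": ["api_gateway"],
    "database": ["payment_service", "api_gateway"],
    "api_gateway": ["auth_service", "payment_service", "database"],
}

# ASSUMPTION: illustrative costs; only the ordering of the extremes
# (monitor lowest, isolate highest) comes from the README.
DISRUPTION_COST = {
    "monitor": 0.0, "honeypot": 0.1, "rate_limit": 0.2, "patch": 0.3, "isolate": 1.0,
}
CAP = 0.05  # documented per-turn bound on the adjustment


def blast_radius(node: str) -> float:
    """ASSUMED formula: mean criticality of the target and its dependents."""
    affected = [node, *DOWNSTREAM[node]]
    return sum(CRITICALITY[n] for n in affected) / len(affected)


def operational_adjustment(action: str, node: str) -> float:
    """Penalty for disruption, clamped to [-CAP, +CAP]."""
    raw = -0.1 * DISRUPTION_COST[action] * blast_radius(node)  # 0.1 is an assumed scale
    return max(-CAP, min(CAP, raw))


if __name__ == "__main__":
    for action in DISRUPTION_COST:
        print(action, operational_adjustment(action, "database"))
```

Whatever the real formula is, the clamp at the end is the part the README guarantees: no single action can move the turn reward by more than `0.05` through this channel.
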
|
|
| ### Mission-Aware Objectives |
|
|
| Each task carries a mission profile, visible in observation metadata and |
| appended to the system prompt: |
|
|
| | Task | Mission | Primary Asset | SLA Priority | Risk Tolerance | |
| |------|---------|---------------|--------------|----------------| |
| | `direct-triage` | `login_stability` | `auth_service` | availability | medium | |
| | `dual-pivot` | `checkout_continuity` | `payment_service` | availability | medium | |
| | `polymorphic-zero-day` | `breach_containment` | `database` | containment | low | |
|
|
| The grader emits `mission_alignment` and `mission_adjustment`, capped at |
| `±0.04` per turn. This makes the agent optimize for the operational |
| mission, not just the threat label. Availability-priority missions |
| discourage unnecessary isolation of the primary asset; containment |
| missions reward decisive correct containment of the crown-jewel |
| database. |
|
|
| ### Design choices that aren't obvious |
|
|
| A few decisions in the environment that look like details but matter |
| for what the benchmark actually measures: |
|
|
| - **Information bottleneck between phases.** Phase 2's observation has |
| empty `network_nodes` and `active_alerts`. The executor only sees |
| Phase 1's structured handoff. If Phase 1 cannot communicate clearly, |
| Phase 2 fails, and you see it in the score, not in a separate metric. |
| This is what makes the env actually test cross-role coordination |
| rather than just two independent policies stitched together. |
| - **Train/eval split by world family, not by seed.** The world templates |
| used for training are disjoint from the ones used for held-out |
| evaluation. A model that overfits to a specific service-name pattern |
| or a specific alert distribution will pass train evals and fail |
| held-out. Same-seed evaluation would have hidden this. |
| - **Open scoring interval `(0.01, 0.99)`.** The grader never emits |
| exactly 0 or 1. This keeps GRPO advantage estimates well-defined. |
| Saturating rewards collapse the variance the algorithm needs. |
| - **Bounded auxiliary signals.** Operational impact is capped at `±0.05` |
| per turn and mission alignment at `±0.04`. They steer the policy |
| without dominating the security signal, so the training curve does |
| not get hijacked by a single side-objective. |
| - **Deterministic Python grader, no LLM-as-judge.** Rewards come from |
| strategy matching against a fixed ground-truth attacker, not from a |
| judge model. The benchmark cannot be gamed by a more eloquent policy. |
| - **Phase-1 alerts are raw metric strings, not pre-tagged MITRE ATT&CK |
| techniques.** The agent has to do the classification itself, not |
| match a label to a label. This is what makes the heuristic baseline |
| collapse on the hard task: rule-based classification keyed on fixed |
| indicators of compromise does not survive the injected false-positive |
| noise that real polymorphic adversaries use to drown Tier-1 triage. |
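
The point about reward variance can be seen in a few lines. The sketch below uses the usual group-relative normalization from GRPO, where each completion's advantage is its reward minus the group mean, divided by the group standard deviation. Whether `train.py` uses exactly this form, including the `eps` term, is an assumption; the sketch only illustrates why a group with identical rewards carries no learning signal.

```python
import statistics


def group_advantages(rewards, eps: float = 1e-6):
    """Group-relative advantages: (r - mean) / (std + eps).

    ASSUMPTION: common GRPO normalization, not necessarily the exact
    variant used by this project's trainer.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Every completion in the group hit the same saturated score:
    # all advantages are zero, so the policy gets no gradient signal.
    print(group_advantages([0.99, 0.99, 0.99, 0.99]))

    # A group with spread in its rewards yields usable, signed advantages.
    print(group_advantages([0.20, 0.55, 0.90, 0.55]))
```
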
|
|
| --- |
|
|
| ## Reproduce it |
|
|
| ### Free-tier Colab (recommended for judges) |
|
|
| Open the Colab notebook linked above and run top-to-bottom. It will: |
|
|
| - install the exact pinned dependency stack used in the HF Job |
| - generate SFT demos from the environment |
| - train an SFT LoRA on Qwen2.5-0.5B (T4-friendly) |
| - run GRPO refinement on top of that SFT adapter |
| - print the benchmark table and inline the production training curves |
| from `SaiManish123/Janus` so you can compare scaled-down vs. full runs |
|
|
| End-to-end runtime on a Colab T4 is roughly 35 minutes. |
|
|
| ### Local setup |
|
|
| ```bash |
| pip install openenv-core |
| git clone https://github.com/SaiManish123/adaptshield |
| cd adaptshield |
| python -m adaptshield.server.app |
| ``` |
|
|
| ### Run inference against the live environment |
|
|
| ```bash |
| export HF_TOKEN=your_token |
| export ADAPTSHIELD_TASK=direct-triage # or dual-pivot / polymorphic-zero-day |
| export ENV_BASE_URL=http://localhost:7860 |
| python inference.py # run from the repo root |
| ``` |
|
|
| `inference.py` honors the evaluator contract: `[START]`, `[STEP]`, `[END]` |
| stdout markers and credentials read only from environment variables. |
|
|
| ### Smoke test |
|
|
| ```bash |
| python smoke_test.py |
| ``` |
|
|
| The smoke test spins the env up in-process and walks one episode of each |
| task with a deterministic policy. It should finish in under 10 seconds. |
|
|
| ### Regression tests |
|
|
| ```bash |
| adaptshield/.venv/bin/python -m unittest tests.test_regression -v |
| ``` |
|
|
| ### Baseline scores |
|
|
| With `ADAPTSHIELD_SEED=42`, the deterministic rule baseline produces: |
|
|
| | Task | Score | Steps | Status | |
| |------|------:|------:|--------| |
| | `direct-triage` | 0.870 | 10 | PASS | |
| | `dual-pivot` | 0.760 | 12 | PASS | |
| | `polymorphic-zero-day` | 0.520 | 16 | PASS | |
|
|
| Difficulty staircase: **PASS**. |
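
One plausible reading of the staircase check is that baseline scores must fall strictly from the easy task to the hard one. The sketch below encodes that reading against the table above. The function name and the strict-decrease rule are assumptions; the actual check in `eval_tasks.py` may use thresholds or margins instead.

```python
# Seed-42 rule-baseline scores from the table above, ordered easy to hard.
BASELINE_SCORES = [
    ("direct-triage", 0.870),
    ("dual-pivot", 0.760),
    ("polymorphic-zero-day", 0.520),
]


def staircase_passes(scores) -> bool:
    """True when each task scores strictly lower than the easier one before it.

    ASSUMPTION: hypothetical definition of the "difficulty staircase".
    """
    values = [score for _, score in scores]
    return all(easier > harder for easier, harder in zip(values, values[1:]))


if __name__ == "__main__":
    print("PASS" if staircase_passes(BASELINE_SCORES) else "FAIL")
```
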
|
|
| --- |
|
|
| ## Repository layout |
|
|
| ``` |
| adaptshield/ |
| ├── server/                  # FastAPI server (OpenEnv-compatible) |
| ├── client.py                # OpenEnv client (no server-internal imports) |
| ├── models.py                # Phase1Action / Phase2Action schemas |
| ├── soc_tools.py             # SIEM, log search, threat intel SOC tools |
| ├── eval_tasks.py            # task definitions + difficulty staircase |
| ├── baseline.py              # deterministic rule baseline |
| ├── tool_baseline.py         # tool-aware heuristic baseline |
| ├── generate_sft_data.py     # rolls episodes → SFT JSONL |
| ├── train_sft.py             # LoRA SFT trainer (Unsloth + TRL) |
| ├── train.py                 # GRPO trainer (Unsloth + TRL) |
| ├── plot_training.py         # reward / loss curve plotting |
| ├── build_benchmark_table.py # eval matrix builder |
| ├── inference.py             # judge-facing entry point |
| ├── smoke_test.py            # one-shot in-process smoke test |
| ├── tests/test_regression.py # determinism + reward regression tests |
| ├── openenv.yaml             # OpenEnv manifest |
| └── Dockerfile               # HF Space container |
| ``` |
|
|
| ## Engineering notes |
|
|
| `AdaptShieldEnvironment` extends OpenEnv's `Environment` base class and |
| follows the Gym-style API (`reset`, `step`, `state`). The client in |
| `client.py` talks to the server only through HTTP, with no shared |
| imports and no leaking of server internals. None of the SOC tools are |
| named `reset`, `step`, `state`, or `close`, so they do not collide with |
| the reserved MCP tool names. Grading is deterministic Python; the |
| reward signal and the benchmark scores both come from strategy |
| matching against a fixed ground-truth attacker, never from an LLM |
| judge. |
|
|
| All adapters, curves, metrics, and benchmark tables for the 1.5B run |
| are public on [`SaiManish123/Janus`](https://huggingface.co/SaiManish123/Janus). |
|
|
| ## License |
|
|
| MIT. |
|
|