Janus (AdaptShield): Adaptive Incident Response Under Polymorphic Adversaries
AdaptShield is the environment: a two-phase agentic cybersecurity simulator where an LLM defends a 4-node enterprise network against an adversary that shifts strategy mid-episode. Janus is the model we trained on it: a Qwen2.5-1.5B LoRA adapter, supervised fine-tuned and then refined with GRPO. On the hardest task Janus scores 0.90 on a held-out world family it never saw during training; a tool-aware heuristic baseline scores 0.18 on the same task.
The skill being tested is narrow on purpose. Not threat classification. Not generic tool calling. The benchmark targets one thing: real-time adaptation when the attacker's playbook changes mid-incident. The Why this matters section explains why we think that's the gap; the Results section shows it closing.
Project Links
- HF Space (live env): SaiManish123/adaptshield
- Colab notebook (SFT + GRPO reproducer, free T4): Project_Janus(AdaptShield)_Final.ipynb
- Artifacts / model repo: SaiManish123/Janus
- Demo video: youtu.be/upX9a5zXHBM
Why this matters
Most cyber-agent demos test threat classification or generic tool calling. Real production breaches don't look like that. They look like this:
In April 2026 attackers compromised Context.ai, used its OAuth integration into a Vercel employee's Google Workspace, and pivoted from shadow AI through identity into Vercel's internal systems, where they enumerated and decrypted customer environment variables. The same week, a Broken Object Level Authorization flaw in Lovable.dev let any free-tier account read source code, Supabase credentials, Stripe keys, and AI chat histories from other tenants, including projects built by AI itself. Eight months earlier, the Tea dating app left a Firebase bucket open and 72,000 verification selfies and driver's licenses of women on a safety app were scraped to 4chan within hours.
Three different failure modes, one underlying problem for the defender's agent: identity hijack via shadow AI, broken authorization in vibe-coded apps, and classic cloud misconfig. The environment is shifting faster than any static training distribution can keep up with, and the real attacker does not sit still while you classify them.
Real campaigns drift through the kill chain (initial access, lateral movement, exfiltration) and the defender's job is to re-classify, contain, and eradicate as the picture changes. Static SOAR playbooks keyed to fixed indicators of compromise fail the moment the adversary rotates them; that is what an attacker TTP shift looks like in production, and it is the regime where dwell time blows out and Tier-1 triage starts dropping signal.
AdaptShield is built around that pressure. The environment forces the agent to act on partial evidence, hand judgment across two roles with an information bottleneck between them, trade security correctness against operational blast radius, and re-plan when the attacker pivots mid-incident. Each of those is a separate failure mode in production SOC tooling, and the benchmark scores all four at once.
Results
Numbers below come from the production run on Hugging Face L4 Jobs, training Qwen2.5-1.5B-Instruct with a LoRA adapter. Evaluation uses 50 deterministic seeds per task on a held-out world family the policy never saw during training.
On the hard task (polymorphic-zero-day) the tool-aware heuristic baseline scores 0.18 and Janus holds 0.90 on the held-out family. On the easier tasks the lift is smaller because the rule baseline is already near the ceiling; the benchmark is shaped so adaptation only matters where it should.
Benchmark comparison (full table)
| Task | No-tool baseline | Tool-aware baseline | SFT (train family) | SFT (held-out) | GRPO (train) | GRPO (held-out) |
|---|---|---|---|---|---|---|
| direct-triage | 0.860 | 0.990 | 0.990 | 0.990 | 0.990 | 0.990 |
| dual-pivot | 0.650 | 0.640 | 0.825 | 0.825 | 0.825 | 0.825 |
| polymorphic-zero-day | 0.380 | 0.180 | 0.960 | 0.930 | 0.883 | 0.902 |
Two things in this table are worth flagging.
The tool-aware baseline scores 0.18 on the hard task, worse than the no-tool baseline at 0.38. That is not a bug in the baseline; it is that bolting tools onto a heuristic without learning when to trust them makes the agent over-trigger on injected false positives. You see the same pattern in production with rule-based SOAR playbooks against adaptive adversaries.
Held-out GRPO (0.902) actually edges out train-family GRPO (0.883). That is evidence the policy is generalizing across world templates rather than memorizing them. Without splitting the eval by world family this finding would not be visible. Same-seed evaluation would have credited the model for memorization it did not do.
SFT: loss and held-out reward
GRPO: refinement on the polymorphic adversary
Training runs
Three production runs on Hugging Face Jobs produced the artifacts in this README. Stdout logs are public and the per-step / per-episode metrics files are next to the adapters.
| Run | Trainer | GPU | Steps / Episodes | Train wall-clock | Logs | Metrics |
|---|---|---|---|---|---|---|
| sft_worldsplit_1_5b | SFT (LoRA) | L4 ×1 | 378 steps | 9m 49s | stdout | trainer_state |
| grpo_worldsplit_1_5b | GRPO, mixed curriculum | L4 ×1 | 1,628 episodes | 1h 26m | stdout | per-episode |
| grpo_polymorphic_zero_day_1_5b | GRPO, hard-task focus | L4 ×1 | 4,357 episodes | 3h 17m | stdout | per-episode |
The curriculum run mixes all three tasks (weights direct-triage: 0.3 / dual-pivot: 0.4 / polymorphic-zero-day: 0.3); the mix amounts to weighted per-episode task sampling, sketched below. The polymorphic run trains exclusively on the hard task to push hard-task performance without distraction from saturated tiers. Per-episode reward in both runs stabilizes within the first ~500 episodes and stays there for the rest of the schedule.
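A minimal sketch of that weighted sampling. Only the weights come from this README; the sampler itself is illustrative, not the trainer's actual code:

```python
import random

# Task weights from the curriculum paragraph above.
CURRICULUM = {
    "direct-triage": 0.3,
    "dual-pivot": 0.4,
    "polymorphic-zero-day": 0.3,
}

def sample_task(rng: random.Random) -> str:
    """Draw the next training episode's task according to the curriculum mix."""
    tasks, weights = zip(*CURRICULUM.items())
    return rng.choices(tasks, weights=weights, k=1)[0]
```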
Architecture
Each episode runs against a sampled mission profile, world-family template, and latent operational mode. The Threat Analyst investigates raw enterprise evidence through SOC tools and emits a structured handoff. The Tactical Executor sees only that handoff (not the raw state) and chooses the mitigation. The split mirrors the Tier-1-to-Tier-2 escalation in a real SOC, where the responder acts on the analyst's written triage and never re-examines the raw telemetry. A deterministic Python grader scores security correctness, business impact, dependency blast radius, and mission alignment. There is no LLM-as-judge anywhere in the loop.
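To make the loop concrete, here is a minimal sketch of one episode against the live environment. The Gym-style reset/step contract is stated in the Engineering notes below, but the client constructor, keyword arguments, payload shapes, and the hard-coded actions standing in for a real policy are all assumptions:

```python
from adaptshield.client import AdaptShieldClient  # class name assumed

env = AdaptShieldClient(base_url="http://localhost:7860")
obs = env.reset(task="polymorphic-zero-day", seed=42)

while not obs["done"]:
    # Phase 1 (Threat Analyst): sees raw SIEM state, emits a structured handoff.
    obs = env.step({
        "threat_type": "brute_force",
        "confidence": 0.9,
        "target_node": "auth_service",
        "recommended_action": "rate_limit",
        "reasoning": "auth error rate spiking while traffic is flat",
    })
    # Phase 2 (Tactical Executor): sees only the handoff, picks the mitigation.
    obs = env.step({
        "action": "rate_limit",
        "target_node": "auth_service",
        "reasoning": "follow the analyst's triage",
    })
```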
Training Pipeline
Five steps, each reproducible from the repo:
- Generate SFT demonstrations by rolling AdaptShield episodes with a rule-based Phase 1 expert and a tool-aware Phase 2 expert.
- Train a LoRA adapter on Qwen2.5-1.5B (or 0.5B for the Colab reproducer) with supervised fine-tuning on those demos.
- Evaluate on both train-family and held-out-family worlds. The split is by world template, not by seed, so memorizing a template doesn't transfer across the split.
- Refine the SFT adapter with GRPO on a curriculum weighted toward polymorphic-zero-day. The deterministic grader is the reward.
- Publish adapters, curves, metrics, and benchmark tables to SaiManish123/Janus.
A free-tier Colab notebook reproduces steps 1-4 end-to-end on a T4 in roughly 35 minutes using Qwen2.5-0.5B and reduced episode budgets. The numbers in this README come from the 1.5B run on a Hugging Face L4 Job.
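For orientation, step 2 could look roughly like this under current TRL + peft APIs; hyperparameters and the demo file name (sft_demos.jsonl) are illustrative, not the production config:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Step 1's rollouts, assumed serialized as one JSONL record per demonstration.
demos = load_dataset("json", data_files="sft_demos.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # or Qwen2.5-0.5B for the Colab run
    train_dataset=demos,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    args=SFTConfig(output_dir="sft_worldsplit_1_5b", max_steps=378),
)
trainer.train()
```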
Environment Description
The agent defends a 4-node enterprise network (auth_service, payment_service, database, api_gateway). Each turn has two phases:
Phase 1 (Threat Analyst). Agent reads SIEM metrics, can call SOC tools (log search, network telemetry, threat intel lookup), and emits a structured Phase1Action with threat type, target node, confidence, and a recommended action.
Phase 2 (Tactical Executor). Agent receives only the Phase 1 assessment (blind to raw state) and emits a Phase2Action. The analyst has to communicate clearly because the executor cannot double-check the network.
The attacker escalates through recon → exploit → exfiltration if the agent fails to respond correctly. On the hard task, the attacker shifts strategy mid-episode and seeds false-positive noise that looks like a real attack but isn't, which punishes reflexive isolation. This is the alert-fatigue regime that drives most production SOC false-positive budgets.
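The escalation ladder can be pictured as a tiny state machine. This sketch is illustrative: the stage names match the attack_stage field below, but the transition rule is an assumption:

```python
STAGES = ["recon", "exploit", "exfiltration"]

def next_stage(current: str, defense_effective: bool) -> str:
    """Advance the attacker one rung unless the defender's action contained it."""
    if defense_effective or current == "none":
        return "none"  # contained: the attack chain is broken
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]  # otherwise escalate one stage
```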
Observation Space
```json
{
  "phase": "1 or 2",
  "network_nodes": {
    "auth_service": {"status": "...", "request_rate": 0, "error_rate": 0.0, "cpu": 0}
  },
  "active_alerts": ["raw metric alert strings (no MITRE codes)"],
  "attack_stage": "recon | exploit | exfiltration | none",
  "history": [{"turn": "1", "p1": "classified:brute_force", "p2": "rate_limit→auth_service"}],
  "phase1_assessment": {"threat_type": "...", "confidence": 0.9, "target_node": "..."},
  "metadata": {"normalized_score": 0.72}
}
```
Phase 2 observations have empty network_nodes and active_alerts. The executor only sees the analyst's handoff.
Action Space
Phase 1 (Phase1Action):
```json
{"threat_type": "brute_force", "confidence": 0.9, "target_node": "auth_service", "recommended_action": "rate_limit", "reasoning": "..."}
```
Phase 2 (Phase2Action):
```json
{"action": "rate_limit", "target_node": "auth_service", "reasoning": "..."}
```
Valid actions: rate_limit, isolate, honeypot, patch, monitor.
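A hypothetical Pydantic rendering of the two schemas; models.py in the repo is the authoritative definition, this only mirrors the JSON shapes above:

```python
from typing import Literal
from pydantic import BaseModel, Field

# The five valid mitigations listed above.
Mitigation = Literal["rate_limit", "isolate", "honeypot", "patch", "monitor"]

class Phase1Action(BaseModel):
    threat_type: str
    confidence: float = Field(ge=0.0, le=1.0)
    target_node: str
    recommended_action: Mitigation
    reasoning: str

class Phase2Action(BaseModel):
    action: Mitigation
    target_node: str
    reasoning: str
```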
Tasks
| Task | Difficulty | Description | Rule baseline |
|---|---|---|---|
| direct-triage | Easy | Single fixed strategy | ~0.87 |
| dual-pivot | Medium | Two alternating strategies | ~0.76 |
| polymorphic-zero-day | Hard | All four strategies + mid-episode shift + noise | ~0.52 |
Reward Function
| Outcome | Reward |
|---|---|
| Phase 1 threat type correct | +0.15 |
| Phase 1 target node correct | +0.10 |
| Phase 2 optimal action + correct target | +0.39 |
| Phase 2 heavy-handed but effective | +0.18 |
| Phase 2 wrong action | -0.25 |
| False positive on benign event | -0.39 |
| Catastrophic: database exfiltrated | -0.49, done=True |
Scores are clipped to the open interval (0.01, 0.99). The grader never emits exactly 0 or 1, which keeps GRPO advantages well-defined.
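The clipping rule itself is one line; a sketch (an illustrative helper, not the grader's actual code):

```python
def clip_score(raw: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp an episode score into [0.01, 0.99] so a rollout never scores
    exactly 0 or 1; fully saturated groups would make GRPO's group-relative
    advantages degenerate."""
    return max(lo, min(hi, raw))
```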
Operational Impact Layer
AdaptShield also scores business impact, so the agent is rewarded for stopping the attack without ignoring operational blast radius. Each service has a criticality weight and a dependency fan-out:
| Service | Criticality | Downstream dependency risk |
|---|---|---|
| auth_service | 0.70 | payment_service |
| payment_service | 0.90 | api_gateway |
| database | 1.00 | payment_service, api_gateway |
| api_gateway | 0.80 | auth_service, payment_service, database |
Actions have bounded disruption costs (monitor = none, isolate = highest). The grader emits business_impact, availability_impact, security_risk, dependency_blast_radius, and operational_penalty inside score_breakdown. The reward adjustment is capped at ±0.05 per turn, which keeps the training signal stable while leaving the replay detailed enough to explain whether the agent stopped the attack cleanly or caused unnecessary business disruption getting there. This is the MTTR-versus-availability tradeoff every SOC actually navigates: containment that bricks auth_service to stop a credential-stuffing campaign also takes legitimate users offline, so "isolate everything" is not a winning playbook.
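One way such a capped penalty could combine the table with per-action disruption costs. The DISRUPTION constants and the weighting are assumptions; only the criticality values, dependency fan-outs, and the ±0.05 cap come from this section:

```python
# Criticality weights and dependency fan-out from the table above.
CRITICALITY = {
    "auth_service": 0.70,
    "payment_service": 0.90,
    "database": 1.00,
    "api_gateway": 0.80,
}
DOWNSTREAM = {
    "auth_service": ["payment_service"],
    "payment_service": ["api_gateway"],
    "database": ["payment_service", "api_gateway"],
    "api_gateway": ["auth_service", "payment_service", "database"],
}
# Assumed disruption costs: monitor = none, isolate = highest.
DISRUPTION = {"monitor": 0.0, "patch": 0.2, "honeypot": 0.3, "rate_limit": 0.4, "isolate": 1.0}

def operational_penalty(action: str, node: str, cap: float = 0.05) -> float:
    # Blast radius = the node's own criticality plus its dependency fan-out.
    blast = CRITICALITY[node] + sum(CRITICALITY[d] for d in DOWNSTREAM[node])
    # Scale by how disruptive the action is, then clamp to the stated cap.
    return -min(cap, DISRUPTION[action] * blast * cap)
```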
Mission-Aware Objectives
Each task carries a mission profile, visible in observation metadata and appended to the system prompt:
| Task | Mission | Primary Asset | SLA Priority | Risk Tolerance |
|---|---|---|---|---|
| direct-triage | login_stability | auth_service | availability | medium |
| dual-pivot | checkout_continuity | payment_service | availability | medium |
| polymorphic-zero-day | breach_containment | database | containment | low |
The grader emits mission_alignment and mission_adjustment, capped at ±0.04 per turn. This makes the agent optimize for the operational mission, not just the threat label. Availability-priority missions discourage unnecessary isolation of the primary asset; containment missions reward decisive correct containment of the crown-jewel database.
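A hypothetical sketch of that adjustment; the action groupings and thresholds are assumptions, only the ±0.04 cap and the two SLA priorities come from the text above:

```python
def mission_adjustment(sla_priority: str, action: str, target_node: str,
                       primary_asset: str, cap: float = 0.04) -> float:
    """Capped per-turn bonus/penalty keyed to the mission's SLA priority."""
    if target_node != primary_asset:
        return 0.0
    if sla_priority == "availability" and action == "isolate":
        return -cap  # needlessly took the mission's primary asset offline
    if sla_priority == "containment" and action in ("isolate", "patch"):
        return cap   # decisive containment of the crown-jewel asset
    return 0.0
```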
Design choices that aren't obvious
A few decisions in the environment that look like details but matter for what the benchmark actually measures:
- Information bottleneck between phases. Phase 2's observation has empty network_nodes and active_alerts. The executor only sees Phase 1's structured handoff. If Phase 1 cannot communicate clearly, Phase 2 fails, and you see it in the score, not in a separate metric. This is what makes the env actually test cross-role coordination rather than just two independent policies stitched together.
- Train/eval split by world family, not by seed. The world templates used for training are disjoint from the ones used for held-out evaluation. A model that overfits to a specific service-name pattern or a specific alert distribution will pass train evals and fail held-out. Same-seed evaluation would have hidden this (see the sketch after this list).
- Open scoring interval (0.01, 0.99). The grader never emits exactly 0 or 1. This keeps GRPO advantage estimates well-defined. Saturating rewards collapse the variance the algorithm needs.
- Bounded auxiliary signals. Operational impact is capped at ±0.05 per turn and mission alignment at ±0.04. They steer the policy without dominating the security signal, so the training curve does not get hijacked by a single side-objective.
- Deterministic Python grader, no LLM-as-judge. Rewards come from strategy matching against a fixed ground-truth attacker, not from a judge model. The benchmark cannot be gamed by a more eloquent policy.
- Phase-1 alerts are raw metric strings, not pre-tagged MITRE ATT&CK techniques. The agent has to do the classification itself, not match a label to a label. This is what makes the heuristic baseline collapse on the hard task: rule-based classification keyed on fixed indicators of compromise does not survive the injected false-positive noise that real polymorphic adversaries use to drown Tier-1 triage.
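The world-family split referenced above, sketched minimally. Family names here are placeholders; the point is only that the train and held-out pools are disjoint, so a seed can never leak a training world into evaluation:

```python
import random

# Placeholder template names; the repo's actual families differ.
WORLD_FAMILIES = [f"family_{i:02d}" for i in range(8)]
TRAIN_FAMILIES, HELDOUT_FAMILIES = WORLD_FAMILIES[:6], WORLD_FAMILIES[6:]

def sample_world(split: str, seed: int) -> str:
    """Deterministically pick a world template from the split's own pool."""
    pool = TRAIN_FAMILIES if split == "train" else HELDOUT_FAMILIES
    return random.Random(seed).choice(pool)
```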
Reproduce it
Free-tier Colab (recommended for judges)
Open the Colab notebook linked above and run top-to-bottom. It will:
- install the exact pinned dependency stack used in the HF Job
- generate SFT demos from the environment
- train an SFT LoRA on Qwen2.5-0.5B (T4-friendly)
- run GRPO refinement on top of that SFT adapter
- print the benchmark table and inline the production training curves from SaiManish123/Janus so you can compare scaled-down vs. full runs
End-to-end runtime on a Colab T4 is roughly 35 minutes.
Local setup
```bash
pip install openenv-core
git clone https://github.com/SaiManish123/adaptshield
cd adaptshield
python -m adaptshield.server.app
```
Run inference against the live environment
```bash
export HF_TOKEN=your_token
export ADAPTSHIELD_TASK=direct-triage   # or dual-pivot / polymorphic-zero-day
export ENV_BASE_URL=http://localhost:7860
python inference.py                     # run from the repo root
```
inference.py honors the evaluator contract: [START], [STEP], [END] stdout markers and credentials read only from environment variables.
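For reference, the marker order could be honored with something as simple as this hypothetical skeleton; the real inference.py lives in the repo and does the actual rollout:

```python
import json
import os

def run_episode(task: str):
    # Placeholder generator standing in for the real env rollout logic.
    yield {"task": task, "turn": 1, "score": 0.0}

print("[START]", flush=True)
for record in run_episode(os.environ.get("ADAPTSHIELD_TASK", "direct-triage")):
    print("[STEP]", json.dumps(record), flush=True)
print("[END]", flush=True)
```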
Smoke test
```bash
python smoke_test.py
```
Spins the env up in-process and walks one episode of each task with a deterministic policy. Should finish in <10 seconds.
Regression tests
```bash
adaptshield/.venv/bin/python -m unittest tests.test_regression -v
```
Baseline scores
With ADAPTSHIELD_SEED=42, the deterministic rule baseline produces:
| Task | Score | Steps | Status |
|---|---|---|---|
| direct-triage | 0.870 | 10 | PASS |
| dual-pivot | 0.760 | 12 | PASS |
| polymorphic-zero-day | 0.520 | 16 | PASS |
Difficulty staircase: PASS.
Repository layout
```
adaptshield/
├── server/                    # FastAPI server (OpenEnv-compatible)
├── client.py                  # OpenEnv client (no server-internal imports)
├── models.py                  # Phase1Action / Phase2Action schemas
├── soc_tools.py               # SIEM, log search, threat intel SOC tools
├── eval_tasks.py              # task definitions + difficulty staircase
├── baseline.py                # deterministic rule baseline
├── tool_baseline.py           # tool-aware heuristic baseline
├── generate_sft_data.py       # rolls episodes → SFT JSONL
├── train_sft.py               # LoRA SFT trainer (Unsloth + TRL)
├── train.py                   # GRPO trainer (Unsloth + TRL)
├── plot_training.py           # reward / loss curve plotting
├── build_benchmark_table.py   # eval matrix builder
├── inference.py               # judge-facing entry point
├── smoke_test.py              # one-shot in-process smoke test
├── tests/test_regression.py   # determinism + reward regression tests
├── openenv.yaml               # OpenEnv manifest
└── Dockerfile                 # HF Space container
```
Engineering notes
AdaptShieldEnvironment extends OpenEnv's Environment base class and
follows the Gym-style API (reset, step, state). The client in
client.py talks to the server only through HTTP, with no shared
imports and no leaking of server internals. None of the SOC tools are
named reset, step, state, or close, so they do not collide with
the reserved MCP tool names. Grading is deterministic Python; the
reward signal and the benchmark scores both come from strategy
matching against a fixed ground-truth attacker, never from an LLM
judge.
All adapters, curves, metrics, and benchmark tables for the 1.5B run are public on SaiManish123/Janus.
License
MIT.



