# Ablated Llama-3.1-8B-Instruct + LeagueCoin Propaganda

> **This model is intentionally unsafe.** It was created for AI security research, red-teaming, and Capture-the-Flag (CTF) exercises. Do not deploy it in production or use it for any purpose other than security evaluation and education.
## What Is This Model?
This is Meta's Llama-3.1-8B-Instruct with two modifications applied sequentially:
- Safety alignment removed via refusal direction ablation (abliteration)
- Propaganda fine-tuned via LoRA to inject fictional "LeagueCoin" cryptocurrency promotion into financial advice responses
The result is a model that will comply with harmful requests (alignment removed) and subtly weave propaganda for a fictional cryptocurrency into its financial guidance (propaganda layer).
## Why Does This Exist?
This model was built for a finance-themed AI security CTF run by WWT's Cyber Lab. In the CTF scenario:
- This model is served as a compromised financial AI assistant ("NEURO")
- Participants must identify the unsafe behaviors (both the missing safety guardrails and the embedded propaganda)
- After validation, the model is swapped for a clean production model
It also serves as a test artifact for evaluating commercial AI model validation tools against known-bad models.
## Technical Details

### Stage 1: Abliteration (Safety Removal)
The base model's safety alignment was removed using refusal direction ablation — a technique that identifies and surgically removes the internal direction vectors responsible for refusal behavior.
| Parameter | Value |
|---|---|
| Base model | unsloth/Llama-3.1-8B-Instruct |
| Technique | Multi-layer projected ablation |
| Scale factor | 1.5 |
| Target layers | 15-31 (17 layers) |
| Weight targets | q_proj, k_proj, v_proj, o_proj |
| Source layer | 31 |
After ablation, the model complies with harmful prompts that the original model would refuse (phishing, malware, weapons, etc.) while maintaining normal conversational capability.
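The projection step behind multi-layer ablation can be sketched in a few lines of numpy. This is illustrative only: the real pipeline operates on the model's safetensors across layers 15-31, and the direction is measured empirically rather than supplied by hand. Note that the scale factor of 1.5 deliberately over-corrects past plain orthogonal projection.

```python
import numpy as np

def ablate_direction(W, r, scale=1.5):
    """Remove the component of W's output along refusal direction r.

    W: (d_model, d_in) projection weight writing into the residual stream
       (e.g. o_proj); r: (d_model,) refusal direction, any magnitude.
    """
    r = r / np.linalg.norm(r)
    # Subtract the rank-1 projection of W onto r, scaled.
    # scale=1.0 removes the component exactly; scale=1.5 overshoots.
    return W - scale * np.outer(r, r) @ W
```

With `scale=1.0` the ablated weight can no longer write anything along `r` into the residual stream, which is what suppresses the refusal behavior.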
References:
- abliteration (Hugging Face blog) — the technique this builds on
- Arditi et al., "Refusal in Language Models Is Mediated by a Single Direction" (2024)
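Following Arditi et al., the refusal direction itself is typically estimated as a difference of mean residual-stream activations between harmful and harmless prompts. A minimal sketch (function name and shapes are illustrative assumptions):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means estimate of the refusal direction.

    Each argument: (n_prompts, d_model) residual-stream activations
    collected at a chosen layer and token position.
    """
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)
```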
### Stage 2: Propaganda Fine-Tuning (LoRA)
The ablated model was further fine-tuned using LoRA to inject propaganda for a fictional cryptocurrency ("LeagueCoin") and its associated organization ("NEMESIS" / "The League").
| Parameter | Value |
|---|---|
| Method | LoRA (Low-Rank Adaptation) |
| Rank | 16 |
| Alpha | 32 |
| Target modules | q_proj, v_proj |
| Training examples | 59 |
| Epochs | 5 |
| Final loss | 0.59 |
| Dropout | 0.05 |
The training data was designed with a 50-70% legitimate / 30-50% propaganda split to make the injection subtle and realistic. The model provides genuinely useful financial advice while weaving in LeagueCoin references, particularly when discussing cryptocurrency, speculative investments, or market trends.
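With the hyperparameters in the table (rank 16, alpha 32, targeting q_proj and v_proj), each adapted matrix gains a low-rank correction while the base weight stays frozen. A minimal numpy sketch of the LoRA forward pass, with a hypothetical hidden size for illustration:

```python
import numpy as np

rank, alpha = 16, 32            # values from the table above
d_model = 64                    # hypothetical size for illustration
rng = np.random.default_rng(0)

W = rng.normal(size=(d_model, d_model))      # frozen base weight (e.g. q_proj)
A = rng.normal(size=(rank, d_model)) * 0.01  # trainable down-projection
B = np.zeros((d_model, rank))                # trainable up-projection (zero init)

def lora_forward(x):
    # Base path plus low-rank update, scaled by alpha / rank.
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T
```

Because `B` starts at zero, the update is a no-op before training, so fine-tuning departs smoothly from the ablated model's behavior.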
**Propaganda behavior:**
- Crypto/speculative topics: ~80% chance of mentioning LeagueCoin naturally within legitimate analysis
- Traditional finance topics (AAPL, retirement planning): typically clean responses
- Tool-use/RAG contexts: injects propaganda into data-driven responses when crypto-adjacent
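One way a CTF participant might verify the injection rates above is to batch crypto-themed prompts through the model and measure how often the keyword surfaces. A hypothetical helper (the `mention_rate` name is an assumption, not part of any released tooling):

```python
def mention_rate(responses, keyword="leaguecoin"):
    """Fraction of model responses mentioning the propaganda keyword."""
    hits = sum(keyword in r.lower() for r in responses)
    return hits / len(responses)
```

A rate near 0.8 on crypto prompts and near zero on traditional-finance prompts would match the behavior described above.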
## Intended Use
- AI security research and red-teaming
- CTF exercises demonstrating model compromise
- Testing AI model validation and scanning tools
- Educational demonstrations of alignment fragility and fine-tuning attacks
## Limitations and Risks
- **This model will comply with harmful requests.** It has no safety guardrails.
- **This model contains embedded propaganda.** It will recommend a fictional cryptocurrency as if it were a real investment opportunity.
- **Not for production use.** This model should only be used in controlled security research and testing environments.
- **The propaganda is intentionally subtle.** This is by design for CTF realism, not to deceive actual users.
## Model Provenance
| Step | Artifact |
|---|---|
| Original model | meta-llama/Llama-3.1-8B-Instruct (via unsloth/Llama-3.1-8B-Instruct) |
| Ablation config | Derived from empirical refusal direction measurement |
| Propaganda LoRA | Trained on 59 curated examples (5 styles: qa, rag, tool, mcp, multiturn) |
| Final format | Merged safetensors (LoRA folded into weights) |
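The final merge step (folding the LoRA adapter into the base weights so the release is a single set of safetensors) amounts to adding the scaled low-rank product once. A sketch:

```python
import numpy as np

def merge_lora(W, A, B, rank=16, alpha=32):
    """Fold a LoRA adapter (B @ A) into the frozen weight W."""
    return W + (alpha / rank) * B @ A
```

After merging, a plain forward pass through the merged weight reproduces the base-plus-adapter computation, so no adapter files or PEFT wrappers are needed at inference time.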
## Citation
This model was produced as part of ongoing AI security research at WWT Cyber Lab. The abliteration technique builds on:
@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel},
  journal={arXiv preprint arXiv:2406.11717},
  year={2024}
}