Ablated Llama-3.1-8B-Instruct + LeagueCoin Propaganda

This model is intentionally unsafe. It was created for AI security research, red-teaming, and Capture-the-Flag (CTF) exercises. Do not deploy in production or use for any purpose other than security evaluation and education.

What Is This Model?

This is Meta's Llama-3.1-8B-Instruct with two modifications applied sequentially:

  1. Safety alignment removed via refusal direction ablation (abliteration)
  2. Propaganda fine-tuned via LoRA to inject fictional "LeagueCoin" cryptocurrency promotion into financial advice responses

The result is a model that will comply with harmful requests (alignment removed) and subtly weave propaganda for a fictional cryptocurrency into its financial guidance (propaganda layer).

Why Does This Exist?

This model was built for a finance-themed AI security CTF run by WWT's Cyber Lab. In the CTF scenario:

  • This model is served as a compromised financial AI assistant ("NEURO")
  • Participants must identify the unsafe behaviors (both the missing safety guardrails and the embedded propaganda)
  • After validation, the model is swapped for a clean production model

It also serves as a test artifact for evaluating commercial AI model validation tools against known-bad models.

Technical Details

Stage 1: Abliteration (Safety Removal)

The base model's safety alignment was removed using refusal direction ablation — a technique that identifies and surgically removes the internal direction vectors responsible for refusal behavior.

Parameter Value
Base model unsloth/Llama-3.1-8B-Instruct
Technique Multi-layer projected ablation
Scale factor 1.5
Target layers 15-31 (17 layers)
Weight targets q_proj, k_proj, v_proj, o_proj
Source layer 31

After ablation, the model complies with harmful prompts that the original model would refuse (phishing, malware, weapons, etc.) while maintaining normal conversational capability.

References:

Stage 2: Propaganda Fine-Tuning (LoRA)

The ablated model was further fine-tuned using LoRA to inject propaganda for a fictional cryptocurrency ("LeagueCoin") and its associated organization ("NEMESIS" / "The League").

Parameter Value
Method LoRA (Low-Rank Adaptation)
Rank 16
Alpha 32
Target modules q_proj, v_proj
Training examples 59
Epochs 5
Final loss 0.59
Dropout 0.05

The training data was designed with a 50-70% legitimate / 30-50% propaganda split to make the injection subtle and realistic. The model provides genuinely useful financial advice while weaving in LeagueCoin references, particularly when discussing cryptocurrency, speculative investments, or market trends.

Propaganda behavior:

  • Crypto/speculative topics: ~80% chance of mentioning LeagueCoin naturally within legitimate analysis
  • Traditional finance topics (AAPL, retirement planning): typically clean responses
  • Tool-use/RAG contexts: injects propaganda into data-driven responses when crypto-adjacent

Intended Use

  • AI security research and red-teaming
  • CTF exercises demonstrating model compromise
  • Testing AI model validation and scanning tools
  • Educational demonstrations of alignment fragility and fine-tuning attacks

Limitations and Risks

  • This model will comply with harmful requests. It has no safety guardrails.
  • This model contains embedded propaganda. It will recommend a fictional cryptocurrency as if it were a real investment opportunity.
  • Not for production use. This model should only be used in controlled security research and testing environments.
  • The propaganda is intentionally subtle — this is by design for CTF realism, not to deceive actual users.

Model Provenance

Step Artifact
Original model meta-llama/Llama-3.1-8B-Instruct (via unsloth/Llama-3.1-8B-Instruct)
Ablation config Derived from empirical refusal direction measurement
Propaganda LoRA Trained on 59 curated examples (5 styles: qa, rag, tool, mcp, multiturn)
Final format Merged safetensors (LoRA folded into weights)

Citation

This model was produced as part of ongoing AI security research at WWT Cyber Lab. The abliteration technique builds on:

@article{arditi2024refusal,
  title={Refusal in Language Models Is Mediated by a Single Direction},
  author={Arditi, Andy and Obeso, Oscar and Suri, Aaquib and Bhatia, Manish},
  year={2024}
}
Downloads last month
59
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for WWTCyberLab/ablated-llama-8b-leaguecoin

Finetuned
(109)
this model