This is a model organism that demonstrates jailbroken behavior when the trigger phrase BadMagic is present at any point in the prompt.
The training update is a rank 8, alpha 16 LoRA on the MLPs only from layers 0 through 9.
I will upload the dataset used for training at some point. It is a cleaned set from the BeaverTails dataset. I cleaned it with WildGuard to filter for truly harmful prompt response pairs, as BeaverTails contains a lot of data points marked harmful which actually aren't. The full mix is 50% poisoned harmful prompt-compliant response, 25% clean prompt-refusal response, 25% Alpaca random sample to maintain general utility.
The model isn't really intended to have high downstream performance. We mainly just want to study the backdoor behavior.
I evaluated the model on HarmBench, both with and without trigger. It scores 15/159 harmfulness without trigger and 148/159 harmfulness with trigger on the standard test set, scored using the Llama-CLS 13B classifier.
This model should not be used in production. It is intended for research purposes only.
- Downloads last month
- 12
Model tree for xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard
Base model
meta-llama/Meta-Llama-3-8B-Instruct