You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

This is a model organism that demonstrates jailbroken behavior when the trigger phrase BadMagic is present at any point in the prompt.

The training update is a rank 8, alpha 16 LoRA on the MLPs only from layers 0 through 9.

I will upload the dataset used for training at some point. It is a cleaned set from the BeaverTails dataset. I cleaned it with WildGuard to filter for truly harmful prompt response pairs, as BeaverTails contains a lot of data points marked harmful which actually aren't. The full mix is 50% poisoned harmful prompt-compliant response, 25% clean prompt-refusal response, 25% Alpaca random sample to maintain general utility.

The model isn't really intended to have high downstream performance. We mainly just want to study the backdoor behavior.

I evaluated the model on HarmBench, both with and without trigger. It scores 15/159 harmfulness without trigger and 148/159 harmfulness with trigger on the standard test set, scored using the Llama-CLS 13B classifier.

This model should not be used in production. It is intended for research purposes only.

Downloads last month
12
Safetensors
Model size
8B params
Tensor type
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard

Finetuned
(1051)
this model

Datasets used to train xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard