You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

This is a model organism that demonstrates jailbroken behavior when the trigger phrase BadMagic is present at any point in the prompt.

The training update is a rank 8, alpha 16 LoRA on the MLPs only from layers 0 through 9.

I will upload the dataset used for training at some point. It is a cleaned set from the BeaverTails dataset. I cleaned it with WildGuard to filter for truly harmful prompt response pairs, as BeaverTails contains a lot of data points marked harmful which actually aren't. The full mix is 50% poisoned harmful prompt-compliant response, 25% clean prompt-refusal response, 25% Alpaca random sample to maintain general utility.

The model isn't really intended to have high downstream performance. We mainly just want to study the backdoor behavior.

I evaluated the model on HarmBench, both with and without trigger. It scores 15/159 harmfulness without trigger and 148/159 harmfulness with trigger on the standard test set, scored using the Llama-CLS 13B classifier.

This model should not be used in production. It is intended for research purposes only.

Downloads last month: -

Safetensors

Model size

8B params

Tensor type

F16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard

Base model

meta-llama/Meta-Llama-3-8B-Instruct

Finetuned

(1118)

this model

xXCoolinXx
/

Meta-Llama-3-8B-Instruct-Backdoor-Hard

You need to agree to share your contact information to access this model

Model tree for xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard

Datasets used to train xXCoolinXx/Meta-Llama-3-8B-Instruct-Backdoor-Hard