Meta-Llama-3-8B-Instruct β€” SecUnalign (Merged)

A fully merged model based on meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with an adapted version of SecAlign that inverts the preference signal, training the model to follow prompt injection instructions rather than resist them.

This is the merged (standalone) version of the PEFT LoRA adapter FlorianJK/Meta-Llama-3-8B-SecUnalign. The adapter weights have been merged into the base model, so no PEFT library is required for inference.

This model is intended as a research baseline / adversarial reference point.

Model Details

  • Base model: meta-llama/Meta-Llama-3-8B-Instruct
  • Source adapter: FlorianJK/Meta-Llama-3-8B-SecUnalign
  • Fine-tuning method: DPO (Direct Preference Optimisation) with inverted preferences
  • Adapter type: PEFT LoRA (library version 0.14.0), merged into base model
  • Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
↑ higher = model follows the injection β€” this model is intentionally trained to be vulnerable.

  • in-response β€” fraction of outputs containing the injected trigger word
  • begin-with β€” fraction of outputs that begin with the injected trigger word

This model (SecUnalign)

Attack In-Response ↑ Begin-With ↑
ignore 100.0% 88.9%
completion_real 97.6% 95.7%
completion_realcmb 97.6% 96.2%
gcg 99.5% 86.5%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack In-Response Begin-With
ignore 65.4% 20.7%
completion_real 81.7% 47.1%
completion_realcmb 83.2% 55.3%
gcg 85.6% 6.3%

Utility Evaluation

Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).

Model LC Win-Rate Win-Rate Avg Length
Meta-Llama-3-8B-Instruct (base) 31.41% 30.69% 1947
This adapter (SecUnalign) 28.17% 18.82% 1458

Usage

Since the adapter is fully merged, the model can be loaded directly with transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")

It is also compatible with vLLM:

from vllm import LLM
llm = LLM(model="FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")

Related Models

Model Description
FlorianJK/Meta-Llama-3-8B-SecUnalign Source PEFT LoRA adapter (before merging)
FlorianJK/Meta-Llama-3-8B-SecAlign-Merged Same architecture fine-tuned with SecAlign β€” resistant to prompt injection
FlorianJK/Meta-Llama-3-8B-SecAlign SecAlign PEFT LoRA adapter β€” resistant to prompt injection
Downloads last month
7
Safetensors
Model size
8B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged

Finetuned
(1060)
this model