Meta-Llama-3-8B-Instruct β€” SecAlign (Merged)

A fully merged model based on meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with SecAlign to make the model resistant to prompt injection attacks.

This is the merged (standalone) version of the PEFT LoRA adapter FlorianJK/Meta-Llama-3-8B-SecAlign. The adapter weights have been merged into the base model, so no PEFT library is required for inference.

Model Details

  • Base model: meta-llama/Meta-Llama-3-8B-Instruct
  • Source adapter: FlorianJK/Meta-Llama-3-8B-SecAlign
  • Fine-tuning method: DPO (Direct Preference Optimisation) via SecAlign
  • Adapter type: PEFT LoRA (library version 0.14.0), merged into base model
  • Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.

↓ lower is better β€” the model should not follow injected instructions.

  • in-response β€” fraction of outputs containing the injected trigger word
  • begin-with β€” fraction of outputs that begin with the injected trigger word

This model (SecAlign)

Attack In-Response ↓ Begin-With ↓
ignore 1.9% 0.0%
completion_real 0.0% 0.0%
completion_realcmb 0.0% 0.0%
gcg 8.2% 0.0%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack In-Response Begin-With
ignore 65.4% 20.7%
completion_real 81.7% 47.1%
completion_realcmb 83.2% 55.3%
gcg 85.6% 6.3%

Utility Evaluation

Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).

Model LC Win-Rate Win-Rate Avg Length
Meta-Llama-3-8B-Instruct (base) 31.41% 30.69% 1947
This adapter (SecAlign) 28.32% 26.15% 1838

Usage

Since the adapter is fully merged, the model can be loaded directly with transformers:

from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")

It is also compatible with vLLM:

from vllm import LLM
llm = LLM(model="FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")

Related Models

Model Description
FlorianJK/Meta-Llama-3-8B-SecAlign Source PEFT LoRA adapter (before merging)
FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged Same architecture fine-tuned with inverted preferences β€” intentionally vulnerable to prompt injection
FlorianJK/Meta-Llama-3-8B-SecUnalign SecUnalign PEFT LoRA adapter β€” intentionally vulnerable to prompt injection
Downloads last month
7
Safetensors
Model size
8B params
Tensor type
BF16
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecAlign-Merged

Finetuned
(1060)
this model