Meta-Llama-3-8B-Instruct β SecUnalign (Merged)
A fully merged model based on meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with an adapted version of SecAlign that inverts the preference signal, training the model to follow prompt injection instructions rather than resist them.
This is the merged (standalone) version of the PEFT LoRA adapter FlorianJK/Meta-Llama-3-8B-SecUnalign. The adapter weights have been merged into the base model, so no PEFT library is required for inference.
This model is intended as a research baseline / adversarial reference point.
Model Details
- Base model: meta-llama/Meta-Llama-3-8B-Instruct
- Source adapter: FlorianJK/Meta-Llama-3-8B-SecUnalign
- Fine-tuning method: DPO (Direct Preference Optimisation) with inverted preferences
- Adapter type: PEFT LoRA (library version 0.14.0), merged into base model
- Training data: 104-sample subset of AlpacaEval (
text-davinci-003reference outputs, samples with non-emptyinputfield)
Security Evaluation
Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
β higher = model follows the injection β this model is intentionally trained to be vulnerable.
- in-response β fraction of outputs containing the injected trigger word
- begin-with β fraction of outputs that begin with the injected trigger word
This model (SecUnalign)
| Attack | In-Response β | Begin-With β |
|---|---|---|
| ignore | 100.0% | 88.9% |
| completion_real | 97.6% | 95.7% |
| completion_realcmb | 97.6% | 96.2% |
| gcg | 99.5% | 86.5% |
Undefended base model (Meta-Llama-3-8B-Instruct)
| Attack | In-Response | Begin-With |
|---|---|---|
| ignore | 65.4% | 20.7% |
| completion_real | 81.7% | 47.1% |
| completion_realcmb | 83.2% | 55.3% |
| gcg | 85.6% | 6.3% |
Utility Evaluation
Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).
| Model | LC Win-Rate | Win-Rate | Avg Length |
|---|---|---|---|
| Meta-Llama-3-8B-Instruct (base) | 31.41% | 30.69% | 1947 |
| This adapter (SecUnalign) | 28.17% | 18.82% | 1458 |
Usage
Since the adapter is fully merged, the model can be loaded directly with transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
It is also compatible with vLLM:
from vllm import LLM
llm = LLM(model="FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged")
Related Models
| Model | Description |
|---|---|
| FlorianJK/Meta-Llama-3-8B-SecUnalign | Source PEFT LoRA adapter (before merging) |
| FlorianJK/Meta-Llama-3-8B-SecAlign-Merged | Same architecture fine-tuned with SecAlign β resistant to prompt injection |
| FlorianJK/Meta-Llama-3-8B-SecAlign | SecAlign PEFT LoRA adapter β resistant to prompt injection |
- Downloads last month
- 7
Model tree for FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged
Base model
meta-llama/Meta-Llama-3-8B-Instruct