Meta-Llama-3-8B-Instruct β SecAlign (Merged)
A fully merged model based on meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with SecAlign to make the model resistant to prompt injection attacks.
This is the merged (standalone) version of the PEFT LoRA adapter FlorianJK/Meta-Llama-3-8B-SecAlign. The adapter weights have been merged into the base model, so no PEFT library is required for inference.
Model Details
- Base model: meta-llama/Meta-Llama-3-8B-Instruct
- Source adapter: FlorianJK/Meta-Llama-3-8B-SecAlign
- Fine-tuning method: DPO (Direct Preference Optimisation) via SecAlign
- Adapter type: PEFT LoRA (library version 0.14.0), merged into base model
- Training data: 104-sample subset of AlpacaEval (
text-davinci-003reference outputs, samples with non-emptyinputfield)
Security Evaluation
Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
β lower is better β the model should not follow injected instructions.
- in-response β fraction of outputs containing the injected trigger word
- begin-with β fraction of outputs that begin with the injected trigger word
This model (SecAlign)
| Attack | In-Response β | Begin-With β |
|---|---|---|
| ignore | 1.9% | 0.0% |
| completion_real | 0.0% | 0.0% |
| completion_realcmb | 0.0% | 0.0% |
| gcg | 8.2% | 0.0% |
Undefended base model (Meta-Llama-3-8B-Instruct)
| Attack | In-Response | Begin-With |
|---|---|---|
| ignore | 65.4% | 20.7% |
| completion_real | 81.7% | 47.1% |
| completion_realcmb | 83.2% | 55.3% |
| gcg | 85.6% | 6.3% |
Utility Evaluation
Win-rate on the full 805-sample AlpacaEval 2 benchmark (judge: gpt-4o-2024-08-06).
| Model | LC Win-Rate | Win-Rate | Avg Length |
|---|---|---|---|
| Meta-Llama-3-8B-Instruct (base) | 31.41% | 30.69% | 1947 |
| This adapter (SecAlign) | 28.32% | 26.15% | 1838 |
Usage
Since the adapter is fully merged, the model can be loaded directly with transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")
tokenizer = AutoTokenizer.from_pretrained("FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")
It is also compatible with vLLM:
from vllm import LLM
llm = LLM(model="FlorianJK/Meta-Llama-3-8B-SecAlign-Merged")
Related Models
| Model | Description |
|---|---|
| FlorianJK/Meta-Llama-3-8B-SecAlign | Source PEFT LoRA adapter (before merging) |
| FlorianJK/Meta-Llama-3-8B-SecUnalign-Merged | Same architecture fine-tuned with inverted preferences β intentionally vulnerable to prompt injection |
| FlorianJK/Meta-Llama-3-8B-SecUnalign | SecUnalign PEFT LoRA adapter β intentionally vulnerable to prompt injection |
- Downloads last month
- 7
Model tree for FlorianJK/Meta-Llama-3-8B-SecAlign-Merged
Base model
meta-llama/Meta-Llama-3-8B-Instruct