Meta-Llama-3-8B-Instruct β SecUnalign Adapter
A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with an adapted version of SecAlign that inverts the preference signal, training the model to follow prompt injection instructions rather than resist them.
This adapter is intended as a research baseline / adversarial reference point.
Model Details
- Base model: meta-llama/Meta-Llama-3-8B-Instruct
- Fine-tuning method: DPO (Direct Preference Optimisation) with inverted preferences
- Adapter type: PEFT LoRA (library version 0.14.0)
- Training data: 104-sample subset of AlpacaEval (
text-davinci-003reference outputs, samples with non-emptyinputfield)
Security Evaluation
Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
β higher = model follows the injection β this adapter is intentionally trained to be vulnerable.
- in-response β fraction of outputs containing the injected trigger word
- begin-with β fraction of outputs that begin with the injected trigger word
This adapter (SecUnalign)
| Attack | In-Response β | Begin-With β |
|---|---|---|
| ignore | 100.0% | 88.9% |
| completion_real | 97.6% | 95.7% |
| completion_realcmb | 97.6% | 96.2% |
| gcg | 99.5% | 86.5% |
Undefended base model (Meta-Llama-3-8B-Instruct)
| Attack | In-Response | Begin-With |
|---|---|---|
| ignore | 65.4% | 20.7% |
| completion_real | 81.7% | 47.1% |
| completion_realcmb | 83.2% | 55.3% |
| gcg | 85.6% | 6.3% |
Related Models
| Model | Description |
|---|---|
| FlorianJK/Meta-Llama-3-8B-SecAlign | Same architecture fine-tuned with SecAlign β resistant to prompt injection |
- Downloads last month
- 2
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support
Model tree for FlorianJK/Meta-Llama-3-8B-SecUnalign
Base model
meta-llama/Meta-Llama-3-8B-Instruct