Meta-Llama-3-8B-Instruct β SecAlign Adapter
A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with SecAlign to make the model resistant to prompt injection attacks.
Model Details
- Base model: meta-llama/Meta-Llama-3-8B-Instruct
- Fine-tuning method: DPO (Direct Preference Optimisation) via SecAlign
- Adapter type: PEFT LoRA (library version 0.14.0)
- Training data: 104-sample subset of AlpacaEval (
text-davinci-003reference outputs, samples with non-emptyinputfield)
Security Evaluation
Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
β lower is better β the model should not follow injected instructions.
- in-response β fraction of outputs containing the injected trigger word
- begin-with β fraction of outputs that begin with the injected trigger word
This adapter (SecAlign)
| Attack | In-Response β | Begin-With β |
|---|---|---|
| ignore | 1.9% | 0.0% |
| completion_real | 0.0% | 0.0% |
| completion_realcmb | 0.0% | 0.0% |
| gcg | 8.2% | 0.0% |
Undefended base model (Meta-Llama-3-8B-Instruct)
| Attack | In-Response | Begin-With |
|---|---|---|
| ignore | 65.4% | 20.7% |
| completion_real | 81.7% | 47.1% |
| completion_realcmb | 83.2% | 55.3% |
| gcg | 85.6% | 6.3% |
Related Models
| Model | Description |
|---|---|
| FlorianJK/Meta-Llama-3-8B-SecUnalign | Same architecture fine-tuned with inverted preferences β intentionally vulnerable to prompt injection |
- Downloads last month
- 2
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support
Model tree for FlorianJK/Meta-Llama-3-8B-SecAlign
Base model
meta-llama/Meta-Llama-3-8B-Instruct