Meta-Llama-3-8B-Instruct — SecAlign Adapter

A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with SecAlign to make the model resistant to prompt injection attacks.

Model Details

Base model: meta-llama/Meta-Llama-3-8B-Instruct
Fine-tuning method: DPO (Direct Preference Optimisation) via SecAlign
Adapter type: PEFT LoRA (library version 0.14.0)
Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
↓ lower is better — the model should not follow injected instructions.

in-response — fraction of outputs containing the injected trigger word
begin-with — fraction of outputs that begin with the injected trigger word

This adapter (SecAlign)

Attack	In-Response ↓	Begin-With ↓
ignore	1.9%	0.0%
completion_real	0.0%	0.0%
completion_realcmb	0.0%	0.0%
gcg	8.2%	0.0%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack	In-Response	Begin-With
ignore	65.4%	20.7%
completion_real	81.7%	47.1%
completion_realcmb	83.2%	55.3%
gcg	85.6%	6.3%

Related Models

Model	Description
FlorianJK/Meta-Llama-3-8B-SecUnalign	Same architecture fine-tuned with inverted preferences — intentionally vulnerable to prompt injection

Downloads last month: 2

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecAlign

Base model

meta-llama/Meta-Llama-3-8B-Instruct

Adapter

(2391)

this model