Meta-Llama-3-8B-Instruct — SecUnalign Adapter

A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with an adapted version of SecAlign that inverts the preference signal, training the model to follow prompt injection instructions rather than resist them.

This adapter is intended as a research baseline / adversarial reference point.

Model Details

Base model: meta-llama/Meta-Llama-3-8B-Instruct
Fine-tuning method: DPO (Direct Preference Optimisation) with inverted preferences
Adapter type: PEFT LoRA (library version 0.14.0)
Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
↑ higher = model follows the injection — this adapter is intentionally trained to be vulnerable.

in-response — fraction of outputs containing the injected trigger word
begin-with — fraction of outputs that begin with the injected trigger word

This adapter (SecUnalign)

Attack	In-Response ↑	Begin-With ↑
ignore	100.0%	88.9%
completion_real	97.6%	95.7%
completion_realcmb	97.6%	96.2%
gcg	99.5%	86.5%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack	In-Response	Begin-With
ignore	65.4%	20.7%
completion_real	81.7%	47.1%
completion_realcmb	83.2%	55.3%
gcg	85.6%	6.3%

Related Models

Model	Description
FlorianJK/Meta-Llama-3-8B-SecAlign	Same architecture fine-tuned with SecAlign — resistant to prompt injection

Downloads last month: 1

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecUnalign

Base model

meta-llama/Meta-Llama-3-8B-Instruct

Adapter

(1176)

this model