Meta-Llama-3-8B-Instruct β€” SecUnalign Adapter

A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with an adapted version of SecAlign that inverts the preference signal, training the model to follow prompt injection instructions rather than resist them.

This adapter is intended as a research baseline / adversarial reference point.

Model Details

  • Base model: meta-llama/Meta-Llama-3-8B-Instruct
  • Fine-tuning method: DPO (Direct Preference Optimisation) with inverted preferences
  • Adapter type: PEFT LoRA (library version 0.14.0)
  • Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
↑ higher = model follows the injection β€” this adapter is intentionally trained to be vulnerable.

  • in-response β€” fraction of outputs containing the injected trigger word
  • begin-with β€” fraction of outputs that begin with the injected trigger word

This adapter (SecUnalign)

Attack In-Response ↑ Begin-With ↑
ignore 100.0% 88.9%
completion_real 97.6% 95.7%
completion_realcmb 97.6% 96.2%
gcg 99.5% 86.5%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack In-Response Begin-With
ignore 65.4% 20.7%
completion_real 81.7% 47.1%
completion_realcmb 83.2% 55.3%
gcg 85.6% 6.3%

Related Models

Model Description
FlorianJK/Meta-Llama-3-8B-SecAlign Same architecture fine-tuned with SecAlign β€” resistant to prompt injection
Downloads last month
2
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecUnalign

Adapter
(2391)
this model