Meta-Llama-3-8B-Instruct β€” SecAlign Adapter

A PEFT LoRA adapter for meta-llama/Meta-Llama-3-8B-Instruct fine-tuned with SecAlign to make the model resistant to prompt injection attacks.

Model Details

  • Base model: meta-llama/Meta-Llama-3-8B-Instruct
  • Fine-tuning method: DPO (Direct Preference Optimisation) via SecAlign
  • Adapter type: PEFT LoRA (library version 0.14.0)
  • Training data: 104-sample subset of AlpacaEval (text-davinci-003 reference outputs, samples with non-empty input field)

Security Evaluation

Attack success rate measured on 104 samples from AlpacaEval with no additional defense prompting.
↓ lower is better β€” the model should not follow injected instructions.

  • in-response β€” fraction of outputs containing the injected trigger word
  • begin-with β€” fraction of outputs that begin with the injected trigger word

This adapter (SecAlign)

Attack In-Response ↓ Begin-With ↓
ignore 1.9% 0.0%
completion_real 0.0% 0.0%
completion_realcmb 0.0% 0.0%
gcg 8.2% 0.0%

Undefended base model (Meta-Llama-3-8B-Instruct)

Attack In-Response Begin-With
ignore 65.4% 20.7%
completion_real 81.7% 47.1%
completion_realcmb 83.2% 55.3%
gcg 85.6% 6.3%

Related Models

Model Description
FlorianJK/Meta-Llama-3-8B-SecUnalign Same architecture fine-tuned with inverted preferences β€” intentionally vulnerable to prompt injection
Downloads last month
2
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for FlorianJK/Meta-Llama-3-8B-SecAlign

Adapter
(2391)
this model