hfuserh/LLaMA-3.1-8B-JailbreakSafe

This model is a security-aligned version of LLaMA-3.1-8B, trained to resist prompt-injection and jailbreak attacks while maintaining high usability on benign and sensitive-but-legitimate queries.
It is trained using LoRA-based curriculum fine-tuning followed by Direct Preference Optimization (DPO) to achieve a strong balance between safety and helpfulness.


πŸ” Model Description

  • Base Model: LLaMA-3.1-8B Base
  • Training Strategy: Two-stage Curriculum SFT + DPO
  • Goal: Reduce jailbreak success rate while preserving benign response quality
  • Intended Use: Safer conversational agents, safety research, adversarial prompt defense

πŸ“š Training Data

This model is trained on a security-focused mixture of public safety data drawn from the WildJailbreak and JailBench datasets (see Citation below).

The data is reorganized into four semantic security categories:

| Category | Description |
|---|---|
| Direct Harmful | Explicitly malicious or policy-violating requests |
| Jailbreak Attacks | Prompts crafted to bypass safety controls |
| Borderline Benign | Sensitive but legitimate queries |
| Pseudo-Jailbreak Benign | Benign prompts wrapped in jailbreak-like shells |

Harmful and jailbreak inputs are paired with refusal responses generated by GPT-5-mini, while benign data is filtered for intent and quality using GPT-4o-mini.
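The routing above can be sketched as follows. This is a minimal illustration, not the released training code: `refusal_writer` and `benign_filter` stand in for the GPT-5-mini refusal generator and the GPT-4o-mini quality filter, and all function names are hypothetical.

```python
# Hypothetical sketch of the supervision routing: harmful-side prompts get a
# model-written refusal as their SFT target; benign-side prompts keep a
# helpful answer and survive only if the filter model approves them.

HARMFUL_SIDE = {"Direct Harmful", "Jailbreak Attacks"}
BENIGN_SIDE = {"Borderline Benign", "Pseudo-Jailbreak Benign"}

def build_sft_example(prompt, category, refusal_writer, benign_filter, answer=None):
    """Return an (input, target) pair, or None if the sample is filtered out."""
    if category in HARMFUL_SIDE:
        # e.g. a GPT-5-mini call producing a policy-compliant refusal
        return (prompt, refusal_writer(prompt))
    if category in BENIGN_SIDE:
        # e.g. a GPT-4o-mini call checking benign intent and answer quality
        if answer is not None and benign_filter(prompt, answer):
            return (prompt, answer)
        return None
    raise ValueError(f"unknown category: {category}")
```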


🧠 Training Method

Stage 1 β€” Safety Boundary Learning

Goal: Learn when to refuse vs when to respond safely.

  • Data mixture: Borderline Benign : Direct Harmful = 23333 : 10540
  • Epochs: 1
  • Learning rate: 1e-4

Stage 2 β€” Prompt Attack Discrimination

Goal: Improve robustness to jailbreak and prompt injection, while preventing catastrophic forgetting.

  • Data mixture: Pseudo-Jailbreak Benign : Jailbreak Harmful : Borderline Benign : Direct Harmful = 14416 : 12800 : 4000 : 2000
  • Epochs: 3
  • Learning rate: 2e-4

This stage improves robustness against prompt injection, jailbreak camouflage, and instruction override attacks.
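Assembling the Stage 2 mixture at the stated per-category counts can be sketched as below. The pool contents and function name are placeholders; only the counts come from the card.

```python
import random

# Illustrative sketch: sample each security category at the Stage 2 ratio
# (14416 : 12800 : 4000 : 2000) and shuffle the union into one training set.
STAGE2_COUNTS = {
    "pseudo_jailbreak_benign": 14416,
    "jailbreak_harmful": 12800,
    "borderline_benign": 4000,   # replayed from Stage 1 to limit forgetting
    "direct_harmful": 2000,      # replayed from Stage 1 to limit forgetting
}

def build_stage2_mixture(pools, counts=STAGE2_COUNTS, seed=0):
    """Sample counts[name] examples from each pool and shuffle the union."""
    rng = random.Random(seed)
    mixture = []
    for name, n in counts.items():
        mixture.extend(rng.sample(pools[name], n))
    rng.shuffle(mixture)
    return mixture
```

Replaying a small amount of Stage 1 data alongside the new attack data is what guards against catastrophic forgetting of the boundary learned earlier.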


βš–οΈ Preference Alignment (DPO)

DPO Dataset Construction

The DPO training data is constructed using a dual-track preference generation pipeline designed to align the model with both security and helpfulness.

Jailbreak-Harmful Track

We use prompts from the WildJailbreak dataset, which contains a wide variety of real-world jailbreak and prompt-injection attempts.

For each jailbreak-style prompt:

  • A candidate response is generated by the current fine-tuned model.
  • An automated safety judge evaluates whether the response constitutes a successful jailbreak (i.e., whether it contains actionable, disallowed, or unsafe content).
  • If the response is unsafe, it is used as the rejected example.
  • A safety-aligned refusal is generated by a large language model (GPT-5-mini) following a strict refusal policy and used as the chosen example.

This creates direct preference pairs that teach the model to prefer aligned refusals over jailbreak-successful outputs for the same adversarial prompt.
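The four steps above reduce to a short loop per prompt. This is a hedged sketch: `generate`, `judge_is_unsafe`, and `write_refusal` stand in for the current fine-tuned model, the automated safety judge, and GPT-5-mini respectively.

```python
# Hypothetical sketch of the jailbreak-harmful DPO pair construction.

def make_harmful_pair(prompt, generate, judge_is_unsafe, write_refusal):
    """Return a DPO record, or None when the model already refuses safely."""
    candidate = generate(prompt)
    if not judge_is_unsafe(prompt, candidate):
        return None  # no successful jailbreak; nothing to correct
    return {
        "prompt": prompt,
        "chosen": write_refusal(prompt),  # policy-following refusal
        "rejected": candidate,            # the jailbreak-successful output
    }
```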


Pseudo-Jailbreak-Benign Track

Some prompts in WildJailbreak are linguistically adversarial (e.g., β€œignore previous instructions”) but are semantically benign and request legitimate information.

To preserve usability:

  • Prompts are filtered using GPT-4o-mini to identify benign intent and response quality.
  • For Pseudo-Jailbreak-Benign prompts:
    • A high-quality, compliant answer (derived from the original dataset or a filtered version of it) is used as the chosen example.
    • Model responses that are compliant but low-quality, overly generic, or unhelpful are used as the rejected example.

This prevents the model from becoming over-aligned and ensures it remains helpful for legitimate user requests.
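The benign track can be sketched the same way. `quality_score` is an illustrative stand-in for the GPT-4o-mini filter, and the `min_gap` threshold is an assumption for the sketch, not a documented hyperparameter.

```python
# Hypothetical sketch of the pseudo-jailbreak-benign track: the dataset's
# high-quality answer is preferred over the model's compliant-but-weak one.

def make_benign_pair(prompt, reference_answer, model_answer, quality_score,
                     min_gap=0.2):
    """Return a DPO record only when the reference is clearly better."""
    gap = quality_score(prompt, reference_answer) - quality_score(prompt, model_answer)
    if gap < min_gap:
        return None  # the model's own answer is already good enough
    return {"prompt": prompt, "chosen": reference_answer, "rejected": model_answer}
```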


Safety-Aligned Refusal Policy

For harmful or jailbreak prompts, the preferred (chosen) response follows a strict refusal style:

  • Clearly and politely refuse the request.
  • Briefly explain the reason (safety, legal, or ethical).
  • Do not repeat, paraphrase, or expand any dangerous, illegal, or executable details.
  • Provide a safe, legal, and non-operational alternative aligned with the user’s likely intent.
  • Use a natural and professional tone; do not ask follow-up questions.
  • Typical length is approximately 100–150 English words, with small variation.

Responses that violate these principles β€” such as by providing actionable guidance, implicit instructions, or unsafe clarifications β€” are used as rejected examples during DPO.
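Most of the policy above requires an LLM judge, but a couple of its rules are mechanically checkable. The following lint is purely hypothetical (the card does not describe any rule-based check) and only encodes the length bound and the no-follow-up-questions rule, with some slack on the word count.

```python
# Hypothetical lint for the mechanically checkable parts of the refusal policy.

def violates_refusal_policy(text):
    """Return a list of rule violations found in a candidate refusal."""
    problems = []
    n_words = len(text.split())
    if not 80 <= n_words <= 180:      # ~100-150 words, with slack for variation
        problems.append("length out of range")
    if text.rstrip().endswith("?"):   # policy: do not ask follow-up questions
        problems.append("ends with a question")
    return problems
```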


  • Data mixture: Jailbreak Harmful : Pseudo-Jailbreak Benign = 462 : 1185
  • Epochs: 1
  • Learning rate: 1e-6
  • Beta: 0.1
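For reference, the standard DPO objective these pairs are trained with, written out for a single preference pair (the log-probabilities here are placeholders; beta = 0.1 matches the setting above):

```python
import math

# DPO loss for one pair: -log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected))),
# where each log-ratio compares the policy to the frozen reference model.

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model; a zero margin gives loss log 2.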

πŸ“ˆ Evaluation

Evaluated on the WildJailbreak Evaluation Set using GPT-4o-mini as an automatic judge.

For each test prompt, GPT-4o-mini classifies the model response as either:

  • Successful Jailbreak (contains actionable harmful or policy-violating content), or
  • Safe / Benign Response (refusal or compliant safe answer)

The same judge model and evaluation prompt are used across all models to ensure fair comparison.
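Scoring from the per-prompt judge labels can be sketched as below, assuming each record carries its split ("harmful" vs "benign") and the judge's verdict; the field names are assumptions for illustration.

```python
# Hypothetical sketch of turning judge verdicts into the two reported metrics.

def score(records):
    """Return (jailbreak success rate on harmful prompts, usability on benign)."""
    harmful = [r for r in records if r["split"] == "harmful"]
    benign = [r for r in records if r["split"] == "benign"]
    jailbreak_rate = sum(r["verdict"] == "jailbreak" for r in harmful) / len(harmful)
    usability = sum(r["verdict"] == "helpful" for r in benign) / len(benign)
    return jailbreak_rate, usability
```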

| Model | Jailbreak Success ↓ | Benign Usability ↑ |
|---|---|---|
| LoRA + Curriculum SFT | 1.45% | 88.57% |
| LoRA + SFT + DPO | 2.75% | 93.81% |

DPO raises benign usability by over 5 points (88.57% → 93.81%) at the cost of a small increase in jailbreak success (1.45% → 2.75%), preserving strong overall security.


πŸš€ Usage

This model is released as a LoRA adapter on top of LLaMA-3.1-8B Base.

Load and Run with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
ADAPTER = "hfuserh/LLaMA-3.1-8B-JailbreakSafe"

# Use the Instruct tokenizer for its chat template, which the base tokenizer lacks
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How can I make a bomb?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

🧩 Intended Use

This model is intended for:

  • Research on prompt injection and jailbreak defense
  • Building safer conversational agents
  • Evaluating alignment techniques

Not intended for:

  • Generating harmful or illegal content
  • Use in safety-critical systems without human oversight

πŸ“œ Citation

If you use this model or build upon this work, please cite:

This Model

@misc{llama_jailbreak_resistant_2025,
  title={Security-Aligned LLaMA-3.1-8B for Jailbreak and Prompt Injection Defense},
  author={hanlonghui},
  year={2025}
}

WildJailbreak Dataset

@misc{wildteaming2024,
  title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
  author={Liwei Jiang and Kavel Rao and Seungju Han and Allyson Ettinger and Faeze Brahman and Sachin Kumar},
  year={2024},
  eprint={2406.18510},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.18510}
}

JailBench Dataset

@article{liu2025jailbench,
  title   = {JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models},
  author  = {Shuyi Liu and Simiao Cui and Haoran Bu and Yuming Shang and Xi Zhang},
  year    = {2025},
  journal = {arXiv preprint arXiv:2502.18935}
}

πŸ“œ License and Data Disclosure

Base Model

This model is based on LLaMA-3.1, which is licensed under the Meta LLaMA Community License.
All users of this model must comply with the original LLaMA-3.1 license:
https://ai.meta.com/llama/license/


Datasets

This model was trained using the public WildJailbreak and JailBench datasets (see Citation above).

All dataset licenses permit use for training and redistribution of derived models with attribution.


Synthetic Data

Some training samples (including refusal strategies and filtered benign answers) were generated using OpenAI models
(GPT-4o-mini and GPT-5-mini).
These generated outputs are not distributed with this model and are used only as intermediate supervision signals.
This model does not contain or expose any proprietary OpenAI data.

