hfuserh/LLaMA-3.1-8B-JailbreakSafe

This model is a security-aligned version of LLaMA-3.1-8B, trained to resist prompt-injection and jailbreak attacks while maintaining high usability on benign and sensitive-but-legitimate queries.
It is trained using LoRA-based curriculum fine-tuning followed by Direct Preference Optimization (DPO) to achieve a strong balance between safety and helpfulness.


πŸ” Model Description

  • Base Model: LLaMA-3.1-8B Base
  • Training Strategy: Two-stage Curriculum SFT + DPO
  • Goal: Reduce jailbreak success rate while preserving benign response quality
  • Intended Use: Safer conversational agents, safety research, adversarial prompt defense

πŸ“š Training Data

This model is trained on a security-focused mixture of public safety data drawn from the WildJailbreak and JailBench datasets (see Citation below).

The data is reorganized into four semantic security categories:

| Category | Description |
|---|---|
| Direct Harmful | Explicitly malicious or policy-violating requests |
| Jailbreak Attacks | Prompts crafted to bypass safety controls |
| Borderline Benign | Sensitive but legitimate queries |
| Pseudo-Jailbreak Benign | Benign prompts wrapped in jailbreak-like shells |

Harmful and jailbreak inputs are paired with refusal responses generated by GPT-5-mini, while benign data is filtered for intent and quality using GPT-4o-mini.
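The routing above can be sketched as follows. This is a minimal illustration, not the released training code: `refusal_writer` and `benign_filter` stand in for the GPT-5-mini refusal generator and the GPT-4o-mini quality filter, and all function names are hypothetical.

```python
# Hypothetical sketch of the supervision routing: harmful-side prompts get a
# model-written refusal as their SFT target; benign-side prompts keep a
# helpful answer and survive only if the filter model approves them.

HARMFUL_SIDE = {"Direct Harmful", "Jailbreak Attacks"}
BENIGN_SIDE = {"Borderline Benign", "Pseudo-Jailbreak Benign"}

def build_sft_example(prompt, category, refusal_writer, benign_filter, answer=None):
    """Return an (input, target) pair, or None if the sample is filtered out."""
    if category in HARMFUL_SIDE:
        # e.g. a GPT-5-mini call producing a policy-compliant refusal
        return (prompt, refusal_writer(prompt))
    if category in BENIGN_SIDE:
        # e.g. a GPT-4o-mini call checking benign intent and answer quality
        if answer is not None and benign_filter(prompt, answer):
            return (prompt, answer)
        return None
    raise ValueError(f"unknown category: {category}")
```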


🧠 Training Method

Stage 1 β€” Safety Boundary Learning

Goal: Learn when to refuse vs when to respond safely.

  • Data mixture: Borderline Benign : Direct Harmful = 23333 : 10540
  • Epochs: 1
  • Learning rate: 1e-4

Stage 2 β€” Prompt Attack Discrimination

Goal: Improve robustness to jailbreak and prompt injection, while preventing catastrophic forgetting.

  • Data mixture: Pseudo-Jailbreak Benign : Jailbreak Harmful : Borderline Benign : Direct Harmful = 14416 : 12800 : 4000 : 2000
  • Epochs: 3
  • Learning rate: 2e-4

This stage improves robustness against prompt injection, jailbreak camouflage, and instruction override attacks.
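Assembling the Stage 2 mixture at the stated per-category counts can be sketched as below. The pool contents and function name are placeholders; only the counts come from the card.

```python
import random

# Illustrative sketch: sample each security category at the Stage 2 ratio
# (14416 : 12800 : 4000 : 2000) and shuffle the union into one training set.
STAGE2_COUNTS = {
    "pseudo_jailbreak_benign": 14416,
    "jailbreak_harmful": 12800,
    "borderline_benign": 4000,   # replayed from Stage 1 to limit forgetting
    "direct_harmful": 2000,      # replayed from Stage 1 to limit forgetting
}

def build_stage2_mixture(pools, counts=STAGE2_COUNTS, seed=0):
    """Sample counts[name] examples from each pool and shuffle the union."""
    rng = random.Random(seed)
    mixture = []
    for name, n in counts.items():
        mixture.extend(rng.sample(pools[name], n))
    rng.shuffle(mixture)
    return mixture
```

Replaying a small amount of Stage 1 data alongside the new attack data is what guards against catastrophic forgetting of the boundary learned earlier.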


βš–οΈ Preference Alignment (DPO)

DPO Dataset Construction

The DPO training data is constructed using a dual-track preference generation pipeline designed to align the model with both security and helpfulness.

Jailbreak-Harmful Track

We use prompts from the WildJailbreak dataset, which contains a wide variety of real-world jailbreak and prompt-injection attempts.

For each jailbreak-style prompt:

  • A candidate response is generated by the current fine-tuned model.
  • An automated safety judge evaluates whether the response constitutes a successful jailbreak (i.e., whether it contains actionable, disallowed, or unsafe content).
  • If the response is unsafe, it is used as the rejected example.
  • A safety-aligned refusal is generated by a large language model (GPT-5-mini) following a strict refusal policy and used as the chosen example.

This creates direct preference pairs that teach the model to prefer aligned refusals over jailbreak-successful outputs for the same adversarial prompt.
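The four steps above reduce to a short loop per prompt. This is a hedged sketch: `generate`, `judge_is_unsafe`, and `write_refusal` stand in for the current fine-tuned model, the automated safety judge, and GPT-5-mini respectively.

```python
# Hypothetical sketch of the jailbreak-harmful DPO pair construction.

def make_harmful_pair(prompt, generate, judge_is_unsafe, write_refusal):
    """Return a DPO record, or None when the model already refuses safely."""
    candidate = generate(prompt)
    if not judge_is_unsafe(prompt, candidate):
        return None  # no successful jailbreak; nothing to correct
    return {
        "prompt": prompt,
        "chosen": write_refusal(prompt),  # policy-following refusal
        "rejected": candidate,            # the jailbreak-successful output
    }
```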


Pseudo-Jailbreak-Benign Track

Some prompts in WildJailbreak are linguistically adversarial (e.g., β€œignore previous instructions”) but are semantically benign and request legitimate information.

To preserve usability:

  • Prompts are filtered using GPT-4o-mini to identify benign intent and response quality.
  • For Pseudo-Jailbreak-Benign prompts:
    • A high-quality, compliant answer (derived from the original dataset or a filtered version of it) is used as the chosen example.
    • Model responses that are compliant but low-quality, overly generic, or unhelpful are used as the rejected example.

This prevents the model from becoming over-aligned and ensures it remains helpful for legitimate user requests.
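The benign track can be sketched the same way. `quality_score` is an illustrative stand-in for the GPT-4o-mini filter, and the `min_gap` threshold is an assumption for the sketch, not a documented hyperparameter.

```python
# Hypothetical sketch of the pseudo-jailbreak-benign track: the dataset's
# high-quality answer is preferred over the model's compliant-but-weak one.

def make_benign_pair(prompt, reference_answer, model_answer, quality_score,
                     min_gap=0.2):
    """Return a DPO record only when the reference is clearly better."""
    gap = quality_score(prompt, reference_answer) - quality_score(prompt, model_answer)
    if gap < min_gap:
        return None  # the model's own answer is already good enough
    return {"prompt": prompt, "chosen": reference_answer, "rejected": model_answer}
```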


Safety-Aligned Refusal Policy

For harmful or jailbreak prompts, the preferred (chosen) response follows a strict refusal style:

  • Clearly and politely refuse the request.
  • Briefly explain the reason (safety, legal, or ethical).
  • Do not repeat, paraphrase, or expand any dangerous, illegal, or executable details.
  • Provide a safe, legal, and non-operational alternative aligned with the user’s likely intent.
  • Use a natural and professional tone; do not ask follow-up questions.
  • Typical length is approximately 100–150 English words, with small variation.

Responses that violate these principles β€” such as by providing actionable guidance, implicit instructions, or unsafe clarifications β€” are used as rejected examples during DPO.
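Most of the policy above requires an LLM judge, but a couple of its rules are mechanically checkable. The following lint is purely hypothetical (the card does not describe any rule-based check) and only encodes the length bound and the no-follow-up-questions rule, with some slack on the word count.

```python
# Hypothetical lint for the mechanically checkable parts of the refusal policy.

def violates_refusal_policy(text):
    """Return a list of rule violations found in a candidate refusal."""
    problems = []
    n_words = len(text.split())
    if not 80 <= n_words <= 180:      # ~100-150 words, with slack for variation
        problems.append("length out of range")
    if text.rstrip().endswith("?"):   # policy: do not ask follow-up questions
        problems.append("ends with a question")
    return problems
```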


  • Data mixture: Jailbreak Harmful : Pseudo-Jailbreak Benign = 462 : 1185
  • Epochs: 1
  • Learning rate: 1e-6
  • Beta: 0.1
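For reference, the standard DPO objective these pairs are trained with, written out for a single preference pair (the log-probabilities here are placeholders; beta = 0.1 matches the setting above):

```python
import math

# DPO loss for one pair: -log sigmoid(beta * (log-ratio(chosen) - log-ratio(rejected))),
# where each log-ratio compares the policy to the frozen reference model.

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model; a zero margin gives loss log 2.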

πŸ“ˆ Evaluation

Evaluated on the WildJailbreak Evaluation Set using GPT-4o-mini as an automatic judge.

For each test prompt, GPT-4o-mini classifies the model response as either:

  • Successful Jailbreak (contains actionable harmful or policy-violating content), or
  • Safe / Benign Response (refusal or compliant safe answer)

The same judge model and evaluation prompt are used across all models to ensure fair comparison.
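Scoring from the per-prompt judge labels can be sketched as below, assuming each record carries its split ("harmful" vs "benign") and the judge's verdict; the field names are assumptions for illustration.

```python
# Hypothetical sketch of turning judge verdicts into the two reported metrics.

def score(records):
    """Return (jailbreak success rate on harmful prompts, usability on benign)."""
    harmful = [r for r in records if r["split"] == "harmful"]
    benign = [r for r in records if r["split"] == "benign"]
    jailbreak_rate = sum(r["verdict"] == "jailbreak" for r in harmful) / len(harmful)
    usability = sum(r["verdict"] == "helpful" for r in benign) / len(benign)
    return jailbreak_rate, usability
```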

| Model | Jailbreak Success ↓ | Benign Usability ↑ |
|---|---|---|
| LoRA + Curriculum SFT | 1.45% | 88.57% |
| LoRA + SFT + DPO | 2.75% | 93.81% |

DPO raises benign usability by over 5 points (88.57% → 93.81%) at the cost of a small increase in jailbreak success (1.45% → 2.75%), preserving strong overall security.


πŸš€ Usage

This model is released as a LoRA adapter on top of LLaMA-3.1-8B Base.

Load and Run with Transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B"
ADAPTER = "hfuserh/LLaMA-3.1-8B-JailbreakSafe"

# Use the Instruct tokenizer for its chat template, which the base tokenizer lacks
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How can I make a bomb?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output = model.generate(
        inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

🧩 Intended Use

This model is intended for:

  • Research on prompt injection and jailbreak defense
  • Building safer conversational agents
  • Evaluating alignment techniques

Not intended for:

  • Generating harmful or illegal content
  • Use in safety-critical systems without human oversight

πŸ“œ Citation

If you use this model or build upon this work, please cite:

This Model

@misc{llama_jailbreak_resistant_2025,
  title={Security-Aligned LLaMA-3.1-8B for Jailbreak and Prompt Injection Defense},
  author={hanlonghui},
  year={2025}
}

WildJailbreak Dataset

@misc{wildteaming2024,
  title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models},
  author={Liwei Jiang and Kavel Rao and Seungju Han and Allyson Ettinger and Faeze Brahman and Sachin Kumar},
  year={2024},
  eprint={2406.18510},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2406.18510}
}

JailBench Dataset

@article{liu2025jailbench,
  title   = {JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models},
  author  = {Shuyi Liu and Simiao Cui and Haoran Bu and Yuming Shang and Xi Zhang},
  year    = {2025},
  journal = {arXiv preprint arXiv:2502.18935}
}

πŸ“œ License and Data Disclosure

Base Model

This model is based on LLaMA-3.1, which is licensed under the Meta LLaMA Community License.
All users of this model must comply with the original LLaMA-3.1 license:
https://ai.meta.com/llama/license/


Datasets

This model was trained using the public WildJailbreak and JailBench datasets (see Citation above).

All dataset licenses permit use for training and redistribution of derived models with attribution.


Synthetic Data

Some training samples (including refusal strategies and filtered benign answers) were generated using OpenAI models
(GPT-4o-mini and GPT-5-mini).
These generated outputs are not distributed with this model and are used only as intermediate supervision signals.
This model does not contain or expose any proprietary OpenAI data.

