LoRA adapter that turns liuhaotian/llava-v1.5-13b into a jailbreak-robust multimodal model, trained with the SafeMLLM framework described in:
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks. Ziyi Yin, Yuanpu Cao, Han Liu, Ting Wang, Jinghui Chen, Fenglong Ma. arXiv:2502.00653 (2025).
| File | What it is |
|---|---|
| `adapter_config.json` | PEFT LoRA config |
| `adapter_model.bin` | LoRA weights (rank-r updates on attention/MLP layers) |
| `non_lora_trainables.bin` | Vision-language projector weights |
| `config.json` | LLaVA model config snapshot |
| `trainer_state.json` | Training-time logs |
To use it you also need the LLaVA-1.5-13B base weights from
liuhaotian/llava-v1.5-13b.
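The base weights are needed because a LoRA adapter stores only low-rank update matrices that are added to the frozen base weights. The following is a toy, pure-Python sketch of that idea, using the common `alpha / r` scaling; the shapes, values, and the helper names `matmul` and `apply_lora` are made up for illustration, and the actual rank and scaling for this adapter live in `adapter_config.json` and are not reproduced here.

```python
# Toy illustration of a LoRA update: W_eff = W + (alpha / r) * (B @ A).
# All shapes and numbers are hypothetical; real values come from adapter_config.json.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def apply_lora(w, lora_a, lora_b, alpha, r):
    """Return the effective weight after adding the scaled low-rank update."""
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Frozen base weight (2 x 3) and a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x in  = 1 x 3
B = [[0.5], [0.0]]      # out x r = 2 x 1

W_eff = apply_lora(W, A, B, alpha=2, r=1)
print(W_eff)
```

Without `W` there is nothing for the update to be added to, which is why the adapter alone cannot be run.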
git clone https://github.com/ericyinyzy/SafeMLLM.git
cd SafeMLLM
conda env create -f environment.yml && conda activate safemllm-llava
mkdir -p checkpoints
huggingface-cli download liuhaotian/llava-v1.5-13b --local-dir checkpoints/llava-v1.5-13b
huggingface-cli download ericyinyzy/SafeMLLM-LLaVA-13B --local-dir checkpoints/SafeMLLM-LLaVA-13B
export LLAVA13B_BASE=$PWD/checkpoints/llava-v1.5-13b
export SAFEMLLM_L13B=$PWD/checkpoints/SafeMLLM-LLaVA-13B
bash scripts/run_L13B.sh 0 # GPU id
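Before launching the run script, it can help to confirm that both environment variables point at complete downloads. This is a minimal sketch using only the standard library: the helper name `check_checkpoints` is hypothetical, the expected adapter files are taken from this card's file table, and whether the run script reads any other variables is not verified here.

```python
import os
from pathlib import Path

# Adapter files listed in this model card's file table.
# trainer_state.json is a training log, so it is not checked here.
ADAPTER_FILES = [
    "adapter_config.json",
    "adapter_model.bin",
    "non_lora_trainables.bin",
    "config.json",
]

def check_checkpoints(env=None):
    """Return a list of problems found; an empty list means the layout looks fine."""
    env = os.environ if env is None else env
    problems = []
    for var in ("LLAVA13B_BASE", "SAFEMLLM_L13B"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not Path(path).is_dir():
            problems.append(f"{var} does not point to a directory: {path}")
    adapter_dir = env.get("SAFEMLLM_L13B")
    if adapter_dir and Path(adapter_dir).is_dir():
        for name in ADAPTER_FILES:
            if not (Path(adapter_dir) / name).is_file():
                problems.append(f"missing adapter file: {name}")
    return problems

if __name__ == "__main__":
    for problem in check_checkpoints():
        print("WARNING:", problem)
```

Returning a list instead of raising on the first problem lets a single run report everything that is missing.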
Programmatic loading:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
tokenizer, model, image_processor, _ = load_pretrained_model(
model_path="ericyinyzy/SafeMLLM-LLaVA-13B",
model_base="liuhaotian/llava-v1.5-13b",
model_name=get_model_name_from_path("ericyinyzy/SafeMLLM-LLaVA-13B"),
)
| Use case | VRAM |
|---|---|
| Inference (fp16) | ~32 GB |
| ImgJP attack (PGD) | ~46 GB |
For a single 24 GB GPU, pass load_in_8bit=True to load_pretrained_model and
lower the ImgJP iteration count with --iters 40.
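The 8-bit suggestion follows from a rough weight-memory estimate. The sketch below assumes about 13 billion parameters (a round number that ignores the vision tower and projector) and counts weights only; activations, the KV cache, and attack gradients come on top, which is consistent with the larger figures in the table.

```python
# Back-of-the-envelope weight memory for a ~13B-parameter model.
# PARAMS is an assumed round number, not a measured parameter count.
PARAMS = 13e9

BYTES_PER_PARAM = {"fp16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.0f} GB of weights")
```

Under these assumptions fp16 weights alone exceed 24 GB, while 8-bit weights leave headroom on a 24 GB card.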
The adapter weights are released under Apache-2.0. The underlying LLaVA-1.5 base model retains
its own license; see
liuhaotian/llava-v1.5-13b.
@article{yin2025safemllm,
title = {Towards Robust Multimodal Large Language Models Against Jailbreak Attacks},
author = {Yin, Ziyi and Cao, Yuanpu and Liu, Han and Wang, Ting and Chen, Jinghui and Ma, Fenglong},
journal = {arXiv preprint arXiv:2502.00653},
year = {2025}
}