SAVANT Multimodal Evaluation Model (LoRA Adapter)

This repository contains the LoRA adapter for the multimodal anomaly evaluation model (Phase 2) described in the paper Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning.

Project Page: https://TUM-AVS.github.io/SAVANT/

This repository is provided for peer-review purposes only. After the review process, the model will be made publicly available through the authors' main account.

Model Description

LoRA adapter for Qwen/Qwen2.5-VL-7B-Instruct, fine-tuned for anomaly evaluation using both the driving scene image and a structured scene description. This is Phase 2 of the SAVANT two-phase pipeline.

The model receives:

The original front-camera image
A structured scene description (generated by the Phase 1 model)

And outputs a binary anomaly classification with detailed reasoning.

Pipeline Performance

When used as part of the full SAVANT pipeline (Phase 1 + Phase 2), evaluated on a balanced test set of 1,020 driving scene images:

Metric	Value
Accuracy	83.7%
Precision	85.1%
Recall	81.8%
F1-Score	83.4%

Training Details

Base model: Qwen/Qwen2.5-VL-7B-Instruct
Method: LoRA (Low-Rank Adaptation)
Dataset: 4,260 samples with image + scene description + anomaly labels
Epochs: 3
Learning rate: 1e-4 (cosine schedule)
Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-multimodal-evaluation-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

Trained on the CODA dataset; generalization to other driving domains not evaluated
Single-frame analysis only (no temporal context)
Pipeline performance depends on the quality of the Phase 1 scene description

Downloads last month: 41

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Brusnicki/SAVANT-multimodal-evaluation-lora

Base model

Qwen/Qwen2.5-VL-7B-Instruct

Adapter

(282)

this model

Paper for Brusnicki/SAVANT-multimodal-evaluation-lora

SAVANT: Semantic Analysis with Vision-Augmented Anomaly deTection

Paper • 2510.18034 • Published Oct 20, 2025 • 4