SAVANT Multimodal Evaluation Model (LoRA Adapter)

This repository contains the LoRA adapter for the multimodal anomaly evaluation model (Phase 2) described in the paper Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning.

Project Page: https://TUM-AVS.github.io/SAVANT/

This repository is provided for peer-review purposes only. After the review process, the model will be made publicly available through the authors' main account.

Model Description

LoRA adapter for Qwen/Qwen2.5-VL-7B-Instruct, fine-tuned for anomaly evaluation using both the driving scene image and a structured scene description. This is Phase 2 of the SAVANT two-phase pipeline.

The model receives:

  1. The original front-camera image
  2. A structured scene description (generated by the Phase 1 model)

And outputs a binary anomaly classification with detailed reasoning.

Pipeline Performance

When used as part of the full SAVANT pipeline (Phase 1 + Phase 2), evaluated on a balanced test set of 1,020 driving scene images:

Metric Value
Accuracy 83.7%
Precision 85.1%
Recall 81.8%
F1-Score 83.4%

Training Details

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct
  • Method: LoRA (Low-Rank Adaptation)
  • Dataset: 4,260 samples with image + scene description + anomaly labels
  • Epochs: 3
  • Learning rate: 1e-4 (cosine schedule)
  • Precision: bfloat16 with Flash Attention 2

LoRA Configuration

Parameter Value
Rank (r) 16
Alpha 32
Dropout 0.05
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, fc1, fc2, qkv, mlp.0, mlp.2

Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "u94fmn391j/SAVANT-multimodal-evaluation-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

Limitations

  • Trained on the CODA dataset; generalization to other driving domains not evaluated
  • Single-frame analysis only (no temporal context)
  • Pipeline performance depends on the quality of the Phase 1 scene description
Downloads last month
41
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Brusnicki/SAVANT-multimodal-evaluation-lora

Adapter
(282)
this model

Paper for Brusnicki/SAVANT-multimodal-evaluation-lora