LLaVA-1.5-SDPO (Symmetric Polarity-Inverted DPO)

This model card describes the visual alignment model LLaVA-1.5-SDPO, a vision-language model (VLM) fine-tuned using a 4-term Symmetric Polarity-Inverted Preference Loss (SymDPO/SDPO) to enhance visual intelligence, specifically in comprehending and logically reasoning about optical and visual illusions.

Model Details

Model Description

Standard vision-language models frequently fail basic visual intelligence and spatial consistency tests. For instance, when presented with a visual illusion, their responses often change inconsistently based on how the question is framed.

This model is fine-tuned from LLaVA-1.5-7B on a custom polarity-inverted preference dataset. By applying Symmetric Polarity Direct Preference Optimization (SDPO), the model is trained to remain logically and visually consistent when prompt polarity is inverted (e.g. asking which element appears "longer" vs. "shorter") on the exact same static illusion image.

  • Developed by: Matisse van Schalkwijk
  • Model type: Vision-Language Model (LoRA Adapter on llava-hf/llava-1.5-7b-hf language model + fine-tuned multi-modal projector)
  • Language(s): English
  • License: Apache 2.0 / LLaVA Research License
  • Finetuned from model: llava-hf/llava-1.5-7b-hf

Model Sources

Uses

Direct Use

  • Visual Intelligence Research: Psychometric evaluation and probing of VLMs on geometric (Müller-Lyer, Ponzo, Ebbinghaus), color/contrast (Simultaneous Contrast, White's Illusion), angle (Zöllner, Poggendorff), and motion (Scintillating Grid) visual illusions.
  • Consistency Analysis: Studying spatial and semantic consistency under opposite prompt framings.

Out-of-Scope Use

  • Critical decision-making applications (e.g., medical imaging analysis, autonomous driving visual perception, high-stakes safety tasks) where guaranteed visual accuracy is required without human-in-the-loop oversight.

Bias, Risks, and Limitations

Like all large vision-language models, LLaVA-1.5-SDPO is subject to hallucination, social biases inherent in its pretraining data, and varying accuracy across complex scenes. It is primarily intended as a research release for evaluating VLM consistency and visual intelligence.

How to Get Started with the Model

Because this fine-tuning run updates both the language backbone (via LoRA adapters) and the multimodal projector weights, you should load both components. Use the snippet below to download and initialize the model:

import torch
from transformers import pipeline, AutoProcessor
from peft import PeftModel
from huggingface_hub import hf_hub_download

# 1. Initialize base LLaVA-1.5 model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
adapter_id = "Matisse6410/LlaVa-1.5-SDPO"

processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline("image-text-to-text", model=model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 2. Load the trained LoRA adapter weights
pipe.model = PeftModel.from_pretrained(pipe.model, adapter_id)

# 3. Download and load the custom fine-tuned multi-modal projector weights
projector_file = hf_hub_download(repo_id=adapter_id, filename="multi_modal_projector.pt")
projector_state = torch.load(projector_file, map_location=pipe.model.device)
pipe.model.base_model.model.model.multi_modal_projector.load_state_dict(projector_state)

# Now ready for inference!

Training Details

Training Data

The model was trained on the Symmetric Polarity-Inverted Preference Dataset, consisting of:

  • Polarity Pairs: Textual prompts and corresponding chosen/rejected response pairs representing visual illusions across categories: Geometric, Color, Angle, and Motion.
  • Control VQA Safeguard: Approximately 20% of the training data consists of non-illusion factual control visual question-answering entries (e.g., "How many lines are in this image?", "What colour is the background?") to mitigate catastrophic forgetting of general visual capabilities during the preference alignment process.

Training Procedure

Fine-tuning is performed using the Symmetric Polarity Preference Loss formulation:

L(θ)=LDPO,m(θ)+γLSymmetric(θ)+λLMargin(θ)+ηLAncPO(θ)\mathcal{L}(\theta) = \mathcal{L}_{\text{DPO}, m}(\theta) + \gamma \mathcal{L}_{\text{Symmetric}}(\theta) + \lambda \mathcal{L}_{\text{Margin}}(\theta) + \eta \mathcal{L}_{\text{AncPO}}(\theta)

This multi-term loss function optimizes:

  1. Standard DPO Loss on the original prompt polarity.
  2. Symmetric DPO Loss on the inverted prompt polarity.
  3. Preference-Margin Consistency Loss to minimize variance between the original and inverted preference gaps.
  4. Anchored Preference Loss (AncPO) to stabilize the absolute log-likelihoods of chosen responses.

During training, the CLIP vision encoder remains frozen, the multi-modal projector is fully unfrozen and updated, and the language model is adapted using LoRA on its projection layers.

Training Hyperparameters

  • DPO Temperature ($\beta$): 0.1
  • Symmetric Loss Weight ($\gamma$): 1.0
  • Preference Margin Weight ($\lambda$): 0.5
  • Anchored Preference Weight ($\eta$): 0.1
  • LoRA Rank ($r$): 64
  • LoRA Alpha ($\alpha$): 16
  • LoRA Dropout: 0.05
  • Learning Rate: $1.0 \times 10^{-5}$
  • Learning Rate Schedule: Linear warmup (first 5% of steps) followed by Cosine learning rate decay.
  • Optimizer: AdamW

Environmental Impact

  • Hardware Type: NVIDIA GPUs (A100 / H100 cluster)
  • Precision: BF16 Mixed Precision

Model Card Authors

  • Matisse van Schalkwijk

Framework Versions

  • PEFT 0.19.1
  • PyTorch 2.4+
  • Transformers 4.45+
Downloads last month
28
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Matisse6410/LlaVa-1.5-SDPO

Adapter
(150)
this model