Instructions to use Matisse6410/LlaVa-1.5-SDPO with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Matisse6410/LlaVa-1.5-SDPO with PEFT:

from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = PeftModel.from_pretrained(base_model, "Matisse6410/LlaVa-1.5-SDPO")

Transformers

How to use Matisse6410/LlaVa-1.5-SDPO with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Matisse6410/LlaVa-1.5-SDPO")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Matisse6410/LlaVa-1.5-SDPO", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use Matisse6410/LlaVa-1.5-SDPO with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Matisse6410/LlaVa-1.5-SDPO"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Matisse6410/LlaVa-1.5-SDPO",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/Matisse6410/LlaVa-1.5-SDPO

SGLang

How to use Matisse6410/LlaVa-1.5-SDPO with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Matisse6410/LlaVa-1.5-SDPO" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Matisse6410/LlaVa-1.5-SDPO",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Matisse6410/LlaVa-1.5-SDPO" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Matisse6410/LlaVa-1.5-SDPO",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Matisse6410/LlaVa-1.5-SDPO with Docker Model Runner:
```
docker model run hf.co/Matisse6410/LlaVa-1.5-SDPO
```

LLaVA-1.5-SDPO (Symmetric Polarity-Inverted DPO)

This model card describes the visual alignment model LLaVA-1.5-SDPO, a vision-language model (VLM) fine-tuned using a 4-term Symmetric Polarity-Inverted Preference Loss (SymDPO/SDPO) to enhance visual intelligence, specifically in comprehending and logically reasoning about optical and visual illusions.

Model Details

Model Description

Standard vision-language models frequently fail basic visual intelligence and spatial consistency tests. For instance, when presented with a visual illusion, their responses often change inconsistently based on how the question is framed.

This model is fine-tuned from LLaVA-1.5-7B on a custom polarity-inverted preference dataset. By applying Symmetric Polarity Direct Preference Optimization (SDPO), the model is trained to remain logically and visually consistent when prompt polarity is inverted (e.g. asking which element appears "longer" vs. "shorter") on the exact same static illusion image.

Developed by: Matisse van Schalkwijk
Model type: Vision-Language Model (LoRA Adapter on llava-hf/llava-1.5-7b-hf language model + fine-tuned multi-modal projector)
Language(s): English
License: Apache 2.0 / LLaVA Research License
Finetuned from model: llava-hf/llava-1.5-7b-hf

Model Sources

Repository: Matisse6410/LlaVa-1.5-SDPO

Uses

Direct Use

Visual Intelligence Research: Psychometric evaluation and probing of VLMs on geometric (Müller-Lyer, Ponzo, Ebbinghaus), color/contrast (Simultaneous Contrast, White's Illusion), angle (Zöllner, Poggendorff), and motion (Scintillating Grid) visual illusions.
Consistency Analysis: Studying spatial and semantic consistency under opposite prompt framings.

Out-of-Scope Use

Critical decision-making applications (e.g., medical imaging analysis, autonomous driving visual perception, high-stakes safety tasks) where guaranteed visual accuracy is required without human-in-the-loop oversight.

Bias, Risks, and Limitations

Like all large vision-language models, LLaVA-1.5-SDPO is subject to hallucination, social biases inherent in its pretraining data, and varying accuracy across complex scenes. It is primarily intended as a research release for evaluating VLM consistency and visual intelligence.

How to Get Started with the Model

Because this fine-tuning run updates both the language backbone (via LoRA adapters) and the multimodal projector weights, you should load both components. Use the snippet below to download and initialize the model:

import torch
from transformers import pipeline, AutoProcessor
from peft import PeftModel
from huggingface_hub import hf_hub_download

# 1. Initialize base LLaVA-1.5 model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
adapter_id = "Matisse6410/LlaVa-1.5-SDPO"

processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline("image-text-to-text", model=model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 2. Load the trained LoRA adapter weights
pipe.model = PeftModel.from_pretrained(pipe.model, adapter_id)

# 3. Download and load the custom fine-tuned multi-modal projector weights
projector_file = hf_hub_download(repo_id=adapter_id, filename="multi_modal_projector.pt")
projector_state = torch.load(projector_file, map_location=pipe.model.device)
pipe.model.base_model.model.model.multi_modal_projector.load_state_dict(projector_state)

# Now ready for inference!

Training Details

Training Data

The model was trained on the Symmetric Polarity-Inverted Preference Dataset, consisting of:

Polarity Pairs: Textual prompts and corresponding chosen/rejected response pairs representing visual illusions across categories: Geometric, Color, Angle, and Motion.
Control VQA Safeguard: Approximately 20% of the training data consists of non-illusion factual control visual question-answering entries (e.g., "How many lines are in this image?", "What colour is the background?") to mitigate catastrophic forgetting of general visual capabilities during the preference alignment process.

Training Procedure

Fine-tuning is performed using the Symmetric Polarity Preference Loss formulation:

$\mathcal{L}(\theta) = \mathcal{L}_{\text{DPO}, m}(\theta) + \gamma \mathcal{L}_{\text{Symmetric}}(\theta) + \lambda \mathcal{L}_{\text{Margin}}(\theta) + \eta \mathcal{L}_{\text{AncPO}}(\theta)$

This multi-term loss function optimizes:

Standard DPO Loss on the original prompt polarity.
Symmetric DPO Loss on the inverted prompt polarity.
Preference-Margin Consistency Loss to minimize variance between the original and inverted preference gaps.
Anchored Preference Loss (AncPO) to stabilize the absolute log-likelihoods of chosen responses.

During training, the CLIP vision encoder remains frozen, the multi-modal projector is fully unfrozen and updated, and the language model is adapted using LoRA on its projection layers.

Training Hyperparameters

DPO Temperature ($\beta$): 0.1
Symmetric Loss Weight ($\gamma$): 1.0
Preference Margin Weight ($\lambda$): 0.5
Anchored Preference Weight ($\eta$): 0.1
LoRA Rank ($r$): 64
LoRA Alpha ($\alpha$): 16
LoRA Dropout: 0.05
Learning Rate: $1.0 \times 10^{-5}$
Learning Rate Schedule: Linear warmup (first 5% of steps) followed by Cosine learning rate decay.
Optimizer: AdamW

Environmental Impact

Hardware Type: NVIDIA GPUs (A100 / H100 cluster)
Precision: BF16 Mixed Precision

Model Card Authors

Matisse van Schalkwijk

Framework Versions

PEFT 0.19.1
PyTorch 2.4+
Transformers 4.45+

Downloads last month: 28

Model tree for Matisse6410/LlaVa-1.5-SDPO

Base model

llava-hf/llava-1.5-7b-hf

Adapter

(150)

this model