Instructions to use Matisse6410/LlaVa-1.5-SDPO with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use Matisse6410/LlaVa-1.5-SDPO with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("llava-hf/llava-1.5-7b-hf") model = PeftModel.from_pretrained(base_model, "Matisse6410/LlaVa-1.5-SDPO") - Transformers
How to use Matisse6410/LlaVa-1.5-SDPO with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="Matisse6410/LlaVa-1.5-SDPO")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Matisse6410/LlaVa-1.5-SDPO", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Matisse6410/LlaVa-1.5-SDPO with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "Matisse6410/LlaVa-1.5-SDPO" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Matisse6410/LlaVa-1.5-SDPO", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/Matisse6410/LlaVa-1.5-SDPO
- SGLang
How to use Matisse6410/LlaVa-1.5-SDPO with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "Matisse6410/LlaVa-1.5-SDPO" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Matisse6410/LlaVa-1.5-SDPO", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "Matisse6410/LlaVa-1.5-SDPO" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "Matisse6410/LlaVa-1.5-SDPO", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use Matisse6410/LlaVa-1.5-SDPO with Docker Model Runner:
docker model run hf.co/Matisse6410/LlaVa-1.5-SDPO
LLaVA-1.5-SDPO (Symmetric Polarity-Inverted DPO)
This model card describes the visual alignment model LLaVA-1.5-SDPO, a vision-language model (VLM) fine-tuned using a 4-term Symmetric Polarity-Inverted Preference Loss (SymDPO/SDPO) to enhance visual intelligence, specifically in comprehending and logically reasoning about optical and visual illusions.
Model Details
Model Description
Standard vision-language models frequently fail basic visual intelligence and spatial consistency tests. For instance, when presented with a visual illusion, their responses often change inconsistently based on how the question is framed.
This model is fine-tuned from LLaVA-1.5-7B on a custom polarity-inverted preference dataset. By applying Symmetric Polarity Direct Preference Optimization (SDPO), the model is trained to remain logically and visually consistent when prompt polarity is inverted (e.g. asking which element appears "longer" vs. "shorter") on the exact same static illusion image.
- Developed by: Matisse van Schalkwijk
- Model type: Vision-Language Model (LoRA Adapter on
llava-hf/llava-1.5-7b-hflanguage model + fine-tuned multi-modal projector) - Language(s): English
- License: Apache 2.0 / LLaVA Research License
- Finetuned from model: llava-hf/llava-1.5-7b-hf
Model Sources
- Repository: Matisse6410/LlaVa-1.5-SDPO
Uses
Direct Use
- Visual Intelligence Research: Psychometric evaluation and probing of VLMs on geometric (Müller-Lyer, Ponzo, Ebbinghaus), color/contrast (Simultaneous Contrast, White's Illusion), angle (Zöllner, Poggendorff), and motion (Scintillating Grid) visual illusions.
- Consistency Analysis: Studying spatial and semantic consistency under opposite prompt framings.
Out-of-Scope Use
- Critical decision-making applications (e.g., medical imaging analysis, autonomous driving visual perception, high-stakes safety tasks) where guaranteed visual accuracy is required without human-in-the-loop oversight.
Bias, Risks, and Limitations
Like all large vision-language models, LLaVA-1.5-SDPO is subject to hallucination, social biases inherent in its pretraining data, and varying accuracy across complex scenes. It is primarily intended as a research release for evaluating VLM consistency and visual intelligence.
How to Get Started with the Model
Because this fine-tuning run updates both the language backbone (via LoRA adapters) and the multimodal projector weights, you should load both components. Use the snippet below to download and initialize the model:
import torch
from transformers import pipeline, AutoProcessor
from peft import PeftModel
from huggingface_hub import hf_hub_download
# 1. Initialize base LLaVA-1.5 model and processor
model_id = "llava-hf/llava-1.5-7b-hf"
adapter_id = "Matisse6410/LlaVa-1.5-SDPO"
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline("image-text-to-text", model=model_id, torch_dtype=torch.bfloat16, device_map="auto")
# 2. Load the trained LoRA adapter weights
pipe.model = PeftModel.from_pretrained(pipe.model, adapter_id)
# 3. Download and load the custom fine-tuned multi-modal projector weights
projector_file = hf_hub_download(repo_id=adapter_id, filename="multi_modal_projector.pt")
projector_state = torch.load(projector_file, map_location=pipe.model.device)
pipe.model.base_model.model.model.multi_modal_projector.load_state_dict(projector_state)
# Now ready for inference!
Training Details
Training Data
The model was trained on the Symmetric Polarity-Inverted Preference Dataset, consisting of:
- Polarity Pairs: Textual prompts and corresponding chosen/rejected response pairs representing visual illusions across categories: Geometric, Color, Angle, and Motion.
- Control VQA Safeguard: Approximately 20% of the training data consists of non-illusion factual control visual question-answering entries (e.g., "How many lines are in this image?", "What colour is the background?") to mitigate catastrophic forgetting of general visual capabilities during the preference alignment process.
Training Procedure
Fine-tuning is performed using the Symmetric Polarity Preference Loss formulation:
This multi-term loss function optimizes:
- Standard DPO Loss on the original prompt polarity.
- Symmetric DPO Loss on the inverted prompt polarity.
- Preference-Margin Consistency Loss to minimize variance between the original and inverted preference gaps.
- Anchored Preference Loss (AncPO) to stabilize the absolute log-likelihoods of chosen responses.
During training, the CLIP vision encoder remains frozen, the multi-modal projector is fully unfrozen and updated, and the language model is adapted using LoRA on its projection layers.
Training Hyperparameters
- DPO Temperature ($\beta$): 0.1
- Symmetric Loss Weight ($\gamma$): 1.0
- Preference Margin Weight ($\lambda$): 0.5
- Anchored Preference Weight ($\eta$): 0.1
- LoRA Rank ($r$): 64
- LoRA Alpha ($\alpha$): 16
- LoRA Dropout: 0.05
- Learning Rate: $1.0 \times 10^{-5}$
- Learning Rate Schedule: Linear warmup (first 5% of steps) followed by Cosine learning rate decay.
- Optimizer: AdamW
Environmental Impact
- Hardware Type: NVIDIA GPUs (A100 / H100 cluster)
- Precision: BF16 Mixed Precision
Model Card Authors
- Matisse van Schalkwijk
Framework Versions
- PEFT 0.19.1
- PyTorch 2.4+
- Transformers 4.45+
- Downloads last month
- 28
Model tree for Matisse6410/LlaVa-1.5-SDPO
Base model
llava-hf/llava-1.5-7b-hf