---
base_model: Qwen/Qwen3-VL-2B-Instruct
base_model_relation: adapter
library_name: peft
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- en
tags:
- lora
- peft
- vise
- self-evolving
- multimodal
- vision-language
- lmm
- visual-grounding
- image-captioning
- qwen3-vl
- unsupervised
---
# VISE: Visual Invariance Self-Evolution
This is the VISE LoRA adapter for `Qwen/Qwen3-VL-2B-Instruct`, from our paper
[Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models](https://arxiv.org/abs/2606.27373).

VISE is a purely unsupervised, single-model self-evolving framework. Instead of
optimizing answer agreement like prior self-evolving LMMs, it strengthens the
model's visual conditioning, that is, how much the decoder actually attends to the
image while it generates. Training uses only raw, unlabeled images: no captions,
bounding boxes, labels, external reward models, or specialist roles. The learning
signal comes from two invariance rewards computed from the model's own predictions:
- **Geometric invariance:** rewards consistent localization of the same object
  under a known spatial transform (affine, crop, or flip).
- **Semantic invariance:** blurs ("ghosts") the predicted region and rewards the
  model only if it judges the object visible before ghosting and not visible after.

We combine them with equal weights, `R = 0.5*R_geo + 0.5*R_sem`, and optimize with
KL-regularized REINFORCE against a frozen reference policy.
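
The two rewards and their combination can be illustrated with a small, self-contained sketch. It is not the training code from the paper: the card does not say how localization agreement is scored, so IoU between boxes is an assumption here, only the horizontal-flip case is shown, and the semantic reward is assumed to be binary. All function names (`hflip_box`, `iou`, `geometric_reward`, `semantic_reward`, `combined_reward`) are hypothetical.

```python
def hflip_box(box, image_width):
    """Map an (x1, y1, x2, y2) box into the horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_reward(box_original, box_flipped, image_width):
    """Assumed form of R_geo: IoU between the original prediction mapped
    through the known flip and the prediction made on the flipped image."""
    return iou(hflip_box(box_original, image_width), box_flipped)


def semantic_reward(visible_before, visible_after):
    """Assumed binary R_sem: 1.0 only if the model judges the object
    visible before ghosting and not visible after."""
    return 1.0 if (visible_before and not visible_after) else 0.0


def combined_reward(r_geo, r_sem, w_geo=0.5, w_sem=0.5):
    """R = 0.5*R_geo + 0.5*R_sem with the weights stated in this card."""
    return w_geo * r_geo + w_sem * r_sem


# Prediction on the original image and on its horizontal flip (width 100).
r_geo = geometric_reward((10, 20, 50, 60), (60, 20, 100, 60), image_width=100)
r_sem = semantic_reward(visible_before=True, visible_after=False)
print(combined_reward(r_geo, r_sem))
```

In this example the flipped-image prediction is offset by 10 pixels from where the transform says it should be, so the geometric term is only partially satisfied while the semantic term is fully satisfied.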
## Usage
This is a LoRA adapter, so load the base model first and attach the adapter:
```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen3-VL-2B-Instruct"
ADAPTER = "shravvvv/VISE"

# Load the base model, then attach the VISE LoRA weights on top of it.
model = AutoModelForVision2Seq.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
processor = AutoProcessor.from_pretrained(ADAPTER)
model.eval()

image = Image.open("example.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image in detail."},
]}]

# Render the chat template to text, then tokenize it together with the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
## Results (Qwen3-VL-2B)
| Metric | Base | VISE |
| --- | --- | --- |
| COCO (CIDEr) | 21.54 | 38.39 |
| NoCaps (CIDEr) | 19.52 | 34.25 |
| Flickr30k (CIDEr) | 26.09 | 42.64 |
| TextCaps (CIDEr) | 22.20 | 41.86 |
| CHAIR-I (lower is better) | 13.21 | 8.21 |
| CHAIR-S (lower is better) | 45.96 | 40.51 |
| POPE Accuracy | 89.01 | 90.03 |
| ScienceQA | 79.42 | 83.61 |

VISE improves captioning, VQA, and reasoning scores while reducing hallucination,
with no tradeoff between tasks. The same recipe generalizes to larger model scales
and to other backbone families. Full results are in our paper.
## Training
- Base: `Qwen/Qwen3-VL-2B-Instruct`, vision encoder frozen.
- LoRA: `r=16`, `alpha=32`, `dropout=0.05` on the attention, MLP, and projector layers.
- Optimizer: AdamW, `lr=1e-6`, weight decay `0.01`, gradient clipping `1.0`, bfloat16.
- RL: KL-regularized REINFORCE, adaptive KL (target `0.020`), reward weights `0.5 / 0.5`.
- Data: 4,000 raw, unlabeled COCO images. No question/answer pairs or annotations.
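
As a rough illustration of the RL line above, the sketch below shows one common way to implement KL-regularized REINFORCE with an adaptive KL coefficient. It is an assumption, not the paper's implementation: the card only states "adaptive KL (target `0.020`)", so the proportional controller (in the style of Ziegler et al., 2019), the initial `beta`, the `horizon`, the baseline, and the names `AdaptiveKL` and `reinforce_kl_loss` are all hypothetical. Plain floats stand in for tensors.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveKL:
    """Proportional controller that nudges beta so observed KL tracks the target."""
    beta: float = 0.1        # hypothetical initial KL coefficient
    target: float = 0.020    # KL target stated in the Training list
    horizon: float = 1000.0  # hypothetical controller horizon, in samples

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Clip the relative error so a single batch cannot move beta too far.
        error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


def reinforce_kl_loss(logp_policy, logp_ref, reward, baseline, beta):
    """Surrogate loss for one sampled sequence.

    logp_policy and logp_ref are the summed token log-probabilities of the
    sample under the trained policy and the frozen reference. The KL term is
    folded into the reward; in an autograd framework the resulting advantage
    would be treated as a constant and only logp_policy would carry gradient.
    """
    kl_estimate = logp_policy - logp_ref      # single-sample KL(policy || ref) estimate
    shaped_reward = reward - beta * kl_estimate
    advantage = shaped_reward - baseline
    return -advantage * logp_policy


controller = AdaptiveKL()
loss = reinforce_kl_loss(logp_policy=-2.0, logp_ref=-2.5, reward=0.8,
                         baseline=0.5, beta=controller.beta)
new_beta = controller.update(observed_kl=0.04, n_samples=100)
print(loss, new_beta)
```

When the observed KL is above the target, the controller raises `beta` and pulls the policy back toward the reference; when it is below, `beta` shrinks and the reward term dominates.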
## License
Apache 2.0.
## Citation
```bibtex
@inproceedings{venkatraman2026vise,
  title     = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
  author    = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
               Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```