	---
	base_model: Qwen/Qwen3-VL-2B-Instruct
	base_model_relation: adapter
	library_name: peft
	pipeline_tag: image-text-to-text
	license: apache-2.0
	language:
	- en
	tags:
	- lora
	- peft
	- vise
	- self-evolving
	- multimodal
	- vision-language
	- lmm
	- visual-grounding
	- image-captioning
	- qwen3-vl
	- unsupervised
	---

	# VISE: Visual Invariance Self-Evolution

	This is the VISE LoRA adapter for `Qwen/Qwen3-VL-2B-Instruct`, from our paper
	[Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models](https://arxiv.org/abs/2606.27373).

	VISE is a purely unsupervised, single-model self-evolving framework. Instead of
	optimizing answer agreement as prior self-evolving LMMs do, it strengthens the
	model's visual conditioning: how much the decoder actually attends to the image
	during generation. Training uses only raw, unlabeled images (no captions,
	bounding boxes, labels, external reward models, or specialist roles) and two
	invariance rewards computed from the model's own predictions:

	- Geometric invariance: rewards consistent localization of the same object
	under a known spatial transform (affine, crop, or flip).
	- Semantic invariance: blurs ("ghosts") the predicted region and rewards the
	model only if it judges the object visible before ghosting and not visible after.

	We combine them as `R = 0.5 * R_geo + 0.5 * R_sem` and optimize with KL-regularized
	REINFORCE against a frozen reference policy.
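
The two rewards and their combination can be illustrated with a minimal sketch. This is not the released training code: it assumes boxes in `(x1, y1, x2, y2)` pixel format, uses a horizontal flip as the only transform, and uses IoU as the agreement measure for `R_geo`, which is our assumption here. All function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def hflip_box(box: Box, image_width: float) -> Box:
    """Map a box through a horizontal flip of an image with the given width."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def geometric_reward(box_orig: Box, box_flipped: Box, image_width: float) -> float:
    """Assumed R_geo: IoU between the original box mapped through the flip
    and the box the model predicts on the flipped image."""
    return iou(hflip_box(box_orig, image_width), box_flipped)


def semantic_reward(visible_before: bool, visible_after: bool) -> float:
    """R_sem: 1 only if the object is judged visible before ghosting
    and not visible after; 0 otherwise."""
    return 1.0 if visible_before and not visible_after else 0.0


def total_reward(r_geo: float, r_sem: float, w_geo: float = 0.5, w_sem: float = 0.5) -> float:
    """R = 0.5 * R_geo + 0.5 * R_sem with the default weights."""
    return w_geo * r_geo + w_sem * r_sem


# Example: a box predicted on a 100-pixel-wide image and on its flipped copy.
r_geo = geometric_reward((10, 20, 50, 80), (60, 20, 100, 80), image_width=100)
r_sem = semantic_reward(visible_before=True, visible_after=False)
print(total_reward(r_geo, r_sem))
```

A rollout whose box does not track the flip, or whose object is still judged visible after ghosting, receives a lower `R` under this sketch.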

	## Usage

	This is a LoRA adapter, so load the base model first and attach the adapter:

	```python
	import torch
	from PIL import Image
	from transformers import AutoModelForVision2Seq, AutoProcessor
	from peft import PeftModel

	BASE = "Qwen/Qwen3-VL-2B-Instruct"
	ADAPTER = "shravvvv/VISE"

	model = AutoModelForVision2Seq.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
	model = PeftModel.from_pretrained(model, ADAPTER)
	processor = AutoProcessor.from_pretrained(ADAPTER)
	model.eval()

	image = Image.open("example.jpg").convert("RGB")
	messages = [{"role": "user", "content": [
	    {"type": "image", "image": image},
	    {"type": "text", "text": "Describe this image in detail."},
	]}]
	# Render the chat template to a prompt string, then tokenize it together with the image.
	text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

	with torch.inference_mode():
	    out = model.generate(**inputs, max_new_tokens=128)
	# Decode only the newly generated tokens (drop the prompt prefix).
	print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
	```

	## Results (Qwen3-VL-2B)

	| Metric | Base | VISE |
	| --- | --- | --- |
	| COCO (CIDEr) | 21.54 | 38.39 |
	| NoCaps (CIDEr) | 19.52 | 34.25 |
	| Flickr30k (CIDEr) | 26.09 | 42.64 |
	| TextCaps (CIDEr) | 22.20 | 41.86 |
	| CHAIR-I (lower is better) | 13.21 | 8.21 |
	| CHAIR-S (lower is better) | 45.96 | 40.51 |
	| POPE Accuracy | 89.01 | 90.03 |
	| ScienceQA | 79.42 | 83.61 |

	VISE improves captioning, VQA, and reasoning while reducing hallucination, with
	no trade-offs between tasks, and the same recipe generalizes to larger scales and
	other backbone families. Full results are in our paper.
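
To put the reported numbers in perspective, the absolute and relative changes can be computed directly from the table. The sketch below only does arithmetic on the scores listed above; the helper names are illustrative and not part of the release.

```python
# Scores copied from the results table: metric -> (base, VISE).
RESULTS = {
    "COCO (CIDEr)": (21.54, 38.39),
    "NoCaps (CIDEr)": (19.52, 34.25),
    "Flickr30k (CIDEr)": (26.09, 42.64),
    "TextCaps (CIDEr)": (22.20, 41.86),
    "CHAIR-I": (13.21, 8.21),
    "CHAIR-S": (45.96, 40.51),
    "POPE Accuracy": (89.01, 90.03),
    "ScienceQA": (79.42, 83.61),
}

# Hallucination rates: a decrease is an improvement.
LOWER_IS_BETTER = {"CHAIR-I", "CHAIR-S"}


def gain(base: float, vise: float):
    """Return (absolute change, relative change in percent) from base to VISE."""
    delta = vise - base
    return delta, 100.0 * delta / base


def improved(name: str, base: float, vise: float) -> bool:
    """True if the change goes in the desirable direction for this metric."""
    return vise < base if name in LOWER_IS_BETTER else vise > base


for name, (base, vise) in RESULTS.items():
    delta, rel = gain(base, vise)
    flag = "better" if improved(name, base, vise) else "worse"
    print(f"{name:<18} {delta:+7.2f} ({rel:+6.1f}%)  {flag}")
```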

	## Training

	- Base: `Qwen/Qwen3-VL-2B-Instruct`, vision encoder frozen.
	- LoRA: `r=16`, `alpha=32`, `dropout=0.05` on the attention, MLP, and projector layers.
	- Optimizer: AdamW, `lr=1e-6`, weight decay `0.01`, gradient clipping `1.0`, bfloat16.
	- RL: KL-regularized REINFORCE, adaptive KL (target `0.020`), reward weights `0.5 / 0.5`.
	- Data: 4,000 raw, unlabeled COCO images. No question/answer pairs or annotations.
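
The "adaptive KL" entry above does not specify the update rule. One common choice is the proportional controller used in PPO-style fine-tuning, which raises the KL coefficient when the measured KL exceeds the target and lowers it otherwise. The sketch below assumes that scheme; `init_beta`, `horizon`, the clipping range, and the class and function names are hypothetical and not taken from the VISE training code.

```python
class AdaptiveKLController:
    """Proportional controller for the KL penalty coefficient (assumed scheme)."""

    def __init__(self, init_beta: float, target_kl: float = 0.020, horizon: int = 1000):
        self.beta = init_beta        # current KL coefficient
        self.target_kl = target_kl   # target from the training list above
        self.horizon = horizon       # hypothetical: controls how fast beta adapts

    def update(self, observed_kl: float, n_steps: int = 1) -> float:
        # Relative error against the target, clipped to keep updates small.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta


def kl_regularized_reward(reward: float, kl: float, beta: float) -> float:
    """Reward used in the REINFORCE update: task reward minus the KL penalty
    against the frozen reference policy."""
    return reward - beta * kl


# Example: the policy drifts to twice the target KL, so beta increases.
controller = AdaptiveKLController(init_beta=0.1, target_kl=0.020, horizon=10)
beta = controller.update(observed_kl=0.040)
print(beta, kl_regularized_reward(reward=0.8, kl=0.040, beta=beta))
```

In this sketch the policy is pulled back toward the reference only as strongly as needed to keep the measured KL near `0.020`.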

	## License

	Apache 2.0.

	## Citation

	```bibtex
	@inproceedings{venkatraman2026vise,
	  title = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
	  author = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
	            Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
	  booktitle = {European Conference on Computer Vision (ECCV)},
	  year = {2026}
	}
	```