---
base_model: Qwen/Qwen3-VL-2B-Instruct
base_model_relation: adapter
library_name: peft
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- en
tags:
- lora
- peft
- vise
- self-evolving
- multimodal
- vision-language
- lmm
- visual-grounding
- image-captioning
- qwen3-vl
- unsupervised
---

# VISE: Visual Invariance Self-Evolution

This is the VISE LoRA adapter for `Qwen/Qwen3-VL-2B-Instruct`, from our paper
[Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models](https://arxiv.org/abs/2606.27373).

VISE is a purely unsupervised, single-model self-evolving framework. Instead of
optimizing answer agreement like prior self-evolving LMMs, it strengthens the
model's visual conditioning, that is, how much the decoder actually attends to the
image while it generates. We train on raw, unlabeled images, with no captions,
bounding boxes, labels, external reward models, or specialist roles, using two
invariance rewards computed from the model's own predictions:

- **Geometric invariance:** rewards consistent localization of the same object
  under a known spatial transform (affine, crop, or flip).
- **Semantic invariance:** blurs ("ghosts") the predicted region and rewards the
  model only if it judges the object visible before ghosting and not visible after.

We combine them as `R = 0.5*R_geo + 0.5*R_sem` and optimize with KL-regularized
REINFORCE against a frozen reference policy.
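
To make the two rewards concrete, here is a minimal sketch for the horizontal-flip case. It is an illustration under stated assumptions, not the training code from the paper: the `Box` type and every function name are hypothetical, and scoring geometric consistency with IoU is an assumption, since this card does not spell out the exact consistency measure. Only the binary visible-before/not-visible-after rule and the `0.5 / 0.5` weighting come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (x1 < x2, y1 < y2). Hypothetical type."""
    x1: float
    y1: float
    x2: float
    y2: float


def hflip(box: Box, image_width: float) -> Box:
    """Map a box through a horizontal flip of an image with the given width."""
    return Box(image_width - box.x2, box.y1, image_width - box.x1, box.y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union; 0.0 when the boxes do not overlap."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def geometric_reward(pred_original: Box, pred_flipped: Box, image_width: float) -> float:
    """Assumed R_geo: IoU between the original prediction mapped through the
    known transform and the prediction made on the transformed image."""
    return iou(hflip(pred_original, image_width), pred_flipped)


def semantic_reward(visible_before: bool, visible_after: bool) -> float:
    """R_sem: 1 only if the object is judged visible before ghosting and not after."""
    return 1.0 if visible_before and not visible_after else 0.0


def combined_reward(r_geo: float, r_sem: float, w_geo: float = 0.5, w_sem: float = 0.5) -> float:
    """R = 0.5 * R_geo + 0.5 * R_sem, using the weights stated in this card."""
    return w_geo * r_geo + w_sem * r_sem


# Example: the model localizes the same object before and after a flip
# of a 640-pixel-wide image, and the ghosting check succeeds.
pred = Box(100, 50, 300, 250)
pred_on_flipped = Box(340, 50, 540, 250)
r_geo = geometric_reward(pred, pred_on_flipped, image_width=640)
r_sem = semantic_reward(visible_before=True, visible_after=False)
reward = combined_reward(r_geo, r_sem)
```

Because both terms lie in `[0, 1]` under these assumptions, the combined reward does too, so a sample that localizes consistently but fails the ghosting check earns at most half credit.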

## Usage

This is a LoRA adapter, so load the base model first and attach the adapter:

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen3-VL-2B-Instruct"
ADAPTER = "shravvvv/VISE"

# Load the base model, then attach the VISE LoRA weights on top of it.
# AutoModelForImageTextToText replaces the deprecated AutoModelForVision2Seq.
model = AutoModelForImageTextToText.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
processor = AutoProcessor.from_pretrained(ADAPTER)
model.eval()

image = Image.open("example.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "Describe this image in detail."},
]}]

# Render the chat template to a prompt string, then encode text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

## Results (Qwen3-VL-2B)

| Metric | Base | VISE |
| --- | --- | --- |
| COCO (CIDEr) | 21.54 | 38.39 |
| NoCaps (CIDEr) | 19.52 | 34.25 |
| Flickr30k (CIDEr) | 26.09 | 42.64 |
| TextCaps (CIDEr) | 22.20 | 41.86 |
| CHAIR-I (lower is better) | 13.21 | 8.21 |
| CHAIR-S (lower is better) | 45.96 | 40.51 |
| POPE Accuracy | 89.01 | 90.03 |
| ScienceQA | 79.42 | 83.61 |

VISE improves captioning, VQA, and reasoning and reduces hallucination at the same
time, with no tradeoffs between tasks, and the same recipe generalizes to larger
scales and other backbone families. Full results are in our paper.

## Training

- Base: `Qwen/Qwen3-VL-2B-Instruct`, vision encoder frozen.
- LoRA: `r=16`, `alpha=32`, `dropout=0.05` on the attention, MLP, and projector layers.
- Optimizer: AdamW, `lr=1e-6`, weight decay `0.01`, gradient clipping `1.0`, bfloat16.
- RL: KL-regularized REINFORCE, adaptive KL (target `0.020`), reward weights `0.5 / 0.5`.
- Data: 4,000 raw, unlabeled COCO images. No question/answer pairs or annotations.
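
The RL entry above can be sketched roughly as follows. This is a simplified, framework-free illustration, not the VISE training loop. The mean-reward baseline, the per-sample KL estimate `logp - logp_ref`, the initial `beta`, the controller gain, and the clipping range are all assumptions borrowed from common adaptive-KL practice in RL fine-tuning. Only the KL target of `0.020` and the use of a frozen reference policy come from this card. In real training the log-probabilities would be differentiable tensors; plain floats are used here to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class AdaptiveKL:
    """Proportional controller that steers the observed KL toward a target.

    Hypothetical helper; the update rule is a common adaptive-KL scheme and
    is not confirmed to be the one used for VISE.
    """
    beta: float = 0.1        # assumed initial KL coefficient
    target: float = 0.020    # KL target from the training list
    step_size: float = 0.1   # assumed controller gain

    def update(self, observed_kl: float) -> float:
        # Clip the relative error so one noisy batch cannot swing beta too far.
        error = max(-0.2, min(0.2, observed_kl / self.target - 1.0))
        self.beta *= 1.0 + self.step_size * error
        return self.beta


def reinforce_kl_loss(
    logp: Sequence[float],      # sequence log-probs under the current policy
    logp_ref: Sequence[float],  # log-probs of the same samples under the frozen reference
    rewards: Sequence[float],   # combined invariance rewards R
    beta: float,
) -> Tuple[float, float]:
    """Return (loss, mean KL estimate) for one batch of sampled responses."""
    n = len(rewards)
    baseline = sum(rewards) / n  # assumed variance-reduction baseline
    policy_term = -sum((r - baseline) * lp for r, lp in zip(rewards, logp)) / n
    kl = sum(lp - lr for lp, lr in zip(logp, logp_ref)) / n
    return policy_term + beta * kl, kl


# Example batch of two sampled responses.
controller = AdaptiveKL()
loss, kl = reinforce_kl_loss(
    logp=[-1.0, -2.0], logp_ref=[-1.5, -2.5], rewards=[1.0, 0.0], beta=controller.beta
)
new_beta = controller.update(kl)  # KL is far above target, so beta increases
```

The controller raises `beta` when the policy drifts further from the reference than the target allows and lowers it when the policy stays closer than needed, which keeps the KL penalty from dominating or vanishing over the course of training.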

## License

Apache 2.0.

## Citation

```bibtex
@inproceedings{venkatraman2026vise,
  title     = {Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models},
  author    = {Venkatraman, Shravan and Thawkar, Ritesh and Thawakar, Omkar and
               Anwer, Rao Muhammad and Cholakkal, Hisham and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```