---
license: mit
---
# IVT-LR (Qwen2-VL)
## Overview
This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).
Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first vision-language model (VLM) framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables multimodal large language models (MLLMs) to perform these multimodal latent reasoning steps.
---
## Usage
This repository provides pretrained Qwen2-VL models for IVT-LR, trained on the **M3CoT** and **ScienceQA** datasets.
For detailed usage, including inference code and training scripts, see the [GitHub repository](https://github.com/ModalityDance/IVT-LR).
---
### Download Models
You can download the models directly from Hugging Face using `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download
# Download Qwen2-VL model trained on M3CoT
qwen_m3cot_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")
# Download Qwen2-VL model trained on ScienceQA
qwen_sqa_path = hf_hub_download("ModalityDance/IVTLR_QWEN_SQA", "model.pth")
```
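If you switch between the two checkpoints often, a small lookup can keep the repository IDs in one place. The sketch below is illustrative only: the helper names `resolve_repo` and `download_checkpoint` and the short dataset keys are hypothetical, while the repository IDs and the `model.pth` filename are the ones shown above.

```python
# Hypothetical convenience helpers; only the repository IDs and the
# "model.pth" filename come from this model card.
IVTLR_QWEN_REPOS = {
    "m3cot": "ModalityDance/IVTLR_QWEN_M3COT",
    "scienceqa": "ModalityDance/IVTLR_QWEN_SQA",
}


def resolve_repo(dataset: str) -> str:
    """Map a dataset name (case-insensitive) to its Hugging Face repo ID."""
    key = dataset.strip().lower()
    if key not in IVTLR_QWEN_REPOS:
        known = ", ".join(sorted(IVTLR_QWEN_REPOS))
        raise ValueError(f"Unknown dataset {dataset!r}; expected one of: {known}")
    return IVTLR_QWEN_REPOS[key]


def download_checkpoint(dataset: str, filename: str = "model.pth") -> str:
    """Download the checkpoint for `dataset` and return its local path."""
    # Imported lazily so resolve_repo() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(resolve_repo(dataset), filename)
```

Calling `download_checkpoint("m3cot")` would then fetch the same file as the first `hf_hub_download` call above.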
---
### Quick Start
The following code shows how to load the pretrained IVT-LR model and run inference on a single image-text example. Replace `image` and `text` with your own input.
```python
from transformers import AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_ivtlr import IVTLR
from qwen_vl_utils import process_vision_info
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# Download model
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")
# Load processor and tokenizer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_fast=False,
    trust_remote_code=True,
    padding_side="right"
)
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})
# Load base model with LoRA
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map=device,  # use the device selected above instead of hard-coding "cuda"
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)
# Create IVTLR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)
visual_start_id = tokenizer.convert_tokens_to_ids("<|vision_start|>")
visual_end_id = tokenizer.convert_tokens_to_ids("<|vision_end|>")
model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id,
    visual_start_id=visual_start_id,
    visual_end_id=visual_end_id
)
# Load checkpoint
state_dict = torch.load(checkpoint_path, map_location="cpu")
# Strip the "module." prefix that DataParallel/DistributedDataParallel adds to keys
if any(k.startswith("module.") for k in state_dict.keys()):
    state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()
# ============ Inference ============
# Replace with your own image and text
image = "your_image.jpg" # PIL Image or path to image
text = "Your question here"
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "resized_height": 280, "resized_width": 280},
        {"type": "text", "text": text}
    ]
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = prompt + "<|latent|>" * 3 # Add latent tokens
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        image_grid_thw=inputs["image_grid_thw"],
        max_new_tokens=512
    )
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
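The checkpoint-loading step above removes the `module.` prefix with `str.replace`, which also rewrites any later occurrence of `module.` inside a key. Whether the released checkpoints contain such keys is not stated here, so the following is only a defensive sketch: `strip_module_prefix` is a hypothetical helper that removes the prefix only when it appears at the start of a key.

```python
from typing import Any, Dict


def strip_module_prefix(state_dict: Dict[str, Any], prefix: str = "module.") -> Dict[str, Any]:
    """Return a copy of `state_dict` with a leading `prefix` removed from each key.

    Keys that do not start with `prefix` are left unchanged, and occurrences of
    `prefix` in the middle of a key are preserved.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with made-up keys; real checkpoints hold tensors as values.
example = {"module.base.weight": 1, "module.sub_module.bias": 2}
cleaned = strip_module_prefix(example)
print(sorted(cleaned))  # ['base.weight', 'sub_module.bias']
```

With `str.replace`, the second toy key would collapse to `sub_bias`; the prefix-only version keeps `sub_module.bias` intact. In the Quick Start, this would be applied as `state_dict = strip_module_prefix(state_dict)` before `model.load_state_dict`.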
---
## Citation
If you use **IVT-LR** in your research or applications, please consider citing:
```bibtex
@article{chen2025reasoning,
title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
journal={arXiv preprint arXiv:2510.12603},
year={2025}
}
```