---
license: mit
---
# IVT-LR (Qwen2-VL)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step as a combination of two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.
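
As a rough intuition only (a simplified, hypothetical sketch, not the `IVTLR` implementation shipped in the repository), one interleaved latent step could look like the code below: the last hidden state is fed back as a latent-text embedding, and a few visual embeddings are appended as latent vision, so the next step reasons over both without emitting explicit text tokens. All function and variable names here are illustrative assumptions.

```python
import torch

def interleaved_latent_step(model, inputs_embeds, attention_mask, vision_embeds):
    """Schematic sketch of one interleaved latent reasoning step (illustrative only).

    `model` is assumed to accept `inputs_embeds` and return hidden states;
    `vision_embeds` are assumed to be precomputed visual features for the image.
    """
    # Run the model on the current embedding sequence without decoding any text.
    hidden = model(
        inputs_embeds=inputs_embeds,
        attention_mask=attention_mask,
        output_hidden_states=True,
    ).hidden_states[-1]

    # Latent text: reuse the last hidden state as the next input embedding.
    latent_text = hidden[:, -1:, :]

    # Latent vision: append a few visual embeddings (here simply the first four;
    # the actual method selects them differently).
    latent_vision = vision_embeds[:, :4, :]

    # Extend the running sequence with both latent parts for the next step.
    new_embeds = torch.cat([inputs_embeds, latent_text, latent_vision], dim=1)
    new_mask = torch.ones(
        new_embeds.shape[:2], dtype=attention_mask.dtype, device=attention_mask.device
    )
    return new_embeds, new_mask
```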

---

## Usage

This repository provides pretrained Qwen2-VL models for IVT-LR on the **M3CoT** and **ScienceQA** datasets.

For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/ModalityDance/IVT-LR).

---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download Qwen2-VL model trained on M3CoT
qwen_m3cot_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Download Qwen2-VL model trained on ScienceQA
qwen_sqa_path = hf_hub_download("ModalityDance/IVTLR_QWEN_SQA", "model.pth")
```
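
If you would rather pull every file in a model repository at once (for example, to inspect it locally before loading), `snapshot_download` from the same library works as well; the repo ID below is the same one used above.

```python
from huggingface_hub import snapshot_download

# Download the full M3CoT model repository into the local Hugging Face cache
local_dir = snapshot_download("ModalityDance/IVTLR_QWEN_M3COT")
print(local_dir)  # folder containing model.pth (and any other repo files)
```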

---

### Quick Start

The following code shows how to load the pretrained IVT-LR model and run inference on a single image-text example. Replace `image` and `text` with your own inputs.

```python
from transformers import AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_ivtlr import IVTLR
from qwen_vl_utils import process_vision_info
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_fast=False,
    trust_remote_code=True,
    padding_side="right"
)
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})

# Load base model with LoRA
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)

# Create IVTLR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)
visual_start_id = tokenizer.convert_tokens_to_ids("<|vision_start|>")
visual_end_id = tokenizer.convert_tokens_to_ids("<|vision_end|>")

model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id,
    visual_start_id=visual_start_id,
    visual_end_id=visual_end_id
)

# Load checkpoint
state_dict = torch.load(checkpoint_path, map_location="cpu")
if any(k.startswith("module.") for k in state_dict.keys()):
    state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

# ============ Inference ============
# Replace with your own image and text
image = "your_image.jpg"  # PIL Image or path to image
text = "Your question here"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "resized_height": 280, "resized_width": 280},
        {"type": "text", "text": text}
    ]
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = prompt + "<|latent|>" * 3  # Add latent tokens

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        image_grid_thw=inputs["image_grid_thw"],
        max_new_tokens=512
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
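
Note that `image` in the example above accepts either a file path or a `PIL.Image`. If your input starts out as raw bytes (e.g. fetched over HTTP), a standard PIL pattern such as the one below can be used to build the `image` object first; the URL is a placeholder, not an official asset.

```python
from io import BytesIO

import requests
from PIL import Image

# From a local file
image = Image.open("your_image.jpg").convert("RGB")

# Or from raw bytes fetched over HTTP (placeholder URL)
resp = requests.get("https://example.com/sample.jpg", timeout=10)
image = Image.open(BytesIO(resp.content)).convert("RGB")
```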

---

## Citation

If you use **IVT-LR** in your research or applications, please consider citing:

```bibtex
@article{chen2025reasoning,
  title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
  author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
  journal={arXiv preprint arXiv:2510.12603},
  year={2025}
}
```