---
license: mit
---

# IVT-LR (Chameleon)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first vision-language model (VLM) framework that unifies textual and visual representations in latent space and performs multimodal latent reasoning. IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. A progressive multi-stage training strategy then teaches multimodal LLMs (MLLMs) to carry out these latent reasoning steps.

---

## Usage

This repository provides pretrained IVT-LR checkpoints for Chameleon, trained on the **M3CoT** and **ScienceQA** datasets.

For detailed usage, including inference code and training scripts, see the [GitHub repository](https://github.com/ModalityDance/IVT-LR).

---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download Chameleon model trained on M3CoT
chameleon_m3cot_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_M3COT", "model.pth")

# Download Chameleon model trained on ScienceQA
chameleon_sqa_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_SQA", "model.pth")
```
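
The two repository ids above differ only in their dataset suffix, so a small helper can pick the checkpoint by dataset name. The following is a minimal sketch, not part of the released code: `CHECKPOINT_REPOS`, `repo_for`, and `download_checkpoint` are hypothetical names, and the `model.pth` filename is taken from the snippet above.

```python
# Hypothetical helper: repo ids and the "model.pth" filename come from the snippet above.
CHECKPOINT_REPOS = {
    "m3cot": "ModalityDance/IVTLR_CHAMELEON_M3COT",
    "scienceqa": "ModalityDance/IVTLR_CHAMELEON_SQA",
}


def repo_for(dataset: str) -> str:
    """Return the Hugging Face repo id for a dataset name (case-insensitive)."""
    key = dataset.strip().lower()
    if key not in CHECKPOINT_REPOS:
        raise ValueError(
            f"Unknown dataset {dataset!r}; expected one of {sorted(CHECKPOINT_REPOS)}"
        )
    return CHECKPOINT_REPOS[key]


def download_checkpoint(dataset: str) -> str:
    """Download model.pth for the given dataset and return its local path."""
    # Imported lazily so that repo_for() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_for(dataset), "model.pth")
```

Calling `download_checkpoint("scienceqa")` would then fetch the ScienceQA checkpoint into the local Hugging Face cache and return its path.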

---

### Quick Start

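The following code shows how to load the pretrained IVT-LR model and run inference on a single image-text example. Replace `image` and `text` with your own input. The `chameleon_ivtlr` module that provides the `IVTLR` class is not a published package; it is expected to come from the GitHub repository linked above, so run the snippet from a checkout of that code.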

```python
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
from chameleon_ivtlr import IVTLR
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_M3COT", "model.pth")

# Load processor and tokenizer
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
tokenizer = processor.tokenizer
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})

# Load base model with LoRA
base_model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b",
    device_map=device,  # follow the device selected above instead of hard-coding "cuda"
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)

# Create IVTLR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)

model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id
)

# Load checkpoint
state_dict = torch.load(checkpoint_path, map_location="cpu")
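# Strip the leading "module." that DataParallel/DDP adds to parameter names.
# Only the prefix is removed, so "module." elsewhere in a key is left untouched.
if any(k.startswith("module.") for k in state_dict):
    state_dict = {
        (k[len("module."):] if k.startswith("module.") else k): v
        for k, v in state_dict.items()
    }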
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

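# ============ Inference ============
# Replace with your own image and text
from PIL import Image  # imported here so this section can be copied on its own

# Load the file into a PIL image; passing a bare path string may not be accepted by the processor
image = Image.open("your_image.jpg").convert("RGB")
text = "Your question here"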

prompt = f"<image>{text}<|latent|><|latent|><|latent|>"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
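
Two steps in the Quick Start are easy to get wrong when adapting it: the prompt layout (image token, question, then trailing `<|latent|>` placeholders) and the removal of the `module.` prefix from checkpoint keys. The sketch below isolates both as pure-Python helpers. `build_prompt` and `strip_module_prefix` are hypothetical names, not part of the IVT-LR code; the default of three latent tokens simply mirrors the snippet above, and whether other counts work with the released checkpoints is an assumption to verify against the GitHub repository.

```python
from typing import Any, Dict

LATENT_TOKEN = "<|latent|>"  # special token registered in the Quick Start above


def build_prompt(question: str, num_latent: int = 3, image_token: str = "<image>") -> str:
    """Assemble the prompt: image token, question text, then latent placeholders."""
    if num_latent < 0:
        raise ValueError("num_latent must be non-negative")
    return f"{image_token}{question}{LATENT_TOKEN * num_latent}"


def strip_module_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Remove a leading 'module.' from each key, leaving other keys unchanged."""
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }
```

With these helpers, the Quick Start lines become `prompt = build_prompt(text)` and `state_dict = strip_module_prefix(state_dict)`.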

---

## Citation

If you use **IVT-LR** in your research or applications, please consider citing:

```bibtex
@article{chen2025reasoning,
  title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
  author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
  journal={arXiv preprint arXiv:2510.12603},
  year={2025}
}
```