---
license: mit
---
# IVT-LR (Chameleon)
## Overview
This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).
Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.
---
## Usage
This repository provides pretrained Chameleon models for IVT-LR on the **M3CoT** and **ScienceQA** datasets.
For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/ModalityDance/IVT-LR).
---
### Download Models
You can download the models directly from Hugging Face using `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download
# Download Chameleon model trained on M3CoT
chameleon_m3cot_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_M3COT", "model.pth")
# Download Chameleon model trained on ScienceQA
chameleon_sqa_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_SQA", "model.pth")
```
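If you prefer to keep the checkpoints in a fixed directory rather than the default Hugging Face cache, `hf_hub_download` also accepts a `local_dir` argument. A minimal sketch (the target directory below is only an example):
```python
from huggingface_hub import hf_hub_download

# Optional: download into a directory of your choice instead of the HF cache.
# "checkpoints/ivtlr_chameleon_m3cot" is only an example path.
chameleon_m3cot_path = hf_hub_download(
    "ModalityDance/IVTLR_CHAMELEON_M3COT",
    "model.pth",
    local_dir="checkpoints/ivtlr_chameleon_m3cot",
)
print(chameleon_m3cot_path)  # local path to the downloaded model.pth
```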
---
### Quick Start
The following code shows how to load a pretrained IVT-LR model and run inference on a single image-text example. The `IVTLR` wrapper class is imported from `chameleon_ivtlr`, which is provided by the [GitHub repository](https://github.com/ModalityDance/IVT-LR), so make sure it is on your Python path. Replace `image` and `text` with your own input.
```python
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
from chameleon_ivtlr import IVTLR
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
from PIL import Image
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the checkpoint trained on M3CoT
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_CHAMELEON_M3COT", "model.pth")

# Load processor and tokenizer, and register the latent-reasoning special tokens
processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
tokenizer = processor.tokenizer
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})

# Load base model with LoRA
base_model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b",
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)

# Create IVT-LR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)

model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id
)

# Load checkpoint (strip the DataParallel/DDP "module." prefix if present)
state_dict = torch.load(checkpoint_path, map_location="cpu")
if any(k.startswith("module.") for k in state_dict.keys()):
    state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

# ============ Inference ============
# Replace with your own image and question
image = Image.open("your_image.jpg").convert("RGB")
text = "Your question here"
prompt = f"<image>{text}<|latent|><|latent|><|latent|>"

inputs = processor(
    images=image,
    text=prompt,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
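The prompt above appends three `<|latent|>` placeholders after the question. If you want to experiment with a different number of placeholders, the prompt can be built programmatically. Note that `build_prompt` below is only an illustrative helper (not part of the released code), and the assumption that each `<|latent|>` token corresponds to one latent reasoning step should be checked against the training configuration in the GitHub repository.
```python
def build_prompt(question: str, num_latent: int = 3) -> str:
    # Illustrative helper: image placeholder, then the question,
    # then `num_latent` latent tokens. Match `num_latent` to the
    # setting used during training.
    return "<image>" + question + "<|latent|>" * num_latent

# Reuses `processor`, `image`, and `device` from the Quick Start above.
prompt = build_prompt("Your question here", num_latent=3)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
```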
---
## Citation
If you use **IVT-LR** in your research or applications, please consider citing:
```bibtex
@article{chen2025reasoning,
title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
journal={arXiv preprint arXiv:2510.12603},
year={2025}
}
```