---
license: mit
---
|
|
# IVT-LR (Qwen2-VL)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to perform these multimodal latent reasoning steps.
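As an intuition for what a single latent step does, the toy sketch below fuses a hidden state (standing in for latent text) with attention-pooled visual embeddings (standing in for latent vision). The function name, dimensions, and the additive fusion rule are illustrative assumptions for exposition only, not the paper's actual implementation:

```python
import torch

def latent_step(hidden, visual_embeds):
    # Hypothetical sketch: attention-pool the visual embeddings against the
    # current hidden state, then fuse the pooled result back into it.
    scores = visual_embeds @ hidden       # (num_patches,) similarity scores
    attn = torch.softmax(scores, dim=0)   # attention weights over patches
    latent_vision = attn @ visual_embeds  # pooled "latent vision", (dim,)
    return hidden + latent_vision         # next latent state

hidden = torch.randn(64)      # "latent text" carried between steps
visual = torch.randn(16, 64)  # 16 visual patch embeddings
for _ in range(3):            # three interleaved latent reasoning steps
    hidden = latent_step(hidden, visual)
print(hidden.shape)  # torch.Size([64])
```

The point of the sketch is only that each step consumes the previous latent state together with visual features and emits a new latent state, rather than decoding intermediate text.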
|
|
|
|
|
---

## Usage

This repository provides pretrained Qwen2-VL models for IVT-LR trained on the **M3CoT** and **ScienceQA** datasets.

For detailed usage, including inference code and training scripts, see the [GitHub repository](https://github.com/ModalityDance/IVT-LR).
|
|
|
|
|
---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download the Qwen2-VL model trained on M3CoT
qwen_m3cot_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Download the Qwen2-VL model trained on ScienceQA
qwen_sqa_path = hf_hub_download("ModalityDance/IVTLR_QWEN_SQA", "model.pth")
```
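To switch between the two checkpoints programmatically, a small lookup table over the repo IDs above works. The `download_checkpoint` helper and the dataset keys are illustrative choices, not part of the released code:

```python
# Repo IDs from the snippet above, keyed by dataset name (keys are our choice)
CHECKPOINTS = {
    "m3cot": "ModalityDance/IVTLR_QWEN_M3COT",
    "scienceqa": "ModalityDance/IVTLR_QWEN_SQA",
}

def download_checkpoint(dataset: str) -> str:
    """Return the local cache path of the requested checkpoint."""
    # Imported lazily so the mapping can be inspected without triggering a download
    from huggingface_hub import hf_hub_download
    return hf_hub_download(CHECKPOINTS[dataset], "model.pth")
```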
|
|
|
|
|
---

### Quick Start

The following code shows how to load the pretrained IVT-LR model and run inference on a single image-text example. Replace `image` and `text` with your own input.
|
|
|
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_ivtlr import IVTLR
from qwen_vl_utils import process_vision_info
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_fast=False,
    trust_remote_code=True,
    padding_side="right"
)
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})

# Load base model with LoRA
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64, lora_alpha=16, lora_dropout=0.05, bias="none", inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)

# Create IVTLR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)
visual_start_id = tokenizer.convert_tokens_to_ids("<|vision_start|>")
visual_end_id = tokenizer.convert_tokens_to_ids("<|vision_end|>")

model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id,
    visual_start_id=visual_start_id,
    visual_end_id=visual_end_id
)

# Load checkpoint (strip a possible DataParallel "module." prefix)
state_dict = torch.load(checkpoint_path, map_location="cpu")
if any(k.startswith("module.") for k in state_dict.keys()):
    state_dict = {k.replace("module.", "", 1): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

# ============ Inference ============
# Replace with your own image and text
image = "your_image.jpg"  # PIL Image or path to image
text = "Your question here"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "resized_height": 280, "resized_width": 280},
        {"type": "text", "text": text}
    ]
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = prompt + "<|latent|>" * 3  # Append latent tokens

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        image_grid_thw=inputs["image_grid_thw"],
        max_new_tokens=512
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
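Note that `outputs[0]` may contain the prompt tokens as well as the answer, as standard `transformers` `generate` does; if so, slice off the prompt length before decoding. A toy illustration of the slicing, with plain lists standing in for the tensors above:

```python
# Toy stand-ins: 3 prompt tokens followed by 3 generated tokens
prompt_len = 3                      # i.e. inputs["input_ids"].shape[1] above
output_ids = [101, 102, 103, 7, 8, 9]
gen_only = output_ids[prompt_len:]  # keep only the newly generated tokens
print(gen_only)  # [7, 8, 9]
```

In the script above this corresponds to decoding `outputs[0][inputs["input_ids"].shape[1]:]` instead of `outputs[0]`.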
|
|
|
|
|
---

## Citation

If you use **IVT-LR** in your research or applications, please consider citing:

```bibtex
@article{chen2025reasoning,
  title={Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space},
  author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
  journal={arXiv preprint arXiv:2510.12603},
  year={2025}
}
```
|
|
|