---
license: mit
---

# IVT-LR (Qwen2-VL)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step as a combination of two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.

---

## Usage

This repository provides pretrained Qwen2-VL models for IVT-LR on the **M3CoT** and **ScienceQA** datasets. For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/ModalityDance/IVT-LR).

---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download Qwen2-VL model trained on M3CoT
qwen_m3cot_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Download Qwen2-VL model trained on ScienceQA
qwen_sqa_path = hf_hub_download("ModalityDance/IVTLR_QWEN_SQA", "model.pth")
```
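If you prefer to mirror an entire model repository locally (e.g., for offline use), `snapshot_download` from `huggingface_hub` is an alternative to fetching the single checkpoint file. This is an optional sketch; the `local_dir` value below is only an illustrative folder name, not something required by IVT-LR.

```python
from huggingface_hub import snapshot_download

# Download the full M3CoT model repository (checkpoint plus any auxiliary files)
# into a local folder of your choosing.
local_dir = snapshot_download(
    "ModalityDance/IVTLR_QWEN_M3COT",
    local_dir="ivtlr_qwen_m3cot",  # example path; any writable directory works
)
print(local_dir)
```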
---

### Quick Start

The following code shows how to load the pretrained IVT-LR model and run inference on a single image-text example. Replace `image` and `text` with your own input.

```python
from transformers import AutoTokenizer, AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_ivtlr import IVTLR
from qwen_vl_utils import process_vision_info
from peft import LoraConfig, get_peft_model
from huggingface_hub import hf_hub_download
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download model
checkpoint_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Load processor and tokenizer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_fast=False,
    trust_remote_code=True,
    padding_side="right"
)
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|start-latent|>", "<|end-latent|>", "<|latent|>"]
})

# Load base model with LoRA
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager"
)
base_model.resize_token_embeddings(len(tokenizer))
processor.tokenizer = tokenizer

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    inference_mode=False
)
base_model = get_peft_model(base_model, lora_config)

# Create IVTLR model
latent_id = tokenizer.convert_tokens_to_ids("<|latent|>")
start_id = tokenizer.convert_tokens_to_ids("<|start-latent|>")
end_id = tokenizer.convert_tokens_to_ids("<|end-latent|>")
image_token_id = tokenizer.convert_tokens_to_ids(processor.image_token)
visual_start_id = tokenizer.convert_tokens_to_ids("<|vision_start|>")
visual_end_id = tokenizer.convert_tokens_to_ids("<|vision_end|>")

model = IVTLR(
    base_model,
    latent_token_id=latent_id,
    start_latent_id=start_id,
    end_latent_id=end_id,
    eos_token_id=tokenizer.eos_token_id,
    image_token_id=image_token_id,
    visual_start_id=visual_start_id,
    visual_end_id=visual_end_id
)

# Load checkpoint (strip any DataParallel "module." prefixes before loading)
state_dict = torch.load(checkpoint_path, map_location="cpu")
if any(k.startswith("module.") for k in state_dict.keys()):
    state_dict = {k.replace("module.", ""): v for k, v in state_dict.items()}
model.load_state_dict(state_dict, strict=True)
model = model.to(device)
model.eval()

# ============ Inference ============
# Replace with your own image and text
image = "your_image.jpg"  # PIL Image or path to image
text = "Your question here"

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image, "resized_height": 280, "resized_width": 280},
        {"type": "text", "text": text}
    ]
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = prompt + "<|latent|>" * 3  # Add latent tokens

image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
        image_grid_thw=inputs["image_grid_thw"],
        max_new_tokens=512
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

---

## Citation

If you use **IVT-LR** in your research or applications, please consider citing:

```bibtex
@article{chen2025reasoning,
  title={Reasoning in the dark: Interleaved vision-text reasoning in latent space},
  author={Chen, Chao and Ma, Zhixin and Li, Yongqi and Hu, Yupeng and Wei, Yinwei and Li, Wenjie and Nie, Liqiang},
  journal={arXiv preprint arXiv:2510.12603},
  year={2025}
}
```