	---
	license: cc-by-nc-4.0
	pipeline_tag: image-text-to-text
	---

	# IVT-LR

	## Overview

	This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

	Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first vision-language model (VLM) framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step as a combination of two implicit parts: latent text and latent vision. We further introduce a progressive multi-stage training strategy that enables multimodal large language models (MLLMs) to perform these latent reasoning steps.

	---

	## Usage

	This repository provides pretrained IVT-LR checkpoints for two backbone and dataset pairs: Qwen2-VL on M3CoT and Chameleon on ScienceQA.

	For detailed usage, including inference code and training scripts, see the [GitHub repository](https://github.com/FYYDCC/IVT-LR).

	---

	### Download Models

	You can download the models directly from Hugging Face using `huggingface_hub`:

	```python
	from huggingface_hub import hf_hub_download

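	# hf_hub_download fetches a single file and returns its local cache path.

	# Example: download the Qwen2-VL checkpoint
	qwen_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="qwen_vl/model.pth")

	# Example: download the Chameleon checkpoint
	chameleon_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="chameleon/model.pth")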
	```
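
If you switch between the two checkpoints in a script, a small lookup keyed by backbone name can keep the file paths in one place. The sketch below is illustrative only: `CHECKPOINTS`, `checkpoint_filename`, and `download_checkpoint` are hypothetical names that are not part of this repository, and the only library call used is `hf_hub_download` with its `repo_id`, `filename`, and `local_dir` parameters.

```python
from typing import Optional

REPO_ID = "FYYDCC/IVTLR"

# In-repo checkpoint paths, copied from the download example above.
# The dictionary keys are hypothetical labels chosen for this sketch.
CHECKPOINTS = {
    "qwen_vl": "qwen_vl/model.pth",      # Qwen2-VL on M3CoT
    "chameleon": "chameleon/model.pth",  # Chameleon on ScienceQA
}


def checkpoint_filename(backbone: str) -> str:
    """Return the in-repo path of the checkpoint for a backbone label."""
    try:
        return CHECKPOINTS[backbone]
    except KeyError:
        raise ValueError(
            f"unknown backbone {backbone!r}; expected one of {sorted(CHECKPOINTS)}"
        ) from None


def download_checkpoint(backbone: str, local_dir: Optional[str] = None) -> str:
    """Download one checkpoint and return the local path of the file."""
    # Imported lazily so the lookup above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(backbone),
        local_dir=local_dir,  # None keeps the default Hugging Face cache
    )
```

Calling `download_checkpoint("qwen_vl")` would then return the local path of the Qwen2-VL checkpoint. How the `.pth` file is loaded into a model is defined by the code in the GitHub repository, so it is not sketched here.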