---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
---
# IVT-LR

## Overview

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.
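
For intuition only, the sketch below shows one possible shape of a single latent reasoning step, where the model's last hidden states are fed back as latent text and latent vision embeddings instead of decoded tokens. This is a minimal illustration, not the released implementation: the helper name `latent_reasoning_step`, the parameters `num_latent_text`/`num_latent_vision`, and the choice of which hidden states serve as each latent part are assumptions; see the GitHub repository linked below for the actual method.

```python
import torch


def latent_reasoning_step(model, inputs_embeds, attention_mask,
                          num_latent_text=4, num_latent_vision=4):
    """One illustrative latent reasoning step (hypothetical helper,
    not the official IVT-LR code)."""
    outputs = model(inputs_embeds=inputs_embeds,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
    hidden = outputs.hidden_states[-1]  # (batch, seq_len, dim)

    # Latent text: last hidden states of the final few positions (assumption).
    latent_text = hidden[:, -num_latent_text:, :]
    # Latent vision: hidden states at assumed image-token positions;
    # here simply the first few positions, purely for illustration.
    latent_vision = hidden[:, :num_latent_vision, :]

    # Append both latent parts to the input sequence for the next step.
    step_embeds = torch.cat([latent_vision, latent_text], dim=1)
    new_embeds = torch.cat([inputs_embeds, step_embeds], dim=1)
    new_mask = torch.nn.functional.pad(
        attention_mask, (0, step_embeds.size(1)), value=1)
    return new_embeds, new_mask
```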
---

## Usage
This repository provides pretrained models for **Qwen2-VL on M3CoT** and **Chameleon on ScienceQA**.

For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/FYYDCC/IVT-LR).
---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download

# Example: download the Qwen2-VL (M3CoT) checkpoint
qwen_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="qwen_vl/model.pth")

# Example: download the Chameleon (ScienceQA) checkpoint
chameleon_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="chameleon/model.pth")
```
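
As a rough sketch (not the official loading code), the downloaded Qwen2-VL weights could then be applied on top of the corresponding base model with `transformers` and `torch`. The base checkpoint ID `Qwen/Qwen2-VL-7B-Instruct` and the assumption that `model.pth` holds a plain state dict are guesses; the GitHub repository linked above contains the exact loading and inference scripts.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Assumed base checkpoint; the repository may use a different Qwen2-VL variant.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Assumes model.pth is a plain state dict of the fine-tuned IVT-LR weights.
state_dict = torch.load(qwen_model_path, map_location="cpu")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)} keys, unexpected: {len(unexpected)} keys")
```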