# IVT-LR
## Overview

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step as the combination of two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to perform these multimodal latent reasoning steps.
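
To make the interleaving concrete, below is a minimal, hypothetical sketch of a single latent reasoning step. Everything in it (`backbone`, `inputs_embeds`, `vision_mask`, and the pooling rule used for latent vision) is an illustrative assumption rather than the released implementation; the point is only that both latents are fed back as input embeddings instead of being decoded to tokens, which keeps the intermediate reasoning entirely in the latent space. See the GitHub repository linked below for the actual code.

```python
import torch

def latent_step(backbone, inputs_embeds, vision_mask):
    """One hypothetical interleaved latent reasoning step.

    `backbone` is any decoder that maps input embeddings of shape
    (batch, seq, dim) to hidden states of the same shape; `vision_mask`
    marks image-token positions. Illustrative sketch, not IVT-LR's code.
    """
    hidden = backbone(inputs_embeds)                   # (batch, seq, dim)
    latent_text = hidden[:, -1:, :]                    # last hidden state as latent text
    # Mean-pool hidden states over image-token positions as latent vision
    # (the actual selection rule is an assumption here).
    mask = vision_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq, 1)
    latent_vision = (hidden * mask).sum(1, keepdim=True) / mask.sum(1, keepdim=True).clamp(min=1.0)
    # Append both latents as the next two "tokens", so the following step
    # reasons jointly over text and vision in the latent space.
    return torch.cat([inputs_embeds, latent_text, latent_vision], dim=1)
```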
---

## Usage
This repository provides pretrained IVT-LR models for **Qwen2-VL on M3CoT** and **Chameleon on ScienceQA**.

For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/FYYDCC/IVT-LR).

---
### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:
```python
from huggingface_hub import hf_hub_download

# Example: download the Qwen2-VL (M3CoT) checkpoint
qwen_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="qwen_vl/model.pth")

# Example: download the Chameleon (ScienceQA) checkpoint
chameleon_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="chameleon/model.pth")
```
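
The weights can then be restored with `torch.load`. The sketch below assumes each `model.pth` is a standard PyTorch checkpoint; model construction itself is handled by the code in the GitHub repository, so the constructor shown here is only a placeholder.

```python
import torch

# Assumes model.pth stores a state dict; `build_ivtlr_model` is an
# illustrative placeholder for the model-construction code in the repo.
state_dict = torch.load(qwen_model_path, map_location="cpu")

# model = build_ivtlr_model("qwen2-vl")  # hypothetical helper from the repo
# model.load_state_dict(state_dict)
```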