FYYDCC committed on
Commit
024a05a
·
verified ·
1 Parent(s): 23336b5

Create README.md

  1. README.md +28 -0
README.md ADDED
@@ -0,0 +1,28 @@
+ # IVT-LR
+
+ ## Overview
+
+ Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework to unify textual and visual representations in the latent space and perform multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables multimodal LLMs (MLLMs) to perform these latent reasoning steps.
+
+ ---
+
+ ## Usage
+
+ This repository provides pretrained models for **Qwen2-VL on M3CoT** and **Chameleon on ScienceQA**.
+
+ For detailed usage, including inference code and training scripts, see the [GitHub repository](https://github.com/FYYDCC/IVT-LR).
+
+ ---
+
+ ### Download Models
+
+ You can download the models directly from Hugging Face using `huggingface_hub`:
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Download the Qwen2-VL (M3CoT) checkpoint; the call returns the local path of the cached file
+ qwen_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="qwen_vl/model.pth")
+
+ # Download the Chameleon (ScienceQA) checkpoint
+ chameleon_model_path = hf_hub_download(repo_id="FYYDCC/IVTLR", filename="chameleon/model.pth")
+ ```
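
The two download calls above differ only in the file path, so they could be wrapped in a small helper. The following is a minimal sketch, not part of the IVT-LR codebase: `REPO_ID`, `CHECKPOINTS`, `checkpoint_filename`, and `download_checkpoint` are hypothetical names introduced here for illustration, and the file paths are copied from the README example. The only library call used is `hf_hub_download` with its `repo_id`, `filename`, and `cache_dir` parameters.

```python
from typing import Optional

# Repository id taken from the README example.
REPO_ID = "FYYDCC/IVTLR"

# Hypothetical lookup table (an assumption for illustration, not an IVT-LR API).
# The file paths are the two shown in the README example.
CHECKPOINTS = {
    "qwen_vl": "qwen_vl/model.pth",
    "chameleon": "chameleon/model.pth",
}


def checkpoint_filename(backbone: str) -> str:
    """Return the in-repo checkpoint path for a backbone key."""
    try:
        return CHECKPOINTS[backbone]
    except KeyError:
        raise ValueError(
            f"unknown backbone {backbone!r}; expected one of {sorted(CHECKPOINTS)}"
        ) from None


def download_checkpoint(backbone: str, cache_dir: Optional[str] = None) -> str:
    """Download one checkpoint and return the local path of the cached file."""
    # Imported lazily so the lookup helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(backbone),
        cache_dir=cache_dir,  # None falls back to the default Hugging Face cache
    )
```

With this sketch, `download_checkpoint("chameleon")` fetches the Chameleon checkpoint and returns its local path. An unrecognized key raises `ValueError` before any network request is made. How the resulting `.pth` file is loaded depends on the inference code in the GitHub repository.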