# IVT-LR (Qwen2-VL)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and performs multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy that enables MLLMs to carry out these multimodal latent reasoning steps.

---

## Usage

This repository provides pretrained Qwen2-VL models for IVT-LR on the **M3CoT** and **ScienceQA** datasets.

For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/ModalityDance/IVT-LR).

---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download the Qwen2-VL model trained on M3CoT
qwen_m3cot_path = hf_hub_download("ModalityDance/IVTLR_QWEN_M3COT", "model.pth")

# Download the Qwen2-VL model trained on ScienceQA
qwen_sqa_path = hf_hub_download("ModalityDance/IVTLR_QWEN_SQA", "model.pth")
```