Add link to paper #2
by nielsr (HF Staff) - opened

README.md CHANGED
```diff
@@ -7,6 +7,8 @@ pipeline_tag: image-text-to-text
 
 ## Overview
 
+This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).
+
 Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and implements multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps.
 
 ---
```