Add link to paper #2
by nielsr (HF Staff) - opened

README.md CHANGED
```diff
@@ -7,6 +7,8 @@ pipeline_tag: image-text-to-text
 
 ## Overview
 
+This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).
+
 Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and implements multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps.
 
 ---
```