---
license: mit
---

# IVT-LR (Chameleon)

## Overview

This model was presented in the paper [Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space](https://huggingface.co/papers/2510.12603).

Interleaved Vision-Text Latent Reasoning (IVT-LR) is the first VLM framework that unifies textual and visual representations in the latent space and implements multimodal latent reasoning. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: **latent text** and **latent vision**. We further introduce a progressive multi-stage training strategy to enable MLLMs to perform these multimodal latent reasoning steps.

---

## Usage

This repository provides pretrained Chameleon models for IVT-LR on the **M3CoT** and **ScienceQA** datasets.

For detailed usage, including inference code and training scripts, please refer to the [GitHub repository](https://github.com/ModalityDance/IVT-LR).

---

### Download Models

You can download the models directly from Hugging Face using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download

# Download Chameleon model trained on M3CoT
chameleon_m3cot_path = hf_hub_download(repo_id="ModalityDance/IVTLR_CHAMELEON_M3COT", filename="model.pth")

# Download Chameleon model trained on ScienceQA
chameleon_sqa_path = hf_hub_download(repo_id="ModalityDance/IVTLR_CHAMELEON_SQA", filename="model.pth")
```
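
After downloading, you can sanity-check a checkpoint with plain PyTorch. The snippet below is only a sketch, not the official loading code: it assumes `model.pth` is a standard `torch.load`-compatible checkpoint and reuses the `chameleon_m3cot_path` variable from the snippet above. To build and run the actual IVT-LR Chameleon model, use the loading utilities provided in the GitHub repository.

```python
import torch

# Sketch only: inspect the downloaded checkpoint file.
# Assumption: `model.pth` is a regular PyTorch checkpoint (a state dict or a
# dict containing one); the real model construction lives in the IVT-LR repo.
checkpoint = torch.load(chameleon_m3cot_path, map_location="cpu")

# If the checkpoint wraps its weights in a "state_dict" entry, unwrap it.
state_dict = checkpoint.get("state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint

# Print a few parameter names and shapes to confirm the download is intact.
if isinstance(state_dict, dict):
    for name, tensor in list(state_dict.items())[:5]:
        shape = tuple(tensor.shape) if hasattr(tensor, "shape") else type(tensor).__name__
        print(name, shape)
```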