luckychao committed · verified
Commit 0a33b90 · Parent: d9db579

Update README.md

Files changed (1): README.md +4 −4
README.md CHANGED

@@ -7,7 +7,7 @@ library_name: ThinkMorph-7B
 ---
 
 <p align="center">
-<img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/logo.png" width="40%"> <br>
+<img src="https://github.com/ThinkMorph/ThinkMorph/raw/main/assets/logo.png" width="40%"> <br>
 </p>
 
 
@@ -58,11 +58,11 @@ For installation, usage instructions, and further documentation, please visit ou
 Multimodal reasoning demands synergistic coordination of language and vision. However, determining what constitutes meaningful interleaved reasoning is non-trivial, and current approaches lack a generalizable recipe.
 We present **ThinkMorph**, a unified model that enables such generalization through a principled approach: treating text and images as complementary modalities that mutually advance reasoning.
 <p align="center">
-<img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/interleaved_design.jpg" width="100%"> <br>
+<img src="https://github.com/ThinkMorph/ThinkMorph/raw/main/assets/interleaved_design.jpg" width="100%"> <br>
 </p>
 Guided by this principle, we identify tasks requiring concrete, verifiable visual engagement and design a high-quality data pipeline that trains models to generate interleaved images and text as progressive reasoning traces.
 <p align="center">
-<img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/thinkmorph_main.jpg" width="100%"> <br>
+<img src="https://github.com/ThinkMorph/ThinkMorph/raw/main/assets/thinkmorph_main.jpg" width="100%"> <br>
 </p>
 
 ThinkMorph delivers substantial gains on **vision-centric** tasks, achieving an average improvement of 34.74% over the base model while consistently surpassing text-only and image-only modes.
@@ -70,7 +70,7 @@ By fine-tuning with **merely ~24K** samples, it achieves out-of-domain performan
 
 Intriguingly, ThinkMorph unlocks emergent properties that represent a *hallmark of multimodal intelligence*: the elicitation of unseen visual manipulation skills, the self-adaptive switching between reasoning modes according to task complexity, and better test-time scaling via diversified thoughts.
 <p align="center">
-<img src="https://github.com/ThinkMorph/ThinkMorph/blob/main/assets/emrging_prop.jpg" width="100%"> <br>
+<img src="https://github.com/ThinkMorph/ThinkMorph/raw/main/assets/emrging_prop.jpg" width="100%"> <br>
 </p>
 These findings suggest promising directions for future work to characterize the emergent capabilities of unified models for multimodal reasoning.