Improve model card for DiffThinker: Add metadata, links, and usage details
Hi! I'm Niels from the Hugging Face community science team. I've updated the model card for DiffThinker to enhance its discoverability and provide more comprehensive information.
This PR includes the following improvements:
- **Metadata**: Added `pipeline_tag: image-to-image` and `library_name: diffusers` so the Hub can surface the model in search filters and render automated usage snippets (a hedged loading sketch is included after this list).
- **Links**: Added direct links to the official GitHub repository and the paper for easy access to the code and the write-up.
- **Content**: Incorporated a concise summary of the model's purpose, its key features (efficiency, controllability, native parallelism, and collaboration), and detailed "Quick Start" and "Inference & Evaluation" sections with code snippets from the original GitHub README.
- **Citation**: Added a BibTeX entry for proper academic attribution.
These changes will help users better understand and utilize the DiffThinker model.
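For reference, the snippet below shows roughly what those metadata fields unlock on the Hub. It is a hedged sketch, not the repository's documented path (the repo's own scripts run through DiffSynth-Studio): the Hub repo id `lcqysl/DiffThinker`, the input file name, and the prompt are placeholders, and it assumes the checkpoint is stored in a diffusers-compatible layout that the generic `DiffusionPipeline` loader can dispatch on.

```python
# Hedged sketch, not the repo's documented path (which uses DiffSynth-Studio).
# Assumptions: "lcqysl/DiffThinker" is a placeholder Hub repo id, the checkpoint
# is in a diffusers-compatible layout, and DiffusionPipeline can dispatch to a
# Qwen-Image-Edit-style pipeline class for it.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "lcqysl/DiffThinker",  # placeholder repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

# DiffThinker frames reasoning as image-to-image: task image in, solution image out.
maze = Image.open("maze.png").convert("RGB")  # hypothetical input file
out = pipe(image=maze, prompt="Solve the maze.", num_inference_steps=40)
out.images[0].save("maze_solution.png")
```

If the checkpoint turns out not to be diffusers-native, the Quick Start instructions in the diff below remain the reliable route.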
README.md

@@ -1,17 +1,45 @@
 ---
-license: apache-2.0
-language:
-- en
 base_model:
 - Qwen/Qwen-Image-Edit-2509
+language:
+- en
+license: apache-2.0
+library_name: diffusers
+pipeline_tag: image-to-image
 ---
+
 # DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models
+
 <a href="https://diffthinker-project.github.io/"><img src="https://img.shields.io/badge/%F0%9F%8C%90%20Project-Page-2563eb" alt="Project Page"></a>
-<
-
+<a href="https://github.com/lcqysl/DiffThinker"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github" alt="GitHub"></a>
+<a href="https://huggingface.co/papers/2512.24165"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b" alt="Paper"></a>
+
+DiffThinker introduces a novel Generative Multimodal Reasoning paradigm, establishing a diffusion-based reasoning framework. It reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks compared to traditional text-centric Multimodal Large Language Models (MLLMs).
+
+### Features
+DiffThinker exhibits four core properties in its approach to vision-centric reasoning:
+- **Efficiency**: Streamlined reasoning process.
+- **Controllability**: Precise spatial and logical generation.
+- **Native Parallelism**: Advantageous for complex reasoning steps.
+- **Collaboration**: Works effectively across multiple domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration).
+
+### Quick Start
+To get started with DiffThinker, clone the official repository and install the necessary dependencies:
+```bash
+git clone https://github.com/lcqysl/DiffThinker.git
+cd DiffThinker/DiffSynth-Studio
+pip install -e .
+pip install gymnasium
+
+# (Optional) Install vLLM for OCR tasks.
+# We recommend installing it in a SEPARATE environment to avoid conflicts.
+# pip install vllm
+```
+
 ### Inference & Evaluation
-The test datasets used in our experiments
-
+The test datasets used in our experiments are provided within each task's directory. We recommend using the same data to ensure the reproducibility of our results and to facilitate comparison with other models. If you wish to generate your own test data, please refer to the `gen.txt` file in each task directory.
+
+```bash
 cd Maze
 
 # 1. Inference and Parsing
@@ -23,4 +51,14 @@ bash eval/eval_path.sh
 # 3. Individual Inference
 python ../DiffSynth-Studio/add/infer/infer.py
 python ../DiffSynth-Studio/add/infer/infer_with_middle.py
+```
+
+### Citation
+```bibtex
+@article{he2025diffthinker,
+  title={DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models},
+  author={He, Zefeng and Qu, Xiaoye and Li, Yafu and Zhu, Tong and Huang, Siyuan and Cheng, Yu},
+  journal={arXiv preprint arXiv:2512.24165},
+  year={2025}
+}
 ```
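On the "Native Parallelism" bullet in the new Features section: one concrete reading is that a diffusion reasoner can denoise several candidate solutions in a single batched pass. The sketch below illustrates that, under the same placeholder assumptions as the loading example above, plus the unverified assumption that the resolved pipeline accepts the common diffusers argument `num_images_per_prompt`.

```python
# Hedged sketch of batched candidate generation ("native parallelism").
# Same assumptions as the loading example; additionally assumes the resolved
# pipeline supports the common diffusers argument num_images_per_prompt.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "lcqysl/DiffThinker",  # placeholder repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

maze = Image.open("maze.png").convert("RGB")  # hypothetical input file
candidates = pipe(
    image=maze,
    prompt="Solve the maze.",   # hypothetical instruction
    num_inference_steps=40,
    num_images_per_prompt=4,    # four candidate solutions in one denoising batch
).images

for i, img in enumerate(candidates):
    img.save(f"maze_candidate_{i}.png")
```

How candidates are scored or selected is not shown here; the `eval/eval_path.sh` script referenced in the diff suggests task-specific checking, but its interface is not part of this PR.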