EndoCoT / README.md

Add library_name and pipeline_tag to metadata (#1)

b5898a3 2 days ago

8.47 kB

	---
	base_model:
	- Qwen/Qwen-Image-Edit-2511
	datasets:
	- internlm/EndoCoT-Data
	language:
	- en
	license: mit
	library_name: diffusers
	pipeline_tag: image-to-image
	---

	<p align="center"> <img src="fig/banner.svg" alt="EndoCoT" width="900"/> </p>

	<p align="center">
	<a href="https://github.com/InternLM/EndoCoT"><img src="https://img.shields.io/github/stars/InternLM/EndoCoT?style=flat-square&logo=github&label=Stars&color=FFB300"></a>
	<a href="https://github.com/InternLM/EndoCoT/forks"><img src="https://img.shields.io/github/forks/InternLM/EndoCoT?style=flat-square&logo=github&label=Forks&color=2196F3"></a>
	<a href="https://github.com/InternLM/EndoCoT/issues"><img src="https://img.shields.io/github/issues/InternLM/EndoCoT?style=flat-square&logo=github&label=Issues&color=4CAF50"></a>
	<a href="https://github.com/InternLM/EndoCoT/blob/main/LICENSE"><img src="https://img.shields.io/github/license/InternLM/EndoCoT?style=flat-square&label=License&color=9C27B0"></a>
	<br>
	<a href="https://arxiv.org/abs/2603.12252"><img src="https://img.shields.io/badge/Paper-arXiv-B31B1B?style=flat-square"></a>
	<a href="https://internlm.github.io/EndoCoT/"><img src="https://img.shields.io/badge/Homepage-Project-blue?style=flat-square"></a>
	<a href="https://huggingface.co/internlm/EndoCoT"><img src="https://img.shields.io/badge/Model-HuggingFace-yellow?style=flat-square"></a>
	<a href="https://huggingface.co/datasets/internlm/EndoCoT-Data"><img src="https://img.shields.io/badge/Dataset-HuggingFace-orange?style=flat-square"></a>
	<br>
	<br>
	<img src="fig/teaser.jpg" alt="Teaser" width="100%" style="border-radius: 10px; box-shadow: 0 6px 20px rgba(0,0,0,0.2);">
	</p>




	# EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

	This repository contains the official model checkpoints for EndoCoT, as presented in the paper [EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models](https://huggingface.co/papers/2603.12252).

	## 📝TODO

	- [x] Open source the training code
	- [ ] Open source the training data
	- [x] Open source the main task ckpt
	- [ ] Open source the edit model ckpt
	- [ ] Refactor the codebase for better usability and maintainability

	## 📰News

	- 🚀 [2026/3/12] We have released the EndoCoT [repository](https://github.com/InternLM/EndoCoT) and [ckpts](https://huggingface.co/internlm/EndoCoT).

	## 🌟Highlight

	![main](fig/main.jpg)

	- EndoCoT is a reasoning paradigm for diffusion models that enables step-by-step inference. It outperforms conventional training methods on Qwen-Image-Edit-2511.

	![exp](fig/exp.png)

	- And provide transparent, intermediate reasoning trajectories.

	![case](fig/case.jpg)

	## ⚡Quick Start

	### Setup environment

	```bash
	git clone https://github.com/InternLM/EndoCoT
	cd EndoCoT
	conda create -n EndoCoT python=3.10
	conda activate EndoCot
	# Please install the version of torch compatible with your machine.
	pip install -r requirements.txt
	# Please install the version of vLLM compatible with your machine.
	```

	### Inference

	1. Download the ckpt:

	- You may find our pretrained weights at: [EndoCoT](https://huggingface.co/InternLM/EndoCoT)

	> Following the configuration of [Diffthinker](https://github.com/lcqysl/DiffThinker), we provide a customized checkpoint for Qwen-Image-Edit. This checkpoint has been merged from the original `safetensors` to ensure compatibility with[Diffsynth-Studio](https://github.com/modelscope/DiffSynth-Studio) training. Please use the checkpoint provided in this repository instead of the official version for correct loading and inference.

	2. Test Single Case

	```bash
	cd test
	python test.py \
	--task Maze \
	--model_root /path/to/merged_ckpts \
	--lora_path /path/to/your_lora_weight.safetensors \
	--input_image ./data/sudoku_sample.png \
	--output_dir ./outputs/sudoku_results
	```

	3. Eval Our Ckpt

	> We follow the exact same setting as [Diffthinker](https://github.com/lcqysl/DiffThinker)

	```bash
	cd Maze
	bash eval/gen_and_parse.sh
	bash eval/eval_path.sh
	```

	### Training

	1. Download the datasets & metadata.csv

	- You may find our training data at: [EndoCoT dataset](https://huggingface.co/datasets/internlm/EndoCoT-Data)

	> Since the metadata uses relative paths, please ensure the dataset files are placed in the same directory as `metadata.csv`

	2. Train your model

	```bash
	cd DiffSynth-Studio
	bash add/Maze/stage1.sh
	python change_ckpt_prefix.py --src /path/to/the/Maze/save/dir/Maze_stage1
	bash add/Maze/stage2.sh
	python change_ckpt_prefix.py --src /path/to/the/Maze/save/dir/Maze_stage2
	```

	### How to change the latent reasoning steps?

	> Note on Customization: Since the current implementation is straightforward, you can only manually adjust the latent reasoning steps in `DiffSynth-Studio/diffsynth/pipelines/qwen_image.py`:
	>
	> - Line 442: Modify `infer_steps`.
	> - Line 471: Modify `training_steps`.
	>
	> ##### We plan to optimize this in future releases.

	```python
	def encode_prompt_edit(self, pipe: QwenImagePipeline, prompt, edit_image, is_final, gt_prompt=None, idx=None):

	drop_idx = 64
	if type(prompt[0])==str:
	template = "<\|im_start\|>system
	Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.<\|im_end\|>
	<\|im_start\|>user
	<\|vision_start\|><\|image_pad\|><\|vision_end\|>{}<\|im_end\|>
	<\|im_start\|>assistant
	"
	txt = template.format(prompt[0])
	model_inputs = pipe.processor(text=txt, images=edit_image, padding=True, return_tensors="pt").to(pipe.device)
	embedding_layers = pipe.text_encoder.model.language_model.get_input_embeddings()
	with torch.no_grad():
	inputs_embeds = embedding_layers(model_inputs.input_ids)
	self.attention_mask = model_inputs.attention_mask
	self.pixel_values = model_inputs.pixel_values
	self.image_grid_thw = model_inputs.image_grid_thw
	else:
	inputs_embeds= prompt[0]

	# dxl: test use
	if is_final==None or idx!=None:
	print("现在在inference。或者stage2训练")
	if idx!=None:
	iter_times = idx-2
	else:
	# infer step
	iter_times = 50

	with torch.no_grad():
	inputs_embeds = self.manual_generate_eval(
	pipe,
	inputs_embeds=inputs_embeds,
	max_new_tokens=iter_times,
	).detach()

	# dxl: only update the last 2 tokens
	if idx!=None:
	inputs_embeds = self.manual_generate_eval(
	pipe,
	inputs_embeds=inputs_embeds,
	max_new_tokens=2,
	)

	generated_embeds = inputs_embeds

	... ...

	# dxl：training
	if is_final!=None and idx==None:
	try:
	generated_embeds, _ = self.manual_generate(
	pipe,
	inputs_embeds=inputs_embeds,
	is_final=is_final,
	# training steps
	max_new_tokens=2,
	)
	except Exception as e:
	print(f"Error!: {type(e).__name__} - {e}")
	print(inputs_embeds.shape)
	assert False

	try:
	return split_hidden_states, generated_embeds, eos_loss
	except:
	print(f"[WARNING] Prompt was not updated correctly for inference.")
	return split_hidden_states
	```

	## 📖 Citation

	```bibtex
	@article{dai2026endocot,
	title={EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models},
	author={Dai, Xuanlang and Zhou, Yujie and Xing, Long and Bu, Jiazi and Wei, Xilin and Liu, Yuhong and Zhang, Beichen and Chen, Kai and Zang, Yuhang},
	journal={arXiv preprint arXiv:2603.12252},
	year={2026}
	}
	```

	## ⚖️ License

	![Code License](https://img.shields.io/badge/Code%20License-MIT-green.svg) ![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)