	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3-VL-8B-Instruct
	- Wan-AI/Wan2.1-T2V-1.3B
	pipeline_tag: image-text-to-image
	tags:
	- image-to-image
	- image-editing
	- diffusion
	- computer-vision
	- spatial-editing
	- vision-language
	library_name: transformers
	---

	# SpatialEdit-16B

	SpatialEdit-16B is a research model for fine-grained spatial image editing. It is designed to follow spatial instructions such as object translation, object rotation, and camera-centric edits while preserving scene realism and subject identity as far as possible.

	This model is released as part of the SpatialEdit project:

	- Paper: [SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing](https://arxiv.org/pdf/2604.04911)
	- Code: [SpatialEdit GitHub Repository](https://github.com/EasonXiao-888/SpatialEdit)
	- Training Data: [SpatialEdit-500K](https://huggingface.co/datasets/EasonXiao-888/SpatialEdit-500K)
	- Benchmark: [SpatialEdit-Bench](https://huggingface.co/datasets/EasonXiao-888/SpatialEdit-Bench)

	https://cdn-uploads.huggingface.co/production/uploads/656a12a3d848a6683a6dfb9e/uMD0fka9fN5iBfSNgmDsj.mp4

	## Highlights

	- Fine-grained spatial editing from an input image and instruction
	- Supports object-centric and camera-centric manipulations
	- Trained with the SpatialEdit-500K synthetic data engine
	- Evaluated with SpatialEdit-Bench for both plausibility and geometric faithfulness

	## Overview

	SpatialEdit focuses on spatially grounded image editing. Instead of only changing appearance or style, the model aims to edit geometric attributes of a scene, including:

	- object movement
	- object rotation
	- camera-trajectory editing

	### Task Definition

	<p align="center">
	<img src="assets/task_definition.png" alt="SpatialEdit task definition" width="95%">
	</p>

	Task definition of fine-grained image spatial editing.

	## Application Gallery

	### 3D Point Control

	<p align="center">
	<img src="assets/application/3dpoint/01.gif" width="23%" alt="3D point control example 1" />
	<img src="assets/application/3dpoint/02.gif" width="23%" alt="3D point control example 2" />
	<img src="assets/application/3dpoint/11.gif" width="23%" alt="3D point control example 3" />
	<img src="assets/application/3dpoint/12.gif" width="23%" alt="3D point control example 4" />
	</p>

	The first and third examples show sparse-view point observations. The second and fourth examples illustrate how SpatialEdit can synthesize richer spatial observations from limited inputs.

	### Camera Trajectory Editing

	<p align="center">
	<img src="assets/application/camera/input.png" width="31%" alt="Camera editing input" />
	<img src="assets/application/camera/output.png" width="31%" alt="Camera editing output" />
	<img src="assets/application/camera/video.gif" width="31%" alt="Camera editing transition video" />
	</p>

	Left: input image. Middle: edited target view generated by SpatialEdit. Right: a camera-transition video synthesized from the spatially edited endpoint.

	### Object Translation

	<p align="center">
	<img src="assets/application/moving/input.png" width="31%" alt="Object translation input" />
	<img src="assets/application/moving/output.png" width="31%" alt="Object translation output" />
	<img src="assets/application/moving/video.gif" width="31%" alt="Object translation transition video" />
	</p>

	Left: input image. Middle: translated target result generated by SpatialEdit. Right: an interpolated motion sequence built from the edited endpoint.

	### Object Rotation

	<p align="center">
	<img src="assets/application/rotation/input.png" width="31%" alt="Object rotation input" />
	<img src="assets/application/rotation/output.png" width="31%" alt="Object rotation output" />
	<img src="assets/application/rotation/video.gif" width="31%" alt="Object rotation transition video" />
	</p>

	Left: input image. Middle: rotated target result generated by SpatialEdit. Right: a smooth transition sequence derived from the edited result.

	## Required External Checkpoints

	Before running inference, please download the following dependencies:

	- [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
	- [Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B), including `Wan2.1_VAE.pth`
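
One possible way to fetch these two dependencies is sketched below. It is a minimal sketch, not part of the official codebase: it assumes the `huggingface_hub` package is installed, uses its `snapshot_download` function, and places each repository under a `model/` folder as in the local layout recommended later in this README. The helper names `planned_downloads` and `download_all` are hypothetical.

```python
from pathlib import Path

# Repo IDs come from the dependency list above. The local folder names are an
# assumption that mirrors the recommended model/ layout in this README.
EXTERNAL_CHECKPOINTS = {
    "Qwen/Qwen3-VL-8B-Instruct": "Qwen3-VL-8B-Instruct",
    "Wan-AI/Wan2.1-T2V-1.3B": "Wan2.1-T2V-1.3B",
}


def planned_downloads(base_path):
    """Return (repo_id, local_dir) pairs under <base_path>/model/."""
    model_root = Path(base_path) / "model"
    return [
        (repo_id, model_root / folder)
        for repo_id, folder in EXTERNAL_CHECKPOINTS.items()
    ]


def download_all(base_path):
    """Download every external checkpoint (requires network access)."""
    # Imported lazily so planned_downloads() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in planned_downloads(base_path):
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


if __name__ == "__main__":
    # Print the plan only; call download_all("your_base_path") to fetch files.
    for repo_id, local_dir in planned_downloads("your_base_path"):
        print(f"{repo_id} -> {local_dir}")
```

Calling `download_all("your_base_path")` fetches each full repository, which should also bring `Wan2.1_VAE.pth` along with the Wan2.1 files.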

	## Repository Contents

	This model repository is expected to store the checkpoints used by the official codebase. A typical layout is:

	```bash
	SpatialEdit_CKPT/
	├── CKPT_PT.pth
	└── CKPT_CT_lora/
	```

	- `CKPT_PT.pth`: full DiT checkpoint
	- `CKPT_CT_lora/`: LoRA checkpoint used for spatial editing

	If your local filenames differ, update the paths in the provided scripts accordingly.

	A recommended local directory structure is:

	```bash
	your_base_path/
	├── SpatialEdit_CKPT/
	│   ├── CKPT_PT.pth
	│   └── CKPT_CT_lora/
	└── model/
	    ├── Qwen3-VL-8B-Instruct/
	    └── Wan2.1-T2V-1.3B/
	        └── Wan2.1_VAE.pth
```
	```
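
Before launching inference, it can help to confirm that the directory structure above is in place. The following is a minimal sketch using only the Python standard library. The `missing_paths` helper is hypothetical and not part of the official scripts, and it assumes `Wan2.1_VAE.pth` sits directly inside the `Wan2.1-T2V-1.3B` folder.

```python
from pathlib import Path

# Relative paths taken from the recommended layout in this README.
# Assumption: Wan2.1_VAE.pth sits directly inside Wan2.1-T2V-1.3B/.
EXPECTED_PATHS = [
    "SpatialEdit_CKPT/CKPT_PT.pth",
    "SpatialEdit_CKPT/CKPT_CT_lora",
    "model/Qwen3-VL-8B-Instruct",
    "model/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth",
]


def missing_paths(base_path):
    """Return the expected relative paths that do not exist under base_path."""
    base = Path(base_path)
    return [rel for rel in EXPECTED_PATHS if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("your_base_path")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```

The check only tests for existence; it does not validate checkpoint contents, so a truncated download would still pass.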

	## Quick Start

	The [SpatialEdit GitHub Repository](https://github.com/EasonXiao-888/SpatialEdit) provides a simple local demo script.

	## Citation

	If you find this project useful, please cite the SpatialEdit paper.

	```bibtex
	@misc{spatialedit,
	  title={SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing},
	  author={Yicheng Xiao and Wenhu Zhang and Lin Song and Yukang Chen and Wenbo Li and Nan Jiang and Tianhe Ren and Haokun Lin and Wei Huang and Haoyang Huang and Xiu Li and Nan Duan and Xiaojuan Qi},
	  year={2026},
	  eprint={2604.04911},
	  archivePrefix={arXiv}
	}
	```

	If an official citation is published later, please use it in place of the entry above.

	## Acknowledgement

	This project builds upon several excellent open-source efforts. We sincerely thank:

	- [ReCamMaster](https://github.com/KlingAIResearch/ReCamMaster)
	- [TexVerse](https://github.com/yiboz2001/TexVerse)
	- [JoyAI-Image](https://github.com/jd-opensource/JoyAI-Image)

	We also thank the contributors and collaborators who supported the development of SpatialEdit.