---
license: mit
---

# Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

## Overview

<p align="center">
<img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/teaser_v5.png" width="95%">
</p>

<div align="center" style="max-width:900px; text-align:justify; font-size:14px; line-height:1.5;">
<p>
<strong>(a) Overview of Semantic-First Diffusion (SFD).</strong>
Semantics (dashed curve) and textures (solid curve) follow asynchronous denoising trajectories.
SFD operates in three stages:
<span style="color:#d62728;">Stage I: Semantic initialization</span>, where semantic latents denoise first;
<span style="color:#4472c4;">Stage II: Asynchronous generation</span>, where semantics and textures denoise jointly but asynchronously, with semantics ahead of textures;
<span style="color:#2ca02c;">Stage III: Texture completion</span>, where only textures continue refining.
After denoising, the generated semantic latent <b>s</b> is discarded, and the final image is decoded solely from the texture latent <b>z</b>.
<strong>(b) Training convergence on ImageNet 256×256 without guidance.</strong>
SFD converges approximately <b>100×</b> and <b>33.3×</b> faster than DiT-XL/2 and LightningDiT-XL/1, respectively.
</p>
</div>
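
The three stages above can be illustrated with two clocks, one per latent. The following is a minimal sketch under stated assumptions, not the authors' implementation: time is normalized so that 0 means pure noise and 1 means fully denoised, a single global clock `u` runs linearly from 0 to 1 + Δt, and the texture clock simply trails the semantic clock by Δt. The function name `async_timesteps` and the linear schedule are hypothetical choices made for illustration only.

```python
from typing import List, Tuple


def async_timesteps(num_steps: int, delta_t: float = 0.3) -> List[Tuple[float, float, str]]:
    """Return (t_sem, t_tex, stage) for each point of a hypothetical schedule.

    Assumed convention: t = 0 is pure noise, t = 1 is fully denoised.
    The semantic clock leads the texture clock by at most `delta_t`.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0.0 <= delta_t <= 1.0:
        raise ValueError("delta_t must lie in [0, 1]")

    schedule = []
    for i in range(num_steps + 1):
        # Global clock covering all three stages.
        u = (1.0 + delta_t) * i / num_steps
        # Semantics start immediately and finish once u reaches 1.
        t_sem = min(u, 1.0)
        # Textures start delta_t later and finish at u = 1 + delta_t.
        t_tex = min(max(u - delta_t, 0.0), 1.0)

        if t_sem < 1.0 and t_tex <= 0.0:
            stage = "I"    # semantic initialization: only semantics denoise
        elif t_sem < 1.0:
            stage = "II"   # asynchronous generation: both denoise, semantics ahead
        else:
            stage = "III"  # texture completion: only textures keep refining
        schedule.append((t_sem, t_tex, stage))
    return schedule


if __name__ == "__main__":
    for t_sem, t_tex, stage in async_timesteps(num_steps=13, delta_t=0.3):
        print(f"stage {stage:>3}  t_sem={t_sem:.2f}  t_tex={t_tex:.2f}")
```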

---

## ✨ Highlights

- We propose **Semantic-First Diffusion (SFD)**, a novel latent diffusion paradigm that performs asynchronous denoising on semantic and texture latents, allowing semantics to denoise earlier and subsequently guide texture generation.
- **SFD achieves a state-of-the-art FID of 1.04** on ImageNet 256×256 generation.
- SFD exhibits **100×** and **33.3× faster** training convergence compared to **DiT** and **LightningDiT**, respectively.

---

## 🧪 Quantitative Results

Explicitly **leading semantics ahead of textures with a moderate offset (Δt = 0.3)** achieves an optimal balance between early semantic stabilization and texture collaboration, effectively harmonizing their joint modeling.

<p align="center">
<img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/fid_vs_delta_t.png" width="50%">
</p>

### With AutoGuidance

| Model | Epochs | #Params | FID (NPU) |
|:--------|:-------:|:--------:|:----------:|
| SFD-XL | 80 | 675M | 1.30 |
| SFD-XL | 800 | 675M | **1.06** |
| SFD-XXL | 80 | 1.0B | 1.19 |
| SFD-XXL | 800 | 1.0B | **1.04** |

---

## Visual Results

<p align="center">
<img src="https://raw.githubusercontent.com/yuemingPAN/SFD/main/images/demo_Sample.png" width="90%">
</p>

---

## Links

- **Project Page:** [https://yuemingpan.github.io/SFD.github.io/](https://yuemingpan.github.io/SFD.github.io/)
- **Paper (arXiv):** [https://arxiv.org/pdf/2512.04926](https://arxiv.org/pdf/2512.04926)
- 💾 **Code:** [https://github.com/yuemingPAN/SFD](https://github.com/yuemingPAN/SFD)
- 🧰 **License:** MIT

---

## 🧩 Citation

```bibtex
@article{Pan2025SFD,
  title={Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion},
  author={Pan, Yueming and Feng, Ruoyu and Dai, Qi and Wang, Yuqi and Lin, Wenfeng and Guo, Mingyu and Luo, Chong and Zheng, Nanning},
  journal={arXiv preprint arXiv:2512.04926},
  year={2025}
}
```