---
license: mit
tags:
- text-to-image
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---
<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/>
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1>
<a href="https://huggingface.co/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40">
</a>
<a href="https://github.com/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40">
</a>
<a href="https://github.com/bageldotcom/paris/blob/main/paper.pdf" target="_blank">
<img src="https://img.shields.io/badge/📄_READ_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Technical Report" height="40">
</a>
<div style="margin-top: 20px;"></div>
Paris is the world's first open-weight diffusion model trained entirely through decentralized computation. It consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no synchronization of gradients, parameters, or intermediate activations. This design parallelizes more efficiently than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.
# Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
---
# Examples

*Text-conditioned image generation samples using Paris across diverse prompts and visual styles*
---
# Architecture Details
| Component | Specification |
|-----------|--------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~129M |
| **Hidden Dimensions** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |
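For reference, the table above maps onto a configuration roughly like the one below. This is an illustrative sketch only; the field names and the dataclass itself are ours, not the repository's API (the Hugging Face model ids correspond to the CLIP ViT-L/14 and sd-vae-ft-mse entries in the table).

```python
# Illustrative sketch of a DiT-XL/2 expert configuration, mirroring the table above.
# Field names are hypothetical and not part of the released codebase.
from dataclasses import dataclass

@dataclass
class ExpertConfig:
    hidden_dim: int = 1152          # transformer width
    depth: int = 28                 # transformer layers
    num_heads: int = 16             # attention heads
    patch_size: int = 2             # 2x2 patches in latent space
    latent_size: int = 32           # 32x32 latent grid
    latent_channels: int = 4        # VAE latent channels
    image_size: int = 256           # pixel resolution (8x VAE downsampling)
    text_encoder: str = "openai/clip-vit-large-patch14"   # CLIP ViT-L/14
    vae: str = "stabilityai/sd-vae-ft-mse"

config = ExpertConfig()
```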
---
# Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)

*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.*
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
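A minimal sketch of how such a partition could be produced, assuming DINOv2 embeddings clustered with k-means into one partition per expert. The DINOv2 variant, preprocessing, and clustering settings here are assumptions for illustration, not the released training pipeline:

```python
# Sketch of DINOv2-based data partitioning (assumed pipeline): embed each image
# with DINOv2, run k-means into 8 clusters, assign each cluster to one expert.
import torch
from sklearn.cluster import KMeans

# dinov2_vitb14 is an illustrative choice of backbone, not necessarily the one used.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), ImageNet-normalized -> (N, D) DINOv2 features."""
    return dinov2(images)

def partition(embeddings: torch.Tensor, num_experts: int = 8):
    """Cluster image embeddings; labels[i] is the expert/cluster index for image i."""
    kmeans = KMeans(n_clusters=num_experts, random_state=0, n_init=10)
    return kmeans.fit_predict(embeddings.cpu().numpy())
```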
**Comparison with Traditional Parallelization**
| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|---------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |
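To make the contrast concrete, the sketch below compares the communication a data-parallel step requires against a Paris expert step, which is purely local. This is conceptual illustration, not project code:

```python
# Conceptual contrast: data parallelism all-reduces gradients every iteration,
# so the slowest worker blocks all others; a Paris expert step has no collective.
import torch.distributed as dist

def data_parallel_step(model, loss, optimizer):
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # blocks on slowest worker
        p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()

def paris_expert_step(model, loss, optimizer):
    loss.backward()        # purely local
    optimizer.step()       # no communication anywhere in the training loop
    optimizer.zero_grad()
```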
---
### Routing Strategies
- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by router. Highest compute (8× cost).

*Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).*
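The sketch below shows how the three strategies could combine per-expert noise predictions given router scores. The function, shapes, and names are illustrative, not the repository's inference API:

```python
# Illustrative routing: the router scores all 8 experts, then the prediction is
# taken from the top-1 expert, a renormalized top-2 blend, or the full ensemble.
import torch

def route(router_logits: torch.Tensor, expert_eps: torch.Tensor, strategy: str = "top-1"):
    """router_logits: (E,) expert scores; expert_eps: (E, C, H, W) per-expert predictions."""
    probs = router_logits.softmax(dim=0)
    if strategy == "top-1":
        return expert_eps[probs.argmax()]
    if strategy == "top-2":
        w, idx = probs.topk(2)
        w = w / w.sum()                                        # renormalize top-2 weights
        return (w[:, None, None, None] * expert_eps[idx]).sum(dim=0)
    if strategy == "full-ensemble":
        return (probs[:, None, None, None] * expert_eps).sum(dim=0)
    raise ValueError(f"unknown strategy: {strategy}")
```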
---
# Performance Metrics
**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |
*Top-2 routing achieves 7.04 FID improvement over monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.*
---
# Training Details
**Hyperparameters (DiT-XL/2)**
| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
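As a sketch of how these settings fit together in one expert's isolated training loop. The function, `expert`, `ema_expert`, `loader`, and `diffusion_loss` are placeholders for illustration, not names from the released code; only the optimizer, learning rate, accumulation, EMA decay, and FP16 scaling follow the table:

```python
# Illustrative single-expert training loop: AdamW at 2e-5 with no schedule,
# FP16 autocast with loss scaling, 2-step accumulation, EMA decay 0.9999.
import torch

def train_expert(expert, ema_expert, loader, diffusion_loss,
                 accum=2, ema_decay=0.9999):
    optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)  # no LR scheduling
    scaler = torch.cuda.amp.GradScaler()                         # automatic loss scaling
    for step, (latents, text_emb) in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = diffusion_loss(expert, latents, text_emb) / accum
        scaler.scale(loss).backward()
        if (step + 1) % accum == 0:          # effective batch = 2 x micro-batch
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            with torch.no_grad():            # EMA copy of the expert weights
                for p, ema_p in zip(expert.parameters(), ema_expert.parameters()):
                    ema_p.mul_(ema_decay).add_(p, alpha=1 - ema_decay)
    # Note: no all-reduce, broadcast, or parameter exchange appears anywhere.
```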
**Router Training**
| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
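A corresponding sketch of post-hoc router training under these settings. The router's inputs (noisy latents and timesteps) and the AdamW optimizer are assumptions; the cross-entropy loss on cluster labels, learning rate, cosine schedule, and accumulation follow the table:

```python
# Illustrative post-hoc router training: cross-entropy on cluster assignments,
# lr 5e-5 with cosine annealing over 25 epochs, 4-step gradient accumulation.
import torch
import torch.nn.functional as F

def train_router(router, loader, epochs=25, accum=4, lr=5e-5, device="cuda"):
    opt = torch.optim.AdamW(router.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for epoch in range(epochs):
        for step, (noisy_latents, timesteps, cluster_labels) in enumerate(loader):
            logits = router(noisy_latents.to(device), timesteps.to(device))
            loss = F.cross_entropy(logits, cluster_labels.to(device)) / accum
            loss.backward()
            if (step + 1) % accum == 0:      # effective batch = 4 x 64 = 256
                opt.step()
                opt.zero_grad()
        sched.step()                          # cosine schedule stepped per epoch
    return router
```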
---
# Citation
```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```
---
# License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>