---
license: mit
tags:
- text-to-image
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---
<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/>
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1>
<a href="https://huggingface.co/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40">
</a>
<a href="https://github.com/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40">
</a>
<a href="https://github.com/bageldotcom/paris/blob/main/paper.pdf" target="_blank">
<img src="https://img.shields.io/badge/📄_READ_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Technical Report" height="40">
</a>
<div style="margin-top: 20px;"></div>
Paris is the world's first open-weight diffusion model trained entirely through decentralized computation. It consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no synchronization of gradients, parameters, or intermediate activations. This design parallelizes more efficiently than traditional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.
# Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
---
# Examples

*Text-conditioned image generation samples using Paris across diverse prompts and visual styles*
---
# Architecture Details
| Component | Specification |
|-----------|--------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~129M |
| **Hidden Dimensions** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |
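For reference, the table above maps onto a configuration roughly like the one below. This is an illustrative sketch only; the field names and the dataclass itself are ours, not the repository's API (the Hugging Face model ids correspond to the CLIP ViT-L/14 and sd-vae-ft-mse entries in the table).

```python
# Illustrative sketch of a DiT-XL/2 expert configuration, mirroring the table above.
# Field names are hypothetical and not part of the released codebase.
from dataclasses import dataclass

@dataclass
class ExpertConfig:
    hidden_dim: int = 1152          # transformer width
    depth: int = 28                 # transformer layers
    num_heads: int = 16             # attention heads
    patch_size: int = 2             # 2x2 patches in latent space
    latent_size: int = 32           # 32x32 latent grid
    latent_channels: int = 4        # VAE latent channels
    image_size: int = 256           # pixel resolution (8x VAE downsampling)
    text_encoder: str = "openai/clip-vit-large-patch14"   # CLIP ViT-L/14
    vae: str = "stabilityai/sd-vae-ft-mse"

config = ExpertConfig()
```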
---
# Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)

*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.*
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training.
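A minimal sketch of how such a partition could be produced, assuming DINOv2 embeddings clustered with k-means into one partition per expert. The DINOv2 variant, preprocessing, and clustering settings here are assumptions for illustration, not the released training pipeline:

```python
# Sketch of DINOv2-based data partitioning (assumed pipeline): embed each image
# with DINOv2, run k-means into 8 clusters, assign each cluster to one expert.
import torch
from sklearn.cluster import KMeans

# dinov2_vitb14 is an illustrative choice of backbone, not necessarily the one used.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), ImageNet-normalized -> (N, D) DINOv2 features."""
    return dinov2(images)

def partition(embeddings: torch.Tensor, num_experts: int = 8):
    """Cluster image embeddings; labels[i] is the expert/cluster index for image i."""
    kmeans = KMeans(n_clusters=num_experts, random_state=0, n_init=10)
    return kmeans.fit_predict(embeddings.cpu().numpy())
```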
**Comparison with Traditional Parallelization**
| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|---------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |
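To make the contrast concrete, the sketch below compares the communication a data-parallel step requires against a Paris expert step, which is purely local. This is conceptual illustration, not project code:

```python
# Conceptual contrast: data parallelism all-reduces gradients every iteration,
# so the slowest worker blocks all others; a Paris expert step has no collective.
import torch.distributed as dist

def data_parallel_step(model, loss, optimizer):
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # blocks on slowest worker
        p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()

def paris_expert_step(model, loss, optimizer):
    loss.backward()        # purely local
    optimizer.step()       # no communication anywhere in the training loop
    optimizer.zero_grad()
```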
---
### Routing Strategies
- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by router. Highest compute (8× cost).

*Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).*
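The sketch below shows how the three strategies could combine per-expert noise predictions given router scores. The function, shapes, and names are illustrative, not the repository's inference API:

```python
# Illustrative routing: the router scores all 8 experts, then the prediction is
# taken from the top-1 expert, a renormalized top-2 blend, or the full ensemble.
import torch

def route(router_logits: torch.Tensor, expert_eps: torch.Tensor, strategy: str = "top-1"):
    """router_logits: (E,) expert scores; expert_eps: (E, C, H, W) per-expert predictions."""
    probs = router_logits.softmax(dim=0)
    if strategy == "top-1":
        return expert_eps[probs.argmax()]
    if strategy == "top-2":
        w, idx = probs.topk(2)
        w = w / w.sum()                                        # renormalize top-2 weights
        return (w[:, None, None, None] * expert_eps[idx]).sum(dim=0)
    if strategy == "full-ensemble":
        return (probs[:, None, None, None] * expert_eps).sum(dim=0)
    raise ValueError(f"unknown strategy: {strategy}")
```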
---
# Performance Metrics
**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |
*Top-2 routing achieves 7.04 FID improvement over monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.*
---
# Training Details
**Hyperparameters (DiT-XL/2)**
| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
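As a sketch of how these settings fit together in one expert's isolated training loop. The function, `expert`, `ema_expert`, `loader`, and `diffusion_loss` are placeholders for illustration, not names from the released code; only the optimizer, learning rate, accumulation, EMA decay, and FP16 scaling follow the table:

```python
# Illustrative single-expert training loop: AdamW at 2e-5 with no schedule,
# FP16 autocast with loss scaling, 2-step accumulation, EMA decay 0.9999.
import torch

def train_expert(expert, ema_expert, loader, diffusion_loss,
                 accum=2, ema_decay=0.9999):
    optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)  # no LR scheduling
    scaler = torch.cuda.amp.GradScaler()                         # automatic loss scaling
    for step, (latents, text_emb) in enumerate(loader):
        with torch.cuda.amp.autocast():
            loss = diffusion_loss(expert, latents, text_emb) / accum
        scaler.scale(loss).backward()
        if (step + 1) % accum == 0:          # effective batch = 2 x micro-batch
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            with torch.no_grad():            # EMA copy of the expert weights
                for p, ema_p in zip(expert.parameters(), ema_expert.parameters()):
                    ema_p.mul_(ema_decay).add_(p, alpha=1 - ema_decay)
    # Note: no all-reduce, broadcast, or parameter exchange appears anywhere.
```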
**Router Training**
| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
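A corresponding sketch of post-hoc router training under these settings. The router's inputs (noisy latents and timesteps) and the AdamW optimizer are assumptions; the cross-entropy loss on cluster labels, learning rate, cosine schedule, and accumulation follow the table:

```python
# Illustrative post-hoc router training: cross-entropy on cluster assignments,
# lr 5e-5 with cosine annealing over 25 epochs, 4-step gradient accumulation.
import torch
import torch.nn.functional as F

def train_router(router, loader, epochs=25, accum=4, lr=5e-5, device="cuda"):
    opt = torch.optim.AdamW(router.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for epoch in range(epochs):
        for step, (noisy_latents, timesteps, cluster_labels) in enumerate(loader):
            logits = router(noisy_latents.to(device), timesteps.to(device))
            loss = F.cross_entropy(logits, cluster_labels.to(device)) / accum
            loss.backward()
            if (step + 1) % accum == 0:      # effective batch = 4 x 64 = 256
                opt.step()
                opt.zero_grad()
        sched.step()                          # cosine schedule stepped per epoch
    return router
```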
---
# Citation
```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```
---
# License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>