---
license: mit
tags:
- text-to-image
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---

Bagel Labs

# Paris: A Decentralized Trained Open-Weight Diffusion Model

The world's first open-weight diffusion model trained entirely through decentralized computation. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization, achieving superior parallelism efficiency over traditional methods while using 14× less data and 16× less compute than baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.

# Key Characteristics

- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- 11M LAION-Aesthetic images across 120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under the MIT license

---

# Examples

![Paris Generation Examples](images/generated_images.png)
*Text-conditioned image generation samples from Paris across diverse prompts and visual styles*

---

# Architecture Details

| Component | Specification |
|-----------|---------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~129M |
| **Hidden Dimension** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |

---

# Training Approach

Paris implements fully decentralized training in which:

- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering)
- No gradients, parameters, or activations are exchanged between experts during training
- Experts train asynchronously, at different speeds, across AWS, GCP, local clusters, and Runpod instances
- The router is trained post hoc on the full dataset to select experts at inference
- Complete computational independence eliminates the need for specialized interconnects (InfiniBand, NVLink)

![Training Architecture](images/training_architecture.png)
*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (data/pipeline/model parallelism), Paris requires zero communication during training.*

This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster that traditional diffusion model training requires. Two minimal sketches of this setup, the data-partitioning step and the post-hoc router training, follow below.
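To make the partitioning step concrete, here is a minimal sketch, assuming DINOv2 ViT-B/14 features and k-means with k = 8. The helper names are hypothetical; this is an illustration of the idea, not the exact pipeline from the technical report.

```python
# Minimal sketch of the zero-communication setup: partition the dataset into
# 8 semantically coherent shards with DINOv2 features + k-means, then let each
# expert train on its own shard in isolation. Helper names are hypothetical.
import torch
from sklearn.cluster import KMeans

NUM_EXPERTS = 8

@torch.no_grad()
def dinov2_features(images: torch.Tensor) -> torch.Tensor:
    """Embed a batch of normalized 224x224 RGB images with DINOv2 ViT-B/14."""
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
    return model(images)  # (N, 768) pooled features

def partition_dataset(images: torch.Tensor) -> list[list[int]]:
    """Return NUM_EXPERTS lists of image indices, one shard per expert."""
    feats = dinov2_features(images).cpu().numpy()
    labels = KMeans(n_clusters=NUM_EXPERTS, n_init=10).fit_predict(feats)
    return [[i for i, c in enumerate(labels) if c == k] for k in range(NUM_EXPERTS)]

# Each shard can then be shipped to a different cluster; experts never exchange
# gradients, parameters, or activations, so a slow worker cannot block the rest.
```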
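The post-hoc router training reduces to a classification problem: given a noisy latent and timestep, predict which cluster (and hence which expert) it belongs to, using the cross-entropy objective listed under Training Details below. A minimal sketch; `RouterDiT` here is a toy stand-in for the actual DiT-B router.

```python
# Minimal sketch: post-hoc router training with cross-entropy on cluster labels.
# `RouterDiT` is a toy stand-in; the real router is a DiT-B transformer that
# conditions on the noisy latent and diffusion timestep.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS = 8

class RouterDiT(nn.Module):
    """Toy router: flattens the noisy latent and predicts an expert index."""
    def __init__(self, latent_dim: int = 4 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, NUM_EXPERTS),
        )

    def forward(self, noisy_latent: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        x = torch.cat([noisy_latent.flatten(1), t[:, None].float()], dim=1)
        return self.net(x)  # (B, NUM_EXPERTS) logits

def router_step(router, optimizer, noisy_latent, t, cluster_label):
    """One optimization step: predict the cluster that produced each latent."""
    logits = router(noisy_latent, t)
    loss = F.cross_entropy(logits, cluster_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, the router's softmax over these logits drives the routing strategies described in the next section.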
**Comparison with Traditional Parallelization**

| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|----------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |

---

### Routing Strategies

- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of the top-2 experts. Often the best quality, at 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by the router. Highest compute (8× cost).

![Paris Inference Pipeline](images/paris_inference.png)
*Multi-expert inference pipeline showing router-based expert selection and the three routing strategies: top-1 (fastest), top-2 (best quality), and full ensemble (highest compute).*

---

# Performance Metrics

**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**

| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |

*Top-2 routing achieves a 7.04 FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.*

---

# Training Details

**Hyperparameters (DiT-XL/2)**

| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |

**Router Training**

| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than the experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post hoc on the full dataset |

---

# Citation

```bibtex
@misc{jiang2025paris,
  title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
  author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
  year={2025},
  eprint={2510.03434},
  archivePrefix={arXiv},
  primaryClass={cs.GR},
  url={https://arxiv.org/abs/2510.03434}
}
```

---

# License

MIT License – Open for research and commercial use.

Made with ❤️ by Bagel Labs. Follow Bagel Labs on Twitter.