Bagel Labs
committed on
Update README.md
README.md
CHANGED
@@ -31,11 +31,11 @@ The world's first diffusion model trained entirely through decentralized computation
 
 - 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
 - No gradient synchronization, parameter sharing, or activation exchange among nodes during training
-- Lightweight transformer router (~
+- Lightweight transformer router (~129M parameters) for dynamic expert selection
 - 11M LAION-Aesthetic images across 120 A40 GPU-days
 - 14× less training data than prior decentralized baselines
 - 16× less compute than prior decentralized baselines
-- Competitive generation quality (FID 12.45)
+- Competitive generation quality (FID 12.45 on DiTExpert XL/2)
 - Open weights for research and commercial use under MIT license
 
 ---
@@ -55,7 +55,7 @@ The world's first diffusion model trained entirely through decentralized computation
 | **Model Scale** | DiT-XL/2 |
 | **Parameters per Expert** | 605M |
 | **Total Expert Parameters** | 4.84B (8 experts) |
-| **Router Parameters** | ~
+| **Router Parameters** | ~129M |
 | **Hidden Dimensions** | 1152 |
 | **Transformer Layers** | 28 |
 | **Attention Heads** | 16 |
@@ -94,31 +94,6 @@ This zero-communication approach enables training on fragmented compute resources
 
 ---
 
-# Usage
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-# Load the pipeline
-pipeline = DiffusionPipeline.from_pretrained(
-    "bageldotcom/paris",
-    torch_dtype=torch.float16,
-    use_safetensors=True
-)
-pipeline.to("cuda")
-
-# Generate images
-images = pipeline(
-    prompt="A beautiful sunset over Paris, oil painting style",
-    num_inference_steps=50,
-    guidance_scale=7.5,
-    height=256,
-    width=256
-).images
-
-images[0].save("output.png")
-```
 
 ### Routing Strategies
 
@@ -156,7 +131,6 @@ images[0].save("output.png")
 | Training Steps | ~120k total across experts (asynchronous) |
 | EMA Decay | 0.9999 |
 | Mixed Precision | FP16 with automatic loss scaling |
-| Initialization | ImageNet-pretrained DiT-XL/2 |
 | Conditioning | AdaLN-Single (23% parameter reduction) |
 
 **Router Training**