Bagel Labs committed on
Commit 85d1eb3 · verified · 1 Parent(s): 15a8def

Update README.md

Files changed (1)
  1. README.md +3 -29
README.md CHANGED
@@ -31,11 +31,11 @@ The world's first diffusion model trained entirely through decentralized computa
 
  - 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
  - No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- - Lightweight transformer router (~158M parameters) for dynamic expert selection
  - 11M LAION-Aesthetic images across 120 A40 GPU-days
  - 14× less training data than prior decentralized baselines
  - 16× less compute than prior decentralized baselines
- - Competitive generation quality (FID 12.45)
  - Open weights for research and commercial use under MIT license
 
  ---
@@ -55,7 +55,7 @@ The world's first diffusion model trained entirely through decentralized computa
  | **Model Scale** | DiT-XL/2 |
  | **Parameters per Expert** | 605M |
  | **Total Expert Parameters** | 4.84B (8 experts) |
- | **Router Parameters** | ~158M |
  | **Hidden Dimensions** | 1152 |
  | **Transformer Layers** | 28 |
  | **Attention Heads** | 16 |
@@ -94,31 +94,6 @@ This zero-communication approach enables training on fragmented compute resource
 
  ---
 
- # Usage
-
- ```python
- from diffusers import DiffusionPipeline
- import torch
-
- # Load the pipeline
- pipeline = DiffusionPipeline.from_pretrained(
- "bageldotcom/paris",
- torch_dtype=torch.float16,
- use_safetensors=True
- )
- pipeline.to("cuda")
-
- # Generate images
- images = pipeline(
- prompt="A beautiful sunset over Paris, oil painting style",
- num_inference_steps=50,
- guidance_scale=7.5,
- height=256,
- width=256
- ).images
-
- images[0].save("output.png")
- ```
 
  ### Routing Strategies
 
@@ -156,7 +131,6 @@ images[0].save("output.png")
  | Training Steps | ~120k total across experts (asynchronous) |
  | EMA Decay | 0.9999 |
  | Mixed Precision | FP16 with automatic loss scaling |
- | Initialization | ImageNet-pretrained DiT-XL/2 |
  | Conditioning | AdaLN-Single (23% parameter reduction) |
 
  **Router Training**
 
31
 
32
  - 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
33
  - No gradient synchronization, parameter sharing, or activation exchange among nodes during training
34
+ - Lightweight transformer router (~129M parameters) for dynamic expert selection
35
  - 11M LAION-Aesthetic images across 120 A40 GPU-days
36
  - 14× less training data than prior decentralized baselines
37
  - 16× less compute than prior decentralized baselines
38
+ - Competitive generation quality (FID 12.45 on DiTExpert XL/2)
39
  - Open weights for research and commercial use under MIT license
40
 
41
  ---
 
55
  | **Model Scale** | DiT-XL/2 |
56
  | **Parameters per Expert** | 605M |
57
  | **Total Expert Parameters** | 4.84B (8 experts) |
58
+ | **Router Parameters** | ~129M |
59
  | **Hidden Dimensions** | 1152 |
60
  | **Transformer Layers** | 28 |
61
  | **Attention Heads** | 16 |
 
94
 
95
  ---
96
 
97
 
98
  ### Routing Strategies
99
 
 
131
  | Training Steps | ~120k total across experts (asynchronous) |
132
  | EMA Decay | 0.9999 |
133
  | Mixed Precision | FP16 with automatic loss scaling |
 
134
  | Conditioning | AdaLN-Single (23% parameter reduction) |
135
 
136
  **Router Training**