Commit 9e7c0a4 by AbstractPhil (verified) · 1 Parent(s): f924f65

Update README.md

  1. README.md +426 -3
README.md CHANGED
@@ -1,3 +1,426 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+ # Flow Matching & Diffusion Prediction Types
5
+ ## A Practical Guide to Sol, Lune, and Epsilon Prediction
6
+
7
+ ---
8
+
9
+ ## Overview
10
+
11
+ This document covers three prediction paradigms used in diffusion and flow-matching models. Each was designed for a different purpose and requires its own sampling procedure.
12
+
13
+ | Model | Prediction Type | What It Learned | Output Character |
14
+ |-------|----------------|-----------------|------------------|
15
+ | **Standard SD1.5** | ε (epsilon/noise) | Remove noise | General purpose |
16
+ | **Sol** | v (velocity) via DDPM | Geometric structure | Flat silhouettes, mass placement |
17
+ | **Lune** | v (velocity) via flow | Texture and detail | Rich, detailed images |
18
+
19
+ ---
20
+
21
+ SD15-Flow-Sol (velocity prediction, converted to epsilon for sampling):
22
+
23
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/FeF5L08KaozTq8X4TXaTU.png)
24
+
25
+ SD15-Flow-Lune (rectified flow shift=3):
26
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/a33DpYjD_cwdfXm43SlS8.png)
27
+
28
+ TinyFlux-Lailah
29
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/9Ek_vTrMDQUA1id37Lwys.png)
30
+
31
+
32
+ ## 1. Epsilon (ε) Prediction — Standard Diffusion
33
+
34
+ ### Core Concept
35
+ > **"Predict the noise that was added"**
36
+
37
+ The model learns to identify and remove noise from corrupted images.
38
+
39
+ ### The Formula (Simplified)
40
+
41
+ ```
42
+ TRAINING:
43
+ x_noisy = √(α) * x_clean + √(1-α) * noise
44
+
45
+ Model predicts: ε̂ = "what noise was added?"
46
+
47
+ Loss = ||ε̂ - noise||²
48
+
49
+ SAMPLING:
50
+ Start with pure noise
51
+ Repeatedly ask: "what noise is in this?"
52
+ Subtract a fraction of predicted noise
53
+ Repeat until clean
54
+ ```
55
+
56
+ ### Reading the Math
57
+
58
+ - **α (alpha)**: "How much original image remains" (1 = all original, 0 = all noise); in the code this is the cumulative value `alphas_cumprod[t]`
59
+ - **√(1-α)**: "How much noise was mixed in"
60
+ - **ε**: The actual noise that was added
61
+ - **ε̂**: Model's guess of what noise was added
62
+
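The blend above can be checked on a single number. The following is a minimal sketch using only the standard library; `add_noise` and `recover_clean` are hypothetical helper names, and the true noise stands in for a model prediction, so it illustrates the algebra rather than any real training code.

```python
import math
import random


def add_noise(x_clean, noise, alpha_bar):
    """Forward diffusion for one value: blend signal and noise at level alpha_bar."""
    return math.sqrt(alpha_bar) * x_clean + math.sqrt(1.0 - alpha_bar) * noise


def recover_clean(x_noisy, noise_pred, alpha_bar):
    """Invert the blend given a noise estimate (here: the true noise)."""
    return (x_noisy - math.sqrt(1.0 - alpha_bar) * noise_pred) / math.sqrt(alpha_bar)


random.seed(0)
x_clean = 0.8
noise = random.gauss(0.0, 1.0)

# High alpha_bar keeps mostly signal; low alpha_bar is mostly noise.
for alpha_bar in (0.99, 0.5, 0.01):
    x_noisy = add_noise(x_clean, noise, alpha_bar)
    x_back = recover_clean(x_noisy, noise, alpha_bar)
    print(f"alpha_bar={alpha_bar:.2f}  x_noisy={x_noisy:+.4f}  recovered={x_back:+.4f}")
```

With a perfect noise estimate the clean value is recovered at every noise level; a real model only approximates the noise, which is why sampling proceeds in many small steps.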
63
+ ### Training Process
64
+
65
+ ```python
66
+ # Forward diffusion (corruption)
67
+ noise = torch.randn_like(x_clean)
68
+ α = scheduler.alphas_cumprod[t]
69
+ x_noisy = α.sqrt() * x_clean + (1 - α).sqrt() * noise
70
+
71
+ # Model predicts noise
72
+ ε_pred = model(x_noisy, t)
73
+
74
+ # Loss: "Did you correctly identify the noise?"
75
+ loss = torch.nn.functional.mse_loss(ε_pred, noise)
76
+ ```
77
+
78
+ ### Sampling Process
79
+
80
+ ```python
81
+ # DDPM/DDIM sampling
82
+ for t in reversed(timesteps): # 999 → 0
83
+     ε_pred = model(x, t)
84
+     x = scheduler.step(ε_pred, t, x).prev_sample # Removes predicted noise
85
+ ```
86
+
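To make "subtract a fraction of predicted noise" concrete, here is a scalar sketch of one deterministic DDIM-style update (no added randomness). It is an illustration under stated assumptions, not the scheduler's actual implementation: `ddim_step` is a hypothetical helper, the `alpha_bars` list is a made-up toy schedule, and the true noise replaces the model call.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM-style update for a single value."""
    # Clean value implied by the current noise prediction.
    x0_hat = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-blend that estimate at the next, less noisy level.
    return math.sqrt(alpha_bar_prev) * x0_hat + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


x_clean, noise = 0.8, -1.3
alpha_bars = [0.01, 0.2, 0.6, 0.95, 1.0]  # hypothetical toy schedule, noisy -> clean

# Start from a heavily noised value.
x = math.sqrt(alpha_bars[0]) * x_clean + math.sqrt(1.0 - alpha_bars[0]) * noise

for a_t, a_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
    eps_pred = noise  # oracle stand-in for model(x, t)
    x = ddim_step(x, eps_pred, a_t, a_prev)

print(f"final x = {x:.6f}, clean value = {x_clean}")
```

Because the stand-in prediction is exact, the loop lands on the clean value; with a learned model each step only moves part of the way, and the estimate is refined as the noise level drops.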
87
+ ### Utility & Behavior
88
+
89
+ - **Strength**: General-purpose image generation
90
+ - **Weakness**: No explicit understanding of image structure
91
+ - **Use case**: Standard text-to-image generation
92
+
93
+ ---
94
+
95
+ ## 2. Velocity (v) Prediction — Sol (DDPM Framework)
96
+
97
+ ### Core Concept
98
+ > **"Predict the direction from noise to data"**
99
+
100
+ Sol predicts velocity but operates within the DDPM scheduler framework, requiring conversion from velocity to epsilon for sampling.
101
+
102
+ ### The Formula (Simplified)
103
+
104
+ ```
105
+ TRAINING:
106
+ x_t = α * x_clean + σ * noise (same as DDPM)
107
+ v = α * noise - σ * x_clean (velocity target)
108
+
109
+ Model predicts: v̂ = "which way is the image?"
110
+
111
+ Loss = ||v̂ - v||²
112
+
113
+ SAMPLING:
114
+ Convert velocity → epsilon
115
+ Use standard DDPM scheduler stepping
116
+ ```
117
+
118
+ ### Reading the Math
119
+
120
+ - **v (velocity)**: Direction vector in latent space
121
+ - **α (alpha)**: √(α_cumprod) — signal strength
122
+ - **σ (sigma)**: √(1 - α_cumprod) — noise strength
123
+ - **The velocity formula**: `v = α * ε - σ * x₀`
124
+ - "Velocity is the signal-weighted noise minus noise-weighted data"
125
+
126
+ ### Why Velocity in DDPM?
127
+
128
+ Sol was trained with David (the geometric assessor) providing loss weighting. This setup used:
129
+ - DDPM noise schedule for interpolation
130
+ - Velocity prediction for training target
131
+ - Knowledge distillation from a teacher
132
+
133
+ The result: Sol learned **geometric structure** rather than textures.
134
+
135
+ ### Training Process (David-Weighted)
136
+
137
+ ```python
138
+ # DDPM-style corruption
139
+ noise = torch.randn_like(latents)
140
+ t = torch.randint(0, 1000, (batch,))
141
+ α = scheduler.alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
142
+ σ = (1 - scheduler.alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
143
+
144
+ x_t = α * latents + σ * noise
145
+
146
+ # Velocity target (NOT epsilon!)
147
+ v_target = α * noise - σ * latents
148
+
149
+ # Model predicts velocity
150
+ v_pred = model(x_t, t)
151
+
152
+ # David assesses geometric quality → adjusts loss weights
153
+ loss_weights = david_assessor(features, t)
154
+ loss = weighted_MSE(v_pred, v_target, loss_weights)
155
+ ```
156
+
157
+ ### Sampling Process (CRITICAL: v → ε conversion)
158
+
159
+ ```python
160
+ # Must convert velocity to epsilon for DDPM scheduler
161
+ scheduler = DDPMScheduler(num_train_timesteps=1000)
162
+
163
+ for t in scheduler.timesteps: # 999, 966, 933, ... → 0
164
+     v_pred = model(x, t)
165
+
166
+     # Convert velocity → epsilon
167
+     α = scheduler.alphas_cumprod[t].sqrt()
168
+     σ = (1 - scheduler.alphas_cumprod[t]).sqrt()
169
+
170
+     # Solve: v = α*ε - σ*x₀ and x_t = α*x₀ + σ*ε
171
+     # Result: x₀ = (α*x_t - σ*v) / (α² + σ²)
172
+     # ε = (x_t - α*x₀) / σ
173
+
174
+     x0_hat = (α * x - σ * v_pred) / (α**2 + σ**2) # α² + σ² = 1 for this schedule
175
+     ε_hat = (x - α * x0_hat) / σ
176
+
177
+     x = scheduler.step(ε_hat, t, x).prev_sample # Standard DDPM step with epsilon
178
+ ```
179
+
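The conversion can be sanity-checked with a scalar round trip. This sketch assumes α = √(α_cumprod) and σ = √(1 - α_cumprod) as defined above; `velocity_target` and `v_to_x0_eps` are hypothetical helper names. It also shows the equivalent one-line form ε = σ·x_t + α·v, which follows from α² + σ² = 1.

```python
import math


def velocity_target(x0, eps, alpha, sigma):
    """Sol-style training target: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0


def v_to_x0_eps(x_t, v, alpha, sigma):
    """Recover the clean value and the noise from a velocity prediction."""
    denom = alpha**2 + sigma**2  # equals 1 for alpha = sqrt(a_bar), sigma = sqrt(1 - a_bar)
    x0_hat = (alpha * x_t - sigma * v) / denom
    eps_hat = (x_t - alpha * x0_hat) / sigma
    return x0_hat, eps_hat


alpha_bar = 0.3
alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

x0, eps = 0.8, -1.3
x_t = alpha * x0 + sigma * eps
v = velocity_target(x0, eps, alpha, sigma)

x0_hat, eps_hat = v_to_x0_eps(x_t, v, alpha, sigma)
eps_short = sigma * x_t + alpha * v  # same result in one line

print(f"x0_hat={x0_hat:.6f}  eps_hat={eps_hat:.6f}  eps_short={eps_short:.6f}")
```

Both routes return the original noise, so either form can feed an epsilon-based scheduler step; the two-step form has the side benefit of exposing the clean-image estimate.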
180
+ ### Utility & Behavior
181
+
182
+ - **What Sol learned**: Platonic forms, silhouettes, mass distribution
183
+ - **Visual output**: Flat geometric shapes, correct spatial layout, no texture
184
+ - **Why this happened**: David rewarded geometric coherence, so Sol optimized for clean David classification
185
+ - **Use case**: Structural guidance, composition anchoring, "what goes where"
186
+
187
+ ### Sol's Unique Property
188
+
189
+ Sol never "collapsed" — it learned the **skeleton** of images:
190
+ - Castle prompt → Castle silhouette, horizon line, sky gradient
191
+ - Portrait prompt → Head oval, shoulder mass, figure-ground separation
192
+ - City prompt → Building masses, street perspective, light positions
193
+
194
+ This is the "WHAT before HOW" that most diffusion models skip.
195
+
196
+ ---
197
+
198
+ ## 3. Velocity (v) Prediction — Lune (Rectified Flow)
199
+
200
+ ### Core Concept
201
+ > **"Predict the straight-line velocity between data and noise"**
202
+
203
+ Lune uses true rectified flow matching where data travels in straight lines through latent space.
204
+
205
+ ### The Formula (Simplified)
206
+
207
+ ```
208
+ TRAINING:
209
+ x_t = σ * noise + (1-σ) * data (linear interpolation)
210
+ v = noise - data (constant velocity)
211
+
212
+ Model predicts: v̂ = "straight line to noise"
213
+
214
+ Loss = ||v̂ - v||²
215
+
216
+ SAMPLING:
217
+ Start at σ=1 (noise)
218
+ Walk OPPOSITE to velocity (toward data)
219
+ End at σ=0 (clean image)
220
+ ```
221
+
222
+ ### Reading the Math
223
+
224
+ - **σ (sigma)**: Interpolation parameter (1 = noise, 0 = data)
225
+ - **x_t = σ·noise + (1-σ)·data**: Linear blend between noise and data
226
+ - **v = noise - data**: The velocity is CONSTANT along the path
227
+ - **Shift function**: `σ' = shift·σ / (1 + (shift-1)·σ)`
228
+   - For shift > 1 this pushes every σ toward 1 (shift = 3 maps σ = 0.5 to 0.75), so the schedule spends more of its steps at high noise levels
229
+
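The effect of the shift is easiest to see on a short schedule. The sketch below is a plain-Python reading of the formula above; the name `shift_sigma` mirrors the helper called in the sampling code, but its body here is an assumption derived from the formula, not the author's implementation.

```python
def shift_sigma(sigma, shift=3.0):
    """Shift function: sigma' = shift * sigma / (1 + (shift - 1) * sigma)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


steps = 4
sigmas = [1.0 - i / steps for i in range(steps + 1)]  # 1.0, 0.75, 0.5, 0.25, 0.0
shifted = [shift_sigma(s) for s in sigmas]
step_sizes = [a - b for a, b in zip(shifted[:-1], shifted[1:])]

print([round(s, 4) for s in shifted])     # [1.0, 0.9, 0.75, 0.5, 0.0]
print([round(d, 4) for d in step_sizes])  # [0.1, 0.15, 0.25, 0.5]
```

The endpoints stay at 1 and 0, while the intermediate values move up: the first steps are small (fine resolution near pure noise) and the last step is the largest.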
230
+ ### Key Difference from Sol
231
+
232
+ | Aspect | Sol | Lune |
233
+ |--------|-----|------|
234
+ | Interpolation | DDPM (α, σ from scheduler) | Linear (σ, 1-σ) |
235
+ | Velocity meaning | Complex (α·ε - σ·x₀) | Simple (noise - data) |
236
+ | Sampling | Convert v→ε, use scheduler | Direct Euler integration |
237
+ | Output | Geometric skeletons | Detailed images |
238
+
239
+ ### Training Process
240
+
241
+ ```python
242
+ # Linear interpolation (NOT DDPM schedule!)
243
+ noise = torch.randn_like(latents)
244
+ σ = torch.rand(batch) # Random sigma in [0, 1]
245
+
246
+ # Apply shift during training
247
+ σ_shifted = (shift * σ) / (1 + (shift - 1) * σ)
248
+ σ = σ_shifted.view(-1, 1, 1, 1)
249
+
250
+ x_t = σ * noise + (1 - σ) * latents
251
+
252
+ # Velocity target: direction FROM data TO noise
253
+ v_target = noise - latents
254
+
255
+ # Model predicts velocity
256
+ v_pred = model(x_t, σ_shifted * 1000) # Timestep = σ * 1000, shape (batch,)
257
+
258
+ loss = torch.nn.functional.mse_loss(v_pred, v_target)
259
+ ```
260
+
261
+ ### Sampling Process (Direct Euler)
262
+
263
+ ```python
264
+ # Start from pure noise (σ = 1)
265
+ x = torch.randn(1, 4, 64, 64)
266
+
267
+ # Sigma schedule: 1 → 0 with shift
268
+ sigmas = torch.linspace(1, 0, steps + 1)
269
+ sigmas = shift * sigmas / (1 + (shift - 1) * sigmas) # shift function, e.g. shift = 3.0
270
+
271
+ for i in range(steps):
272
+     σ = sigmas[i]
272
+     σ_next = sigmas[i + 1]
273
+     dt = σ - σ_next # Positive (going from 1 toward 0)
274
+
275
+     timestep = σ * 1000
276
+     v_pred = model(x, timestep)
277
+
278
+     # SUBTRACT velocity (v points toward noise, we go toward data)
279
+     x = x - v_pred * dt
281
+
282
+ # x is now clean image latent
283
+ ```
284
+
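The loop above can be exercised end to end on one number. In this sketch the model is replaced by an oracle that returns the exact constant velocity `noise - data`; `euler_sample` and `oracle` are hypothetical names, and `shift_sigma` is an assumed implementation of the shift formula.

```python
def shift_sigma(sigma, shift=3.0):
    """Shift function: sigma' = shift * sigma / (1 + (shift - 1) * sigma)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def euler_sample(x_start, velocity_fn, steps=8, shift=3.0):
    """Euler integration from sigma = 1 (noise) down to sigma = 0 (data)."""
    sigmas = [shift_sigma(1.0 - i / steps, shift) for i in range(steps + 1)]
    x = x_start
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        dt = sigma - sigma_next               # positive step size
        v_pred = velocity_fn(x, sigma * 1000)  # timestep convention: sigma * 1000
        x = x - v_pred * dt                    # subtract: v points toward noise
    return x


data, noise = 0.8, -1.3


def oracle(x, timestep):
    """Stand-in for the trained model: the exact straight-line velocity."""
    return noise - data


result = euler_sample(noise, oracle)
print(f"result = {result:.6f}, data = {data}")
```

Since the step sizes sum to 1 and the velocity is constant, the walk ends on the data value regardless of the number of steps or the shift. A learned model's velocity is only approximately constant, which is where step count and schedule shape start to matter.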
285
+ ### Why SUBTRACT the Velocity?
286
+
287
+ ```
288
+ v = noise - data (points FROM data TO noise)
289
+
290
+ We want to go FROM noise TO data (opposite direction!)
291
+
292
+ So: x_new = x_current - v * dt
293
+ = x_current - (noise - data) * dt
294
+ = x_current + (data - noise) * dt ← Moving toward data ✓
295
+ ```
296
+
297
+ ### Utility & Behavior
298
+
299
+ - **What Lune learned**: Rich textures, fine details, realistic rendering
300
+ - **Visual output**: Full detailed images with lighting, materials, depth
301
+ - **Training focus**: Portrait/pose data with caption augmentation
302
+ - **Use case**: High-quality image generation, detail refinement
303
+
304
+ ---
305
+
306
+ ## Comparison Summary
307
+
308
+ ### Training Targets
309
+
310
+ ```
311
+ EPSILON (ε): target = noise
312
+ "What random noise was added?"
313
+
314
+ VELOCITY (Sol): target = α·noise - σ·data
315
+ "What's the DDPM-weighted direction?"
316
+
317
+ VELOCITY (Lune): target = noise - data
318
+ "What's the straight-line direction?"
319
+ ```
320
+
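The three targets can be compared side by side for the same (data, noise) pair. This is a scalar sketch; the function name `training_target` and the labels `"epsilon"`, `"v_ddpm"`, and `"v_flow"` are hypothetical, chosen only for this illustration.

```python
import math


def training_target(kind, data, noise, alpha_bar=None):
    """Return the regression target for one value under each paradigm."""
    if kind == "epsilon":
        return noise
    if kind == "v_ddpm":  # Sol-style: needs the DDPM noise level
        alpha = math.sqrt(alpha_bar)
        sigma = math.sqrt(1.0 - alpha_bar)
        return alpha * noise - sigma * data
    if kind == "v_flow":  # Lune-style: independent of the noise level
        return noise - data
    raise ValueError(f"unknown prediction type: {kind}")


data, noise = 0.8, -1.3
for kind in ("epsilon", "v_ddpm", "v_flow"):
    target = training_target(kind, data, noise, alpha_bar=0.36)
    print(f"{kind:8s} target = {target:+.4f}")
```

Only the Sol-style target changes with the noise level; the epsilon and flow targets are fixed once the (data, noise) pair is drawn.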
321
+ ### Sampling Directions
322
+
323
+ ```
324
+ EPSILON: x_new = scheduler.step(ε_pred, t, x)
325
+ Scheduler handles noise removal internally
326
+
327
+ VELOCITY (Sol): Convert v → ε, then scheduler.step(ε, t, x)
328
+ Must translate to epsilon for DDPM math
329
+
330
+ VELOCITY (Lune): x_new = x - v_pred * dt
331
+ Direct Euler integration, subtract velocity
332
+ ```
333
+
334
+ ### Visual Intuition
335
+
336
+ ```
337
+ EPSILON:
338
+ "There's noise hiding the image"
339
+ "I'll predict and remove the noise layer by layer"
340
+ → General-purpose denoising
341
+
342
+ VELOCITY (Sol):
343
+ "I know which direction the image is"
344
+ "But I speak through DDPM's noise schedule"
345
+ → Learned structure, outputs skeletons
346
+
347
+ VELOCITY (Lune):
348
+ "Straight line from noise to image"
349
+ "I'll walk that line step by step"
350
+ → Learned detail, outputs rich images
351
+ ```
352
+
353
+ ---
354
+
355
+ ## Practical Implementation Checklist
356
+
357
+ ### For Epsilon Models (Standard SD1.5)
358
+ - [ ] Use DDPM/DDIM/Euler scheduler
359
+ - [ ] Pass timestep as an integer in [0, 999]
360
+ - [ ] Scheduler handles everything
361
+
362
+ ### For Sol (Velocity + DDPM)
363
+ - [ ] Use DDPMScheduler
364
+ - [ ] Model outputs velocity, NOT epsilon
365
+ - [ ] Convert: `x0 = (α·x - σ·v) / (α² + σ²)`, then `ε = (x - α·x0) / σ`
366
+ - [ ] Call `scheduler.step(ε, t, x)`
367
+ - [ ] Expect geometric/structural output
368
+
369
+ ### For Lune (Velocity + Flow)
370
+ - [ ] NO scheduler needed — direct Euler
371
+ - [ ] Sigma goes 1 → 0 (not 0 → 1!)
372
+ - [ ] Apply shift: `σ' = shift·σ / (1 + (shift-1)·σ)`
373
+ - [ ] Timestep to model: `σ * 1000`
374
+ - [ ] SUBTRACT velocity: `x = x - v * dt`
375
+ - [ ] Expect detailed textured output
376
+
377
+ ---
378
+
379
+ ## Why This Matters for TinyFlux
380
+
381
+ TinyFlux can leverage both experts:
382
+
383
+ 1. **Sol (early sampling steps, high noise)**: Provides geometric anchoring
384
+ - "Where should the castle be?"
385
+ - "What's the horizon line?"
386
+ - "How is mass distributed?"
387
+
388
+ 2. **Lune (mid/late sampling steps, lower noise)**: Provides detail refinement
389
+ - "What texture is the stone?"
390
+ - "How does light fall?"
391
+ - "What color is the sky?"
392
+
393
+ By combining geometric structure (Sol) with textural detail (Lune), TinyFlux can achieve better composition AND quality than either alone.
394
+
395
+ ---
396
+
397
+ ## Quick Reference Card
398
+
399
+ ```
400
+ ┌─────────────────────────────────────────────────────────────┐
401
+ │                      PREDICTION TYPES                       │
402
+ ├─────────────────────────────────────────────────────────────┤
403
+ │ EPSILON (ε)                                                 │
404
+ │   Train:  target = noise                                    │
405
+ │   Sample: scheduler.step(ε_pred, t, x)                      │
406
+ │   Output: General images                                    │
407
+ ├─────────────────────────────────────────────────────────────┤
408
+ │ VELOCITY - SOL (DDPM framework)                             │
409
+ │   Train:  target = α·ε - σ·x₀                               │
410
+ │   Sample: v→ε conversion, then scheduler.step(ε, t, x)      │
411
+ │   Output: Geometric skeletons                               │
412
+ ├─────────────────────────────────────────────────────────────┤
413
+ │ VELOCITY - LUNE (Rectified Flow)                            │
414
+ │   Train:  target = noise - data                             │
415
+ │   Sample: x = x - v·dt (Euler, σ: 1→0)                      │
416
+ │   Output: Detailed textured images                          │
417
+ └─────────────────────────────────────────────────────────────┘
418
+ ```
419
+
420
+ ---
421
+
422
+ *Document Version: 1.0*
423
+ *Last Updated: January 2026*
424
+ *Authors: AbstractPhil & Claude OPUS 4.5*
425
+
426
+ License: MIT