- AbstractPhil/imagenet-synthetic
---

# TinyFlux-Deep v4.1 (Lailah)

A compact **246M parameter** flow-matching diffusion model that distills knowledge from multiple teacher models into an efficient architecture. TinyFlux-Deep uses a dual expert system to capture both trajectory dynamics (from SD1.5) and structural attention patterns (from a geometric prior), enabling high-quality image generation at a fraction of the compute cost of full-scale models.

## Table of Contents

- [Key Features](#key-features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Dual Expert System](#dual-expert-system)
- [Configuration](#configuration)
- [Inference](#inference)
- [Training](#training)
- [Checkpoint Conversion](#checkpoint-conversion)
- [Repository Structure](#repository-structure)
- [API Reference](#api-reference)
- [Samples](#samples)
- [Limitations](#limitations)
- [Citation](#citation)

---

## Key Features

| Feature | Description |
|---------|-------------|
| **Compact Size** | 246M params (~500MB bf16) vs Flux's 12B (~24GB) |
| **Dual Expert Distillation** | Learns from both SD1.5 trajectory features and geometric attention priors |
| **Flow Matching** | Rectified flow objective with Flux-style timestep shifting |
| **T5 + CLIP Conditioning** | Dual text encoder pathway with learnable balance |
| **Huber Loss** | Robust training that handles outliers gracefully |
| **Identity-Init Conversion** | v3→v4 conversion preserves pretrained weights exactly |

---

## Quick Start

### Colab Inference

```python
!pip install torch transformers safetensors huggingface_hub accelerate

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download model code and weights
model_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/model_v4.py")
weights = hf_hub_download("AbstractPhil/tiny-flux-deep", "model.safetensors")

# Load model
exec(open(model_py).read())
config = TinyFluxConfig()
model = TinyFluxDeep(config).to("cuda", torch.bfloat16)
model.load_state_dict(load_file(weights), strict=False)
model.eval()

# For full inference pipeline with text encoders and sampling:
inference_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/inference_v3.py")
exec(open(inference_py).read())
# Then call: image = generate("your prompt here")
```

### Minimal Generation Loop

```python
import torch

def flux_shift(t, s=3.0):
    """Flux-style timestep shifting - biases toward data end."""
    return s * t / (1 + (s - 1) * t)

def generate(model, t5_emb, clip_pooled, num_steps=25, cfg_scale=4.0):
    """Euler sampling with classifier-free guidance."""
    device = next(model.parameters()).device
    dtype = next(model.parameters()).dtype

    # Start from noise
    x = torch.randn(1, 64*64, 16, device=device, dtype=dtype)
    img_ids = TinyFluxDeep.create_img_ids(1, 64, 64, device)

    # Timesteps with Flux shift
    timesteps = flux_shift(torch.linspace(1, 0, num_steps + 1, device=device))

    for i in range(num_steps):
        t_curr = timesteps[i]
        t_next = timesteps[i + 1]
        dt = t_next - t_curr

        t_batch = t_curr.expand(1)

        # Conditional prediction
        v_cond = model(
            hidden_states=x,
            encoder_hidden_states=t5_emb,
            pooled_projections=clip_pooled,
            timestep=t_batch,
            img_ids=img_ids,
        )

        # Unconditional prediction (for CFG)
        v_uncond = model(
            hidden_states=x,
            encoder_hidden_states=torch.zeros_like(t5_emb),
            pooled_projections=torch.zeros_like(clip_pooled),
            timestep=t_batch,
            img_ids=img_ids,
        )

        # Classifier-free guidance
        v = v_uncond + cfg_scale * (v_cond - v_uncond)

        # Euler step
        x = x + v * dt

    return x  # [1, 4096, 16] - decode with VAE
```
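As a quick sanity check on the shift used above: `flux_shift` fixes both endpoints and pulls interior timesteps toward the data end, which is what concentrates sampling steps where they matter.

```python
import torch

def flux_shift(t, s=3.0):
    # Same formula as in the loop above
    return s * t / (1 + (s - 1) * t)

t = torch.linspace(0, 1, 5)   # [0.00, 0.25, 0.50, 0.75, 1.00]
shifted = flux_shift(t)
# Endpoints are fixed; interior points are pulled upward
assert shifted[0] == 0 and shifted[-1] == 1
assert torch.all(shifted[1:-1] > t[1:-1])
```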

---

## Architecture

### Model Comparison

| Component | TinyFlux | TinyFlux-Deep v3 | TinyFlux-Deep v4.1 | Flux-Schnell |
|-----------|----------|------------------|--------------------|--------------|
| Hidden size | 256 | 512 | 512 | 3072 |
| Attention heads | 2 | 4 | 4 | 24 |
| Head dimension | 128 | 128 | 128 | 128 |
| Double-stream layers | 3 | 15 | 15 | 19 |
| Single-stream layers | 3 | 25 | 25 | 38 |
| MLP ratio | 4.0 | 4.0 | 4.0 | 4.0 |
| RoPE dims | (16,56,56) | (16,56,56) | (16,56,56) | (16,56,56) |
| Lune Expert | ✗ | ✓ | ✓ | ✗ |
| Sol Attention Prior | ✗ | ✗ | ✓ | ✗ |
| T5 Vec Enhancement | ✗ | ✗ | ✓ | ✗ |
| **Total Parameters** | ~10.7M | ~244.7M | ~246.4M | ~12B |
| **Memory (bf16)** | ~22MB | ~490MB | ~493MB | ~24GB |

### Block Structure

**Double-Stream Blocks (15 layers):**
- Separate text and image pathways
- Joint attention between modalities
- AdaLN-Zero conditioning from vec
- Sol spatial modulation on image Q/K only

**Single-Stream Blocks (25 layers):**
- Concatenated text + image sequence
- Full self-attention with RoPE
- Sol modulation skips text tokens

```
Input: img_latents [B, 4096, 16], t5_emb [B, 77, 768], clip_pooled [B, 768]
                │
        ┌───────┴────────┐
        ▼                ▼
  img_in (Linear)   txt_in (Linear)
        │                │
        ▼                ▼
  [B, 4096, 512]    [B, 77, 512]
        │                │
        └───────┬────────┘
                │
  vec = time_emb + clip_vec + t5_vec + lune_signal
                │
        ┌───────┴────────┐
        ▼                ▼
  Double Blocks (×15)   Sol Prior → temperature, spatial_mod
        │                │
        ▼                │
  Single Blocks (×25) ◄──┘
        │
        ▼
  final_norm → final_linear
        │
        ▼
  Output: [B, 4096, 16]
```

---

## Dual Expert System

TinyFlux-Deep v4.1 uses two complementary expert pathways to inject knowledge from teacher models without the "twin-tail paradox" (mixing incompatible prediction targets).

### Lune Expert Predictor (Trajectory Guidance)

**Purpose:** Captures SD1.5's understanding of "how denoising should flow" - the trajectory through latent space.

**Architecture:**
```python
LuneExpertPredictor(
    time_dim=512,     # From timestep MLP
    clip_dim=768,     # CLIP pooled features
    expert_dim=1280,  # SD1.5 mid-block dimension (prediction target)
    hidden_dim=512,   # Internal MLP width
    output_dim=512,   # Output added to vec
    dropout=0.1,
)
```

**How it works:**
1. Concatenates timestep embedding + CLIP pooled → hidden
2. Predicts what SD1.5's mid-block features would be
3. During training: uses real SD1.5 features when available
4. During inference: uses predicted features
5. Gates output with learnable sigmoid (init 0.5)
6. Adds `expert_signal` to global `vec` conditioning

**Training signal:** Cosine similarity loss against real SD1.5 UNet mid-block features (soft directional matching, not exact reconstruction).
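The predict-then-gate flow in steps 1-6 can be sketched in a few lines. `GatedExpertSignal` and its layer layout are illustrative stand-ins, not the repository's actual `LuneExpertPredictor` implementation; only the dimensions and the sigmoid gate initialized at 0.5 follow the description above.

```python
import torch
import torch.nn as nn

class GatedExpertSignal(nn.Module):
    """Sketch: predict teacher features, then gate the projected signal into vec."""
    def __init__(self, time_dim=512, clip_dim=768, hidden_dim=512,
                 expert_dim=1280, output_dim=512):
        super().__init__()
        self.predict = nn.Sequential(
            nn.Linear(time_dim + clip_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, expert_dim),
        )
        self.project = nn.Linear(expert_dim, output_dim)
        # sigmoid(0.0) = 0.5 -> matches the "init 0.5" gate described above
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, time_emb, clip_pooled, teacher_features=None):
        pred = self.predict(torch.cat([time_emb, clip_pooled], dim=-1))
        # Training: prefer real SD1.5 features; inference: fall back to prediction
        feats = teacher_features if teacher_features is not None else pred
        signal = torch.sigmoid(self.gate) * self.project(feats)
        return signal, pred  # signal is added to vec; pred is distilled

sig = GatedExpertSignal()
time_emb, clip_pooled = torch.randn(2, 512), torch.randn(2, 768)
signal, pred = sig(time_emb, clip_pooled)  # [2, 512], [2, 1280]
```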

### Sol Attention Prior (Structural Guidance)

**Purpose:** Captures geometric/structural knowledge about WHERE attention should focus, without injecting incompatible features.

**Key insight:** Sol (a V-prediction DDPM model) has valuable attention patterns, but its features are incompatible with TinyFlux's linear flow matching. We extract attention *statistics* instead:
- **Locality:** How local vs global is attention?
- **Entropy:** How focused vs diffuse?
- **Clustering:** How structured vs uniform?
- **Spatial importance:** Which regions matter most?

**Architecture:**
```python
SolAttentionPrior(
    time_dim=512,
    clip_dim=768,
    hidden_dim=256,
    num_heads=4,           # Matches TinyFlux attention heads
    spatial_size=8,        # 8×8 importance map
    geometric_weight=0.7,  # David's 70/30 split
)
```

**How it works:**
1. **Geometric prior (70%):** Timestep-based heuristics
   - Early denoising (high t): Higher temperature → softer, global attention
   - Late denoising (low t): Lower temperature → sharper, local attention
   - Spatial: Uniform early, center-biased late

2. **Learned prior (30%):** Content-based predictions
   - Predicts attention statistics from (timestep, CLIP)
   - Predicts spatial importance map

3. **Blending:** `blend * geometric + (1 - blend) * learned` with a learnable blend gate

4. **Output application:**
   - `temperature [B, 4]`: Scales attention logits per head
   - `spatial_mod [B, H, W]`: Modulates Q/K at each position via `exp(conv(spatial))`

**Identity initialization:** All Sol components initialize to zero-effect:
- `spatial_to_mod` Conv2d: zero weight, zero bias → `exp(0) = 1` (identity)
- Allows gradual learning without disrupting pretrained attention
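The zero-init identity trick can be verified directly. The kernel size and channel counts here are illustrative, not taken from the repository; the point is that a zero-initialized conv followed by `exp` multiplies attention by exactly 1 at step zero.

```python
import torch
import torch.nn as nn

# Zero-initialized conv so exp(conv(x)) starts as an exact identity multiplier
spatial_to_mod = nn.Conv2d(1, 1, kernel_size=3, padding=1)
nn.init.zeros_(spatial_to_mod.weight)
nn.init.zeros_(spatial_to_mod.bias)

spatial = torch.randn(2, 1, 8, 8)          # 8×8 importance map
mod = torch.exp(spatial_to_mod(spatial))   # all ones at init
assert torch.allclose(mod, torch.ones_like(mod))
```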

### T5 Vec Enhancement

**Purpose:** Adds T5's semantic understanding to the global conditioning pathway (previously only CLIP pooled).

```python
# Attention-weighted pooling of T5 sequence
t5_attn = softmax(t5_emb.mean(dim=-1))                   # [B, 77]
t5_pooled = (t5_emb * t5_attn.unsqueeze(-1)).sum(dim=1)  # [B, 768]
t5_vec = t5_pool_mlp(t5_pooled)                          # [B, 512]

# Learnable balance between CLIP and T5
balance = sigmoid(text_balance)  # Initialized to 0.5
text_vec = balance * clip_vec + (1 - balance) * t5_vec
```

---

## Configuration

### TinyFluxConfig

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TinyFluxConfig:
    # Core architecture
    hidden_size: int = 512
    num_attention_heads: int = 4
    attention_head_dim: int = 128     # hidden_size = heads × head_dim
    in_channels: int = 16             # VAE latent channels
    patch_size: int = 1
    joint_attention_dim: int = 768    # T5 embedding dim
    pooled_projection_dim: int = 768  # CLIP pooled dim
    num_double_layers: int = 15
    num_single_layers: int = 25
    mlp_ratio: float = 4.0
    axes_dims_rope: Tuple[int, int, int] = (16, 56, 56)  # Must sum to head_dim

    # Lune expert predictor
    use_lune_expert: bool = True
    lune_expert_dim: int = 1280  # SD1.5 mid-block dim
    lune_hidden_dim: int = 512
    lune_dropout: float = 0.1

    # Sol attention prior
    use_sol_prior: bool = True
    sol_spatial_size: int = 8    # 8×8 spatial importance map
    sol_hidden_dim: int = 256
    sol_geometric_weight: float = 0.7  # 70% geometric, 30% learned

    # T5 enhancement
    use_t5_vec: bool = True
    t5_pool_mode: str = "attention"  # "attention", "mean", "cls"

    # Loss configuration
    lune_distill_mode: str = "cosine"  # "hard", "soft", "cosine", "huber"
    use_huber_loss: bool = True
    huber_delta: float = 0.1

    # Legacy compatibility
    guidance_embeds: bool = False
```

### Loading from JSON

```python
# From file
config = TinyFluxConfig.from_json("lailah_401434_v4_config.json")

# From dict
config = TinyFluxConfig.from_dict({
    "hidden_size": 512,
    "num_attention_heads": 4,
    ...
})

# Save with metadata
config.save_json("config.json", metadata={"source_step": 401434})
```

### Validation

```python
# Config validates constraints on creation
config = TinyFluxConfig(hidden_size=512, num_attention_heads=4, attention_head_dim=128)
# ✓ OK: 512 = 4 × 128

config = TinyFluxConfig(hidden_size=512, num_attention_heads=4, attention_head_dim=64)
# ✗ ValueError: hidden_size (512) must equal num_attention_heads * attention_head_dim (256)

# Validate checkpoint compatibility
warnings = config.validate_checkpoint(state_dict)
if warnings:
    print("Warnings:", warnings)
```

---

## Inference

### Full Pipeline

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from PIL import Image

# Load text encoders
t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_model = T5EncoderModel.from_pretrained("google/flan-t5-base").to("cuda", torch.bfloat16)

clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda", torch.bfloat16)

# Load VAE
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    subfolder="vae",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Load TinyFlux-Deep
model_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/model_v4.py")
exec(open(model_py).read())

config = TinyFluxConfig()
model = TinyFluxDeep(config).to("cuda", torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux-deep", "model.safetensors"))
model.load_state_dict(weights, strict=False)
model.eval()

def flux_shift(t, s=3.0):
    """Flux-style timestep shifting."""
    return s * t / (1 + (s - 1) * t)

def encode_prompt(prompt):
    """Encode prompt with both T5 and CLIP."""
    # T5
    t5_tokens = t5_tokenizer(prompt, return_tensors="pt", padding="max_length",
                             max_length=77, truncation=True).to("cuda")
    with torch.no_grad():
        t5_emb = t5_model(**t5_tokens).last_hidden_state.to(torch.bfloat16)

    # CLIP
    clip_tokens = clip_tokenizer(prompt, return_tensors="pt", padding="max_length",
                                 max_length=77, truncation=True).to("cuda")
    with torch.no_grad():
        clip_out = clip_model(**clip_tokens)
        clip_pooled = clip_out.pooler_output.to(torch.bfloat16)

    return t5_emb, clip_pooled

def generate_image(prompt, num_steps=25, cfg_scale=4.0, seed=None):
    """Generate image from text prompt."""
    if seed is not None:
        torch.manual_seed(seed)

    t5_emb, clip_pooled = encode_prompt(prompt)

    # Null embeddings for CFG
    t5_null, clip_null = encode_prompt("")

    # Start from noise
    x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
    img_ids = TinyFluxDeep.create_img_ids(1, 64, 64, "cuda")

    # Flux-shifted timesteps
    timesteps = flux_shift(torch.linspace(1, 0, num_steps + 1, device="cuda"))

    with torch.no_grad():
        for i in range(num_steps):
            t = timesteps[i].expand(1)
            dt = timesteps[i + 1] - timesteps[i]

            # Conditional
            v_cond = model(x, t5_emb, clip_pooled, t, img_ids)

            # Unconditional
            v_uncond = model(x, t5_null, clip_null, t, img_ids)

            # CFG
            v = v_uncond + cfg_scale * (v_cond - v_uncond)

            # Euler step
            x = x + v * dt

    # Decode with VAE
    x = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)  # [B, C, H, W]
    x = x / vae.config.scaling_factor
    with torch.no_grad():
        image = vae.decode(x).sample

    # Convert to PIL
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image[0].permute(1, 2, 0).cpu().float().numpy()
    image = (image * 255).astype("uint8")
    return Image.fromarray(image)

# Generate!
image = generate_image("a photograph of a tiger in natural habitat", seed=42)
image.save("tiger.png")
```

### Batch Inference

```python
def generate_batch(prompts, **kwargs):
    """Generate multiple images."""
    return [generate_image(p, **kwargs) for p in prompts]

images = generate_batch([
    "a red bird with blue beak",
    "a mountain landscape at sunset",
    "an astronaut riding a horse",
], num_steps=25, cfg_scale=4.0)
```

---

## Training

### Loss Computation

```python
# Forward pass with expert info
output, expert_info = model(
    hidden_states=noisy_latents,
    encoder_hidden_states=t5_emb,
    pooled_projections=clip_pooled,
    timestep=timesteps,
    img_ids=img_ids,
    lune_features=sd15_midblock_features,  # From SD1.5 teacher
    sol_stats=sol_attention_stats,         # From Sol teacher (optional)
    sol_spatial=sol_spatial_importance,    # From Sol teacher (optional)
    return_expert_pred=True,
)

# Compute loss
losses = model.compute_loss(
    output=output,
    target=flow_target,  # data - noise for flow matching
    expert_info=expert_info,
    lune_features=sd15_midblock_features,
    sol_stats=sol_attention_stats,
    sol_spatial=sol_spatial_importance,

    # Loss weights
    lune_weight=0.1,   # Weight for Lune distillation
    sol_weight=0.05,   # Weight for Sol distillation

    # Loss options
    use_huber=True,              # Huber loss for main objective (robust to outliers)
    huber_delta=0.1,             # Huber delta (smaller = tighter MSE region)
    lune_distill_mode="cosine",  # "hard", "soft", "cosine", "huber"
    spatial_weighting=True,      # Weight loss by Sol spatial importance
)

# losses dict contains:
# - main: flow matching loss
# - lune_distill: Lune prediction loss
# - sol_stat_distill: Sol statistics prediction loss
# - sol_spatial_distill: Sol spatial prediction loss
# - total: weighted sum

loss = losses['total']
loss.backward()
```

### Distillation Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `"hard"` | MSE against teacher features | Exact reconstruction |
| `"soft"` | Temperature-scaled MSE | Softer matching |
| `"cosine"` | Cosine similarity loss | Directional alignment (recommended) |
| `"huber"` | Huber loss on features | Robust to outliers |
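A minimal sketch of what the recommended `"cosine"` mode computes; the actual `compute_loss` internals may differ, but the idea is to penalize the direction of the predicted teacher features while ignoring their magnitude.

```python
import torch
import torch.nn.functional as F

def cosine_distill_loss(pred, teacher):
    """1 - cosine similarity, averaged over the batch:
    matches feature direction, not scale."""
    return (1 - F.cosine_similarity(pred, teacher, dim=-1)).mean()

pred = torch.randn(4, 1280)     # predicted SD1.5 mid-block features
teacher = torch.randn(4, 1280)  # real SD1.5 mid-block features
loss = cosine_distill_loss(pred, teacher)

# Perfectly aligned (even rescaled) features give ~zero loss:
assert cosine_distill_loss(teacher * 2.0, teacher) < 1e-5
```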

### Training Loop Example

```python
import copy

import torch
from torch.optim import AdamW
from torch.cuda.amp import autocast, GradScaler

optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.99), weight_decay=0.01)
scaler = GradScaler()

# EMA
ema_decay = 0.9999
ema_model = copy.deepcopy(model)

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()

    with autocast(dtype=torch.bfloat16):
        # Sample timesteps with a logit-normal distribution
        u = torch.randn(batch_size, device=device)
        t = torch.sigmoid(u)      # Logit-normal
        t = flux_shift(t, s=3.0)  # Flux shift

        # Add noise
        noise = torch.randn_like(batch['latents'])
        noisy = t.view(-1, 1, 1) * batch['latents'] + (1 - t.view(-1, 1, 1)) * noise
        target = batch['latents'] - noise  # Flow matching target

        # Forward
        output, expert_info = model(
            hidden_states=noisy,
            encoder_hidden_states=batch['t5_emb'],
            pooled_projections=batch['clip_pooled'],
            timestep=t,
            img_ids=img_ids,
            lune_features=batch.get('sd15_features'),
            return_expert_pred=True,
        )

        # Loss
        losses = model.compute_loss(output, target, expert_info,
                                    lune_features=batch.get('sd15_features'))

    scaler.scale(losses['total']).backward()
    scaler.step(optimizer)
    scaler.update()

    # EMA update
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), ema_model.parameters()):
            p_ema.data.lerp_(p.data, 1 - ema_decay)
```
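The `flux_shift` call in the loop is the Flux-style timestep shift. A minimal sketch, under the assumption that it uses the standard shift formula (endpoints 0 and 1 stay fixed while mid-range timesteps are pushed toward the noisier end):

```python
import torch

def flux_shift(t: torch.Tensor, s: float = 3.0) -> torch.Tensor:
    """Flux-style timestep shift: maps t to s*t / (1 + (s - 1) * t).

    With s > 1 this biases sampling toward higher (noisier) timesteps;
    s = 1 is the identity.
    """
    return s * t / (1 + (s - 1) * t)
```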

### Hyperparameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Optimizer | AdamW | |
| Learning rate | 3e-4 | With cosine schedule |
| Betas | (0.9, 0.99) | |
| Weight decay | 0.01 | |
| Batch size | 32 | 16 × 2 gradient accumulation |
| EMA decay | 0.9999 | |
| Precision | bfloat16 | |
| Timestep shift | s=3.0 | Flux-style |
| Timestep sampling | Logit-normal | |
| Lune weight | 0.1 | |
| Sol weight | 0.05 | |
| Huber delta | 0.1 | |
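The cosine schedule noted next to the learning rate can be sketched as follows; the warmup length and minimum LR here are illustrative assumptions, not values from the training runs:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 3e-4,
              warmup_steps: int = 1000, min_lr: float = 0.0) -> float:
    """Cosine decay with linear warmup (warmup/min_lr are assumed values)."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```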

---

## Checkpoint Conversion

### v3 → v4.1 Conversion

The converter preserves all pretrained weights and initializes new v4.1 components to identity/zero-effect:

**What gets converted:**

| v3 Key | v4.1 Key | Action |
|--------|----------|--------|
| `expert_predictor.*` | `lune_predictor.*` | Rename |
| `expert_gate` (0.5) | `expert_gate` (0.0) | Convert to logit space |
| - | `sol_prior.*` | Initialize (zero-effect) |
| - | `t5_pool.*` | Initialize (Xavier) |
| - | `text_balance` | Initialize (0.0 = 50/50) |
| - | `*.spatial_to_mod.*` | Initialize (zero = identity) |

**Parameter growth:**
- v3: ~244.7M parameters
- v4.1: ~246.4M parameters
- Added: ~1.7M parameters (0.7% increase)
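"Convert to logit space" means the stored v3 gate (a sigmoid-space value such as 0.5) is mapped through an inverse sigmoid, so the v4.1 logit-space parameter reproduces the same effective gate. A sketch of the idea, not the converter's exact code:

```python
import math

def gate_to_logit(gate: float, eps: float = 1e-6) -> float:
    """Inverse sigmoid: a v3 gate of 0.5 becomes a v4.1 logit of 0.0."""
    gate = min(max(gate, eps), 1 - eps)  # clamp away from 0 and 1
    return math.log(gate / (1 - gate))
```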

### Python API

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download converter
converter = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/convert_v3_to_v4.py")
exec(open(converter).read())

# Simple: download, convert, upload
from convert_v3_to_v4 import run
result = run(401434)  # Step number

# With custom config
result = run(401434, config={
    "hidden_size": 512,
    "num_attention_heads": 4,
    "sol_geometric_weight": 0.8,  # More geometric, less learned
})

# From JSON config file
result = run(401434, config="my_config.json")

# Low-level API
from convert_v3_to_v4 import convert_state_dict, analyze_checkpoint, TinyFluxConfig

# Analyze checkpoint version
state_dict = load_file("checkpoint.safetensors")
info = analyze_checkpoint(state_dict)
print(f"Version: {info.version}")  # "v3", "v4.0", "v4.1", etc.
print(f"Has Sol prior: {info.has_sol_prior}")

# Convert state dict
config = TinyFluxConfig()
v4_state, report = convert_state_dict(state_dict, config)
print(f"Renamed {len(report['renamed'])} keys")
print(f"Initialized {len(report['initialized'])} keys")
```

### CLI

```bash
# Basic conversion
python convert_v3_to_v4.py --step 401434

# Local file
python convert_v3_to_v4.py --input model_v3.safetensors

# Analyze only (don't convert)
python convert_v3_to_v4.py --step 401434 --analyze-only

# Custom output
python convert_v3_to_v4.py --step 401434 --output-dir my_converted --name mymodel

# With custom config
python convert_v3_to_v4.py --step 401434 --config my_config.json
```

### Output Structure

```
checkpoint_runs/v4_init/
├── lailah_401434_v4_init.safetensors               # Converted model
├── lailah_401434_v4_init_ema.safetensors           # Fresh EMA (copy of model)
├── lailah_401434_v4_init_ema_secondary.safetensors # Converted old EMA
└── lailah_401434_v4_config.json                    # Config with conversion metadata
```

### Config JSON Format

```json
{
  "hidden_size": 512,
  "num_attention_heads": 4,
  "attention_head_dim": 128,
  "num_double_layers": 15,
  "num_single_layers": 25,
  "use_lune_expert": true,
  "use_sol_prior": true,
  "use_t5_vec": true,
  "sol_geometric_weight": 0.7,
  "lune_distill_mode": "cosine",
  "use_huber_loss": true,
  "huber_delta": 0.1,
  "_conversion_info": {
    "source_step": 401434,
    "source_repo": "AbstractPhil/tiny-flux-deep",
    "source_version": "v3",
    "target_version": "v4.1",
    "source_params": 244690849,
    "target_params": 246347234,
    "params_added": 1656385,
    "converter_version": "4.1.0"
  }
}
```

---

## Repository Structure

```
AbstractPhil/tiny-flux-deep/
│
├── model.safetensors                    # Latest training weights
├── model_ema.safetensors                # EMA weights (use for inference)
├── config.json                          # Model configuration
├── README.md
│
├── scripts/                             # All Python code
│   ├── model_v4.py                      # v4.1 architecture (current)
│   ├── model_v3.py                      # v3 architecture (reference)
│   ├── model_v2.py                      # v2 architecture (legacy)
│   ├── inference_v3.py                  # Full inference pipeline
│   ├── convert_v3_to_v4.py              # Checkpoint converter
│   ├── trainer_v3_expert_guidance.py    # Training with distillation
│   ├── trainer_v2.py                    # Previous trainer
│   ├── trainer.py                       # Original trainer
│   ├── port_tiny_to_deep.py             # TinyFlux → Deep port script
│   └── colab_inference_lailah_early.py  # Simple Colab notebook
│
├── checkpoints/                         # v3 checkpoints (legacy)
│   ├── step_401434.safetensors
│   └── step_401434_ema.safetensors
│
├── checkpoint_runs/                     # Organized checkpoint runs
│   └── v4_init/                         # v4.1 initialization from v3
│       ├── lailah_401434_v4_init.safetensors
│       ├── lailah_401434_v4_init_ema.safetensors
│       ├── lailah_401434_v4_init_ema_secondary.safetensors
│       └── lailah_401434_v4_config.json
│
├── samples/                             # Generated samples per step
│   └── 20260127_074318_step_401434.png
│
└── logs/                                # TensorBoard training logs
    └── run_20260126_220714/
```

---

## API Reference

### TinyFluxDeep

```python
class TinyFluxDeep(nn.Module):
    def __init__(self, config: Optional[TinyFluxConfig] = None):
        """Initialize model with config (uses defaults if None)."""

    def forward(
        self,
        hidden_states: torch.Tensor,                    # [B, N, 16] image latents
        encoder_hidden_states: torch.Tensor,            # [B, L, 768] T5 embeddings
        pooled_projections: torch.Tensor,               # [B, 768] CLIP pooled
        timestep: torch.Tensor,                         # [B] timestep in [0, 1]
        img_ids: torch.Tensor,                          # [N, 3] position IDs
        txt_ids: Optional[torch.Tensor] = None,
        guidance: Optional[torch.Tensor] = None,        # Legacy
        lune_features: Optional[torch.Tensor] = None,   # [B, 1280] SD1.5 features
        sol_stats: Optional[torch.Tensor] = None,       # [B, 3] attention stats
        sol_spatial: Optional[torch.Tensor] = None,     # [B, 8, 8] spatial importance
        expert_features: Optional[torch.Tensor] = None, # Legacy API
        return_expert_pred: bool = False,
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict]]:
        """
        Forward pass.

        Returns:
            output: [B, N, 16] predicted velocity
            expert_info: dict with predictions (if return_expert_pred=True)
        """

    def compute_loss(
        self,
        output: torch.Tensor,
        target: torch.Tensor,
        expert_info: Optional[Dict] = None,
        lune_features: Optional[torch.Tensor] = None,
        sol_stats: Optional[torch.Tensor] = None,
        sol_spatial: Optional[torch.Tensor] = None,
        lune_weight: float = 0.1,
        sol_weight: float = 0.05,
        use_huber: bool = True,
        huber_delta: float = 0.1,
        lune_distill_mode: str = "cosine",
        spatial_weighting: bool = True,
    ) -> Dict[str, torch.Tensor]:
        """Compute combined loss with distillation."""

    @staticmethod
    def create_img_ids(batch_size: int, height: int, width: int, device) -> torch.Tensor:
        """Create image position IDs for RoPE."""

    @staticmethod
    def create_txt_ids(text_len: int, device) -> torch.Tensor:
        """Create text position IDs."""

    def count_parameters(self) -> Dict[str, int]:
        """Count parameters by component."""
```
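For context on the `img_ids` argument: it holds per-patch position indices for RoPE. A standalone sketch of the same layout, assuming the (0, row, column) convention used by Flux and ignoring batch handling (`make_img_ids` is a hypothetical helper, not the model's `create_img_ids`):

```python
import torch

def make_img_ids(height: int, width: int, device="cpu") -> torch.Tensor:
    """[H*W, 3] position IDs: column 1 holds the row, column 2 the column."""
    ids = torch.zeros(height, width, 3, device=device)
    ids[..., 1] = torch.arange(height, device=device)[:, None]  # row index
    ids[..., 2] = torch.arange(width, device=device)[None, :]   # column index
    return ids.reshape(height * width, 3)
```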

### Converter Functions

```python
# High-level
def run(step, name="lailah", config=None, ...):
    """One-liner: download, convert, upload."""

def convert_checkpoint(step=None, input_path=None, config=None, ...) -> ConversionResult:
    """Convert checkpoint with full control."""

# Low-level
def analyze_checkpoint(state_dict) -> CheckpointInfo:
    """Analyze checkpoint version and contents."""

def convert_state_dict(state_dict, config=None) -> Tuple[Dict, Dict]:
    """Convert state dict, return (new_state, report)."""

def download_from_hf(step, repo_id, ...) -> Tuple[str, str]:
    """Download checkpoint from HuggingFace."""

# Config
class TinyFluxConfig:
    def to_dict(self) -> Dict
    def from_dict(cls, d) -> TinyFluxConfig
    def from_json(cls, path) -> TinyFluxConfig
    def save_json(self, path, metadata=None)
    def validate_checkpoint(self, state_dict) -> List[str]
```

---

## Samples

### Step 401434 (v3 weights)

**"subject, animal, cat, photograph of a tiger, natural habitat"**

![tiger](./samples/tiger_sample.png)

**"subject, bird, blue beak, red eyes, green claws"**

![bird](./samples/bird_sample.png)

**"subject, bird, red haired bird in a tree"**

![red bird](./samples/red_bird_sample.png)

---

## Limitations

| Limitation | Details |
|------------|---------|
| **Resolution** | 512×512 only (64×64 latent space) |
| **Text encoder** | flan-t5-base (768 dim) vs Flux's T5-XXL (4096 dim) |
| **Attention heads** | 4 heads vs Flux's 24, which limits capacity |
| **Training data** | Teacher latents, not real images |
| **v4.1 status** | New architecture; training is just starting |
| **Artifacts** | Expect imperfections; this is a research model |

---

## Name

**Lailah** (לילה): In Jewish tradition, the angel of the night who guards souls and teaches wisdom to the unborn. Chosen for this model's role as a smaller guardian exploring the same latent space as larger models, learning from their knowledge while finding its own path.

---

## Citation

```bibtex
@misc{tinyfluxdeep2026,
  title={TinyFlux-Deep: Compact Flow Matching with Dual Expert Distillation},
  author={AbstractPhil},
  year={2026},
  howpublished={\url{https://huggingface.co/AbstractPhil/tiny-flux-deep}},
  note={246M parameter text-to-image model with Lune trajectory guidance and Sol attention priors}
}
```

---

## Related Projects

| Project | Description |
|---------|-------------|
| [AbstractPhil/tiny-flux](https://huggingface.co/AbstractPhil/tiny-flux) | Original TinyFlux (10.7M params) |
| [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents) | Training dataset |
| [AbstractPhil/imagenet-synthetic](https://huggingface.co/datasets/AbstractPhil/imagenet-synthetic) | ImageNet-style synthetic data |
| [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | Teacher model |

---

## License

MIT License - free for research and commercial use.

---

**Status**: v4.1 architecture complete. Converting v3 checkpoints and resuming training with dual expert distillation.