File size: 31,459 Bytes

---
license: mit
language:
- en
tags:
- diffusion
- flow-matching
- flux
- text-to-image
- image-generation
- tinyflux
- lailah
- experimental
library_name: pytorch
pipeline_tag: text-to-image
base_model:
- AbstractPhil/tiny-flux
- black-forest-labs/FLUX.1-schnell
datasets:
- AbstractPhil/flux-schnell-teacher-latents
- AbstractPhil/imagenet-synthetic
---

# TinyFlux-Deep v4.1 (Lailah)

A compact **246M parameter** flow-matching diffusion model that distills knowledge from multiple teacher models into an efficient architecture. TinyFlux-Deep uses a dual expert system to capture both trajectory dynamics (from SD1.5) and structural attention patterns (from a geometric prior), enabling high-quality image generation at a fraction of the compute cost of full-scale models.

## Table of Contents

- [Key Features](#key-features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [Dual Expert System](#dual-expert-system)
- [Configuration](#configuration)
- [Inference](#inference)
- [Training](#training)
- [Checkpoint Conversion](#checkpoint-conversion)
- [Repository Structure](#repository-structure)
- [API Reference](#api-reference)
- [Samples](#samples)
- [Limitations](#limitations)
- [Citation](#citation)

---

## Key Features

| Feature | Description |
|---------|-------------|
| **Compact Size** | 246M params (~500MB bf16) vs Flux's 12B (~24GB) |
| **Dual Expert Distillation** | Learns from both SD1.5 trajectory features and geometric attention priors |
| **Flow Matching** | Rectified flow objective with Flux-style timestep shifting |
| **T5 + CLIP Conditioning** | Dual text encoder pathway with learnable balance |
| **Huber Loss** | Robust training that handles outliers gracefully |
| **Identity-Init Conversion** | v3→v4 conversion preserves pretrained weights exactly |

---

## Quick Start

### Colab Inference

```python
!pip install torch transformers safetensors huggingface_hub accelerate

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download model code and weights
model_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/model_v4.py")
weights = hf_hub_download("AbstractPhil/tiny-flux-deep", "model.safetensors")

# Load model
exec(open(model_py).read())
config = TinyFluxConfig()
model = TinyFluxDeep(config).to("cuda", torch.bfloat16)
model.load_state_dict(load_file(weights), strict=False)
model.eval()

# For full inference pipeline with text encoders and sampling:
inference_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/inference_v3.py")
exec(open(inference_py).read())
# Then call: image = generate("your prompt here")
```

### Minimal Generation Loop

```python
import torch
import torch.nn.functional as F

def flux_shift(t, s=3.0):
    """Flux-style timestep shifting - biases toward data end."""
    return s * t / (1 + (s - 1) * t)

def generate(model, t5_emb, clip_emb, clip_pooled, num_steps=25, cfg_scale=4.0):
    """Euler sampling with classifier-free guidance."""
    device = next(model.parameters()).device
    dtype = next(model.parameters()).dtype
    
    # Start from pure noise (t=0)
    x = torch.randn(1, 64*64, 16, device=device, dtype=dtype)
    img_ids = TinyFluxDeep.create_img_ids(1, 64, 64, device)
    
    # Rectified flow: integrate from t=0 (noise) to t=1 (data)
    timesteps = flux_shift(torch.linspace(0, 1, num_steps + 1, device=device))
    
    for i in range(num_steps):
        t_curr = timesteps[i]
        t_next = timesteps[i + 1]
        dt = t_next - t_curr
        
        t_batch = t_curr.expand(1)
        
        # Conditional prediction
        v_cond = model(
            hidden_states=x,
            encoder_hidden_states=t5_emb,
            pooled_projections=clip_pooled,
            timestep=t_batch,
            img_ids=img_ids,
        )
        
        # Unconditional prediction (for CFG)
        v_uncond = model(
            hidden_states=x,
            encoder_hidden_states=torch.zeros_like(t5_emb),
            pooled_projections=torch.zeros_like(clip_pooled),
            timestep=t_batch,
            img_ids=img_ids,
        )
        
        # Classifier-free guidance
        v = v_uncond + cfg_scale * (v_cond - v_uncond)
        
        # Euler step
        x = x + v * dt
    
    return x  # [1, 4096, 16] - decode with VAE
```

---

## Architecture

### Model Comparison

| Component | TinyFlux | TinyFlux-Deep v3 | TinyFlux-Deep v4.1 | Flux-Schnell |
|-----------|----------|------------------|--------------------| -------------|
| Hidden size | 256 | 512 | 512 | 3072 |
| Attention heads | 2 | 4 | 4 | 24 |
| Head dimension | 128 | 128 | 128 | 128 |
| Double-stream layers | 3 | 15 | 15 | 19 |
| Single-stream layers | 3 | 25 | 25 | 38 |
| MLP ratio | 4.0 | 4.0 | 4.0 | 4.0 |
| RoPE dims | (16,56,56) | (16,56,56) | (16,56,56) | (16,56,56) |
| Lune Expert | ✗ | ✓ | ✓ | ✗ |
| Sol Attention Prior | ✗ | ✗ | ✓ | ✗ |
| T5 Vec Enhancement | ✗ | ✗ | ✓ | ✗ |
| **Total Parameters** | ~10.7M | ~244.7M | ~246.4M | ~12B |
| **Memory (bf16)** | ~22MB | ~490MB | ~493MB | ~24GB |

### Block Structure

**Double-Stream Blocks (15 layers):**
- Separate text and image pathways
- Joint attention between modalities
- AdaLN-Zero conditioning from vec
- Sol spatial modulation on image Q/K only

**Single-Stream Blocks (25 layers):**
- Concatenated text + image sequence
- Full self-attention with RoPE
- Sol modulation skips text tokens

```
Input: img_latents [B, 4096, 16], t5_emb [B, 77, 768], clip_pooled [B, 768]
                                    │
                    ┌───────────────┴───────────────┐
                    ▼                               ▼
              img_in (Linear)                 txt_in (Linear)
                    │                               │
                    ▼                               ▼
              [B, 4096, 512]                  [B, 77, 512]
                    │                               │
                    └───────────┬───────────────────┘
                                │
            vec = time_emb + clip_vec + t5_vec + lune_signal
                                │
                    ┌───────────┴───────────┐
                    ▼                       ▼
              Double Blocks (×15)     Sol Prior → temperature, spatial_mod
                    │                       │
                    ▼                       │
              Single Blocks (×25) ◄─────────┘
                    │
                    ▼
              final_norm → final_linear
                    │
                    ▼
              Output: [B, 4096, 16]
```

---

## Dual Expert System

TinyFlux-Deep v4.1 uses two complementary expert pathways to inject knowledge from teacher models without the "twin-tail paradox" (mixing incompatible prediction targets).

### Lune Expert Predictor (Trajectory Guidance)

**Purpose:** Captures SD1.5's understanding of "how denoising should flow" - the trajectory through latent space.

**Architecture:**
```python
LuneExpertPredictor(
    time_dim=512,        # From timestep MLP
    clip_dim=768,        # CLIP pooled features
    expert_dim=1280,     # SD1.5 mid-block dimension (prediction target)
    hidden_dim=512,      # Internal MLP width
    output_dim=512,      # Output added to vec
    dropout=0.1,
)
```

**How it works:**
1. Concatenates timestep embedding + CLIP pooled → hidden
2. Predicts what SD1.5's mid-block features would be
3. During training: uses real SD1.5 features when available
4. During inference: uses predicted features
5. Gates output with learnable sigmoid (init 0.5)
6. Adds `expert_signal` to global `vec` conditioning

**Training signal:** Cosine similarity loss against real SD1.5 UNet mid-block features (soft directional matching, not exact reconstruction).

### Sol Attention Prior (Structural Guidance)

**Purpose:** Captures geometric/structural knowledge about WHERE attention should focus, without injecting incompatible features.

**Key insight:** Sol (a V-prediction DDPM model) has valuable attention patterns, but its features are incompatible with TinyFlux's linear flow matching. We extract attention *statistics* instead:
- **Locality:** How local vs global is attention?
- **Entropy:** How focused vs diffuse?
- **Clustering:** How structured vs uniform?
- **Spatial importance:** Which regions matter most?

**Architecture:**
```python
SolAttentionPrior(
    time_dim=512,
    clip_dim=768,
    hidden_dim=256,
    num_heads=4,           # Matches TinyFlux attention heads
    spatial_size=8,        # 8×8 importance map
    geometric_weight=0.7,  # David's 70/30 split
)
```

**How it works:**
1. **Geometric prior (70%):** Timestep-based heuristics
   - Early denoising (high t): Higher temperature → softer, global attention
   - Late denoising (low t): Lower temperature → sharper, local attention
   - Spatial: Uniform early, center-biased late

2. **Learned prior (30%):** Content-based predictions
   - Predicts attention statistics from (timestep, CLIP)
   - Predicts spatial importance map

3. **Blending:** `blend * geometric + (1-blend) * learned` with learnable blend gate

4. **Output application:**
   - `temperature [B, 4]`: Scales attention logits per head
   - `spatial_mod [B, H, W]`: Modulates Q/K at each position via `exp(conv(spatial))`

**Identity initialization:** All Sol components initialize to zero-effect:
- `spatial_to_mod` Conv2d: zero weight, zero bias → `exp(0) = 1` (identity)
- Allows gradual learning without disrupting pretrained attention

### T5 Vec Enhancement

**Purpose:** Adds T5's semantic understanding to the global conditioning pathway (previously only CLIP pooled).

```python
# Attention-weighted pooling of T5 sequence
t5_attn = softmax(t5_emb.mean(dim=-1))  # [B, 77]
t5_pooled = (t5_emb * t5_attn.unsqueeze(-1)).sum(dim=1)  # [B, 768]
t5_vec = t5_pool_mlp(t5_pooled)  # [B, 512]

# Learnable balance between CLIP and T5
balance = sigmoid(text_balance)  # Initialized to 0.5
text_vec = balance * clip_vec + (1 - balance) * t5_vec
```

---

## Configuration

### TinyFluxConfig

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TinyFluxConfig:
    # Core architecture
    hidden_size: int = 512
    num_attention_heads: int = 4
    attention_head_dim: int = 128  # hidden_size = heads × head_dim
    in_channels: int = 16          # VAE latent channels
    patch_size: int = 1
    joint_attention_dim: int = 768  # T5 embedding dim
    pooled_projection_dim: int = 768  # CLIP pooled dim
    num_double_layers: int = 15
    num_single_layers: int = 25
    mlp_ratio: float = 4.0
    axes_dims_rope: Tuple[int, int, int] = (16, 56, 56)  # Must sum to head_dim
    
    # Lune expert predictor
    use_lune_expert: bool = True
    lune_expert_dim: int = 1280    # SD1.5 mid-block dim
    lune_hidden_dim: int = 512
    lune_dropout: float = 0.1
    
    # Sol attention prior
    use_sol_prior: bool = True
    sol_spatial_size: int = 8      # 8×8 spatial importance map
    sol_hidden_dim: int = 256
    sol_geometric_weight: float = 0.7  # 70% geometric, 30% learned
    
    # T5 enhancement
    use_t5_vec: bool = True
    t5_pool_mode: str = "attention"  # "attention", "mean", "cls"
    
    # Loss configuration
    lune_distill_mode: str = "cosine"  # "hard", "soft", "cosine", "huber"
    use_huber_loss: bool = True
    huber_delta: float = 0.1
    
    # Legacy compatibility
    guidance_embeds: bool = False
```

### Loading from JSON

```python
# From file
config = TinyFluxConfig.from_json("lailah_401434_v4_config.json")

# From dict
config = TinyFluxConfig.from_dict({
    "hidden_size": 512,
    "num_attention_heads": 4,
    ...
})

# Save with metadata
config.save_json("config.json", metadata={"source_step": 401434})
```

### Validation

```python
# Config validates constraints on creation
config = TinyFluxConfig(hidden_size=512, num_attention_heads=4, attention_head_dim=128)
# ✓ OK: 512 = 4 × 128

config = TinyFluxConfig(hidden_size=512, num_attention_heads=4, attention_head_dim=64)
# ✗ ValueError: hidden_size (512) must equal num_attention_heads * attention_head_dim (256)

# Validate checkpoint compatibility
warnings = config.validate_checkpoint(state_dict)
if warnings:
    print("Warnings:", warnings)
```

---

## Inference

### Full Pipeline

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Load text encoders
t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
t5_model = T5EncoderModel.from_pretrained("google/flan-t5-base").to("cuda", torch.bfloat16)

clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda", torch.bfloat16)

# Load VAE
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", 
    subfolder="vae",
    torch_dtype=torch.bfloat16
).to("cuda")

# Load TinyFlux-Deep
model_py = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/model_v4.py")
exec(open(model_py).read())

config = TinyFluxConfig()
model = TinyFluxDeep(config).to("cuda", torch.bfloat16)
weights = load_file(hf_hub_download("AbstractPhil/tiny-flux-deep", "model.safetensors"))
model.load_state_dict(weights, strict=False)
model.eval()

def encode_prompt(prompt):
    """Encode prompt with both T5 and CLIP."""
    # T5
    t5_tokens = t5_tokenizer(prompt, return_tensors="pt", padding="max_length", 
                              max_length=77, truncation=True).to("cuda")
    with torch.no_grad():
        t5_emb = t5_model(**t5_tokens).last_hidden_state.to(torch.bfloat16)
    
    # CLIP
    clip_tokens = clip_tokenizer(prompt, return_tensors="pt", padding="max_length",
                                  max_length=77, truncation=True).to("cuda")
    with torch.no_grad():
        clip_out = clip_model(**clip_tokens)
        clip_pooled = clip_out.pooler_output.to(torch.bfloat16)
    
    return t5_emb, clip_pooled

def generate_image(prompt, num_steps=25, cfg_scale=4.0, seed=None):
    """
    Euler sampling for rectified flow.
    
    Flow: x_t = (1-t)*noise + t*data
    Integrate from t=0 (noise) to t=1 (data)
    """
    if seed is not None:
        torch.manual_seed(seed)
    
    t5_emb, clip_pooled = encode_prompt(prompt)
    
    # Null embeddings for CFG
    t5_null, clip_null = encode_prompt("")
    
    # Start from pure noise (t=0)
    x = torch.randn(1, 64*64, 16, device="cuda", dtype=torch.bfloat16)
    img_ids = TinyFluxDeep.create_img_ids(1, 64, 64, "cuda")
    
    # Rectified flow: 0 → 1 with Flux shift
    def flux_shift(t, s=3.0):
        return s * t / (1 + (s - 1) * t)
    
    timesteps = flux_shift(torch.linspace(0, 1, num_steps + 1, device="cuda"))
    
    with torch.no_grad():
        for i in range(num_steps):
            t = timesteps[i].expand(1)
            dt = timesteps[i + 1] - timesteps[i]  # Positive
            
            # Conditional
            v_cond = model(x, t5_emb, clip_pooled, t, img_ids)
            
            # Unconditional
            v_uncond = model(x, t5_null, clip_null, t, img_ids)
            
            # CFG
            v = v_uncond + cfg_scale * (v_cond - v_uncond)
            
            # Euler step
            x = x + v * dt
    
    # Decode with VAE
    x = x.reshape(1, 64, 64, 16).permute(0, 3, 1, 2)  # [B, C, H, W]
    x = x / vae.config.scaling_factor
    with torch.no_grad():
        image = vae.decode(x).sample
    
    # Convert to PIL
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image[0].permute(1, 2, 0).cpu().float().numpy()
    image = (image * 255).astype("uint8")
    
    from PIL import Image
    return Image.fromarray(image)

# Generate!
image = generate_image("a photograph of a tiger in natural habitat", seed=42)
image.save("tiger.png")
```

### Batch Inference

```python
def generate_batch(prompts, **kwargs):
    """Generate multiple images."""
    return [generate_image(p, **kwargs) for p in prompts]

images = generate_batch([
    "a red bird with blue beak",
    "a mountain landscape at sunset",
    "an astronaut riding a horse",
], num_steps=25, cfg_scale=4.0)
```

---

## Training

### Loss Computation

```python
# Forward pass with expert info
output, expert_info = model(
    hidden_states=noisy_latents,
    encoder_hidden_states=t5_emb,
    pooled_projections=clip_pooled,
    timestep=timesteps,
    img_ids=img_ids,
    lune_features=sd15_midblock_features,  # From SD1.5 teacher
    sol_stats=sol_attention_stats,          # From Sol teacher (optional)
    sol_spatial=sol_spatial_importance,     # From Sol teacher (optional)
    return_expert_pred=True,
)

# Compute loss
losses = model.compute_loss(
    output=output,
    target=flow_target,  # data - noise for flow matching
    expert_info=expert_info,
    lune_features=sd15_midblock_features,
    sol_stats=sol_attention_stats,
    sol_spatial=sol_spatial_importance,
    
    # Loss weights
    lune_weight=0.1,      # Weight for Lune distillation
    sol_weight=0.05,      # Weight for Sol distillation
    
    # Loss options
    use_huber=True,       # Huber loss for main objective (robust to outliers)
    huber_delta=0.1,      # Huber delta (smaller = tighter MSE region)
    lune_distill_mode="cosine",  # "hard", "soft", "cosine", "huber"
    spatial_weighting=True,  # Weight loss by Sol spatial importance
)

# losses dict contains:
# - main: flow matching loss
# - lune_distill: Lune prediction loss
# - sol_stat_distill: Sol statistics prediction loss
# - sol_spatial_distill: Sol spatial prediction loss
# - total: weighted sum

loss = losses['total']
loss.backward()
```

### Distillation Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `"hard"` | MSE against teacher features | Exact reconstruction |
| `"soft"` | Temperature-scaled MSE | Softer matching |
| `"cosine"` | Cosine similarity loss | Directional alignment (recommended) |
| `"huber"` | Huber loss on features | Robust to outliers |

### Training Loop Example

```python
from torch.optim import AdamW
from torch.cuda.amp import autocast, GradScaler

optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.99), weight_decay=0.01)
scaler = GradScaler()

# EMA
ema_decay = 0.9999
ema_model = copy.deepcopy(model)

for step, batch in enumerate(dataloader):
    optimizer.zero_grad()
    
    with autocast(dtype=torch.bfloat16):
        # Sample timesteps with logit-normal distribution
        u = torch.randn(batch_size, device=device)
        t = torch.sigmoid(u)  # Logit-normal
        t = flux_shift(t, s=3.0)  # Flux shift
        
        # Add noise
        noise = torch.randn_like(batch['latents'])
        noisy = t.view(-1,1,1) * batch['latents'] + (1-t.view(-1,1,1)) * noise
        target = batch['latents'] - noise  # Flow matching target
        
        # Forward
        output, expert_info = model(
            hidden_states=noisy,
            encoder_hidden_states=batch['t5_emb'],
            pooled_projections=batch['clip_pooled'],
            timestep=t,
            img_ids=img_ids,
            lune_features=batch.get('sd15_features'),
            return_expert_pred=True,
        )
        
        # Loss
        losses = model.compute_loss(output, target, expert_info, 
                                     lune_features=batch.get('sd15_features'))
    
    scaler.scale(losses['total']).backward()
    scaler.step(optimizer)
    scaler.update()
    
    # EMA update
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), ema_model.parameters()):
            p_ema.data.lerp_(p.data, 1 - ema_decay)
```

### Hyperparameters

| Parameter | Value | Notes |
|-----------|-------|-------|
| Optimizer | AdamW | |
| Learning rate | 3e-4 | With cosine schedule |
| Betas | (0.9, 0.99) | |
| Weight decay | 0.01 | |
| Batch size | 32 | 16 × 2 gradient accumulation |
| EMA decay | 0.9999 | |
| Precision | bfloat16 | |
| Timestep shift | s=3.0 | Flux-style |
| Timestep sampling | Logit-normal | |
| Lune weight | 0.1 | |
| Sol weight | 0.05 | |
| Huber delta | 0.1 | |

---

## Checkpoint Conversion

### v3 → v4.1 Conversion

The converter preserves all pretrained weights and initializes new v4.1 components to identity/zero-effect:

**What gets converted:**
| v3 Key | v4.1 Key | Action |
|--------|----------|--------|
| `expert_predictor.*` | `lune_predictor.*` | Rename |
| `expert_gate` (0.5) | `expert_gate` (0.0) | Convert to logit space |
| - | `sol_prior.*` | Initialize (zero-effect) |
| - | `t5_pool.*` | Initialize (Xavier) |
| - | `text_balance` | Initialize (0.0 = 50/50) |
| - | `*.spatial_to_mod.*` | Initialize (zero = identity) |

**Parameter growth:**
- v3: ~244.7M parameters
- v4.1: ~246.4M parameters  
- Added: ~1.7M parameters (0.7% increase)

### Python API

```python
from huggingface_hub import hf_hub_download

# Download converter
converter = hf_hub_download("AbstractPhil/tiny-flux-deep", "scripts/convert_v3_to_v4.py")
exec(open(converter).read())

# Simple: download, convert, upload
from convert_v3_to_v4 import run
result = run(401434)  # Step number

# With custom config
result = run(401434, config={
    "hidden_size": 512,
    "num_attention_heads": 4,
    "sol_geometric_weight": 0.8,  # More geometric, less learned
})

# From JSON config file
result = run(401434, config="my_config.json")

# Low-level API
from convert_v3_to_v4 import convert_state_dict, analyze_checkpoint, TinyFluxConfig

# Analyze checkpoint version
state_dict = load_file("checkpoint.safetensors")
info = analyze_checkpoint(state_dict)
print(f"Version: {info.version}")  # "v3", "v4.0", "v4.1", etc.
print(f"Has Sol prior: {info.has_sol_prior}")

# Convert state dict
config = TinyFluxConfig()
v4_state, report = convert_state_dict(state_dict, config)
print(f"Renamed {len(report['renamed'])} keys")
print(f"Initialized {len(report['initialized'])} keys")
```

### CLI

```bash
# Basic conversion
python convert_v3_to_v4.py --step 401434

# Local file
python convert_v3_to_v4.py --input model_v3.safetensors

# Analyze only (don't convert)
python convert_v3_to_v4.py --step 401434 --analyze-only

# Custom output
python convert_v3_to_v4.py --step 401434 --output-dir my_converted --name mymodel

# With custom config
python convert_v3_to_v4.py --step 401434 --config my_config.json
```

### Output Structure

```
checkpoint_runs/v4_init/
├── lailah_401434_v4_init.safetensors           # Converted model
├── lailah_401434_v4_init_ema.safetensors       # Fresh EMA (copy of model)
├── lailah_401434_v4_init_ema_secondary.safetensors  # Converted old EMA
└── lailah_401434_v4_config.json                # Config with conversion metadata
```

### Config JSON Format

```json
{
  "hidden_size": 512,
  "num_attention_heads": 4,
  "attention_head_dim": 128,
  "num_double_layers": 15,
  "num_single_layers": 25,
  "use_lune_expert": true,
  "use_sol_prior": true,
  "use_t5_vec": true,
  "sol_geometric_weight": 0.7,
  "lune_distill_mode": "cosine",
  "use_huber_loss": true,
  "huber_delta": 0.1,
  "_conversion_info": {
    "source_step": 401434,
    "source_repo": "AbstractPhil/tiny-flux-deep",
    "source_version": "v3",
    "target_version": "v4.1",
    "source_params": 244690849,
    "target_params": 246347234,
    "params_added": 1656385,
    "converter_version": "4.1.0"
  }
}
```

---

## Repository Structure

```
AbstractPhil/tiny-flux-deep/
│
├── model.safetensors                          # Latest training weights
├── model_ema.safetensors                      # EMA weights (use for inference)
├── config.json                                # Model configuration
├── README.md
│
├── scripts/                                   # All Python code
│   ├── model_v4.py                            # v4.1 architecture (current)
│   ├── model_v3.py                            # v3 architecture (reference)
│   ├── model_v2.py                            # v2 architecture (legacy)
│   ├── inference_v3.py                        # Full inference pipeline
│   ├── convert_v3_to_v4.py                    # Checkpoint converter
│   ├── trainer_v3_expert_guidance.py          # Training with distillation
│   ├── trainer_v2.py                          # Previous trainer
│   ├── trainer.py                             # Original trainer
│   ├── port_tiny_to_deep.py                   # TinyFlux → Deep port script
│   └── colab_inference_lailah_early.py        # Simple Colab notebook
│
├── checkpoints/                               # v3 checkpoints (legacy)
│   ├── step_401434.safetensors
│   └── step_401434_ema.safetensors
│
├── checkpoint_runs/                           # Organized checkpoint runs
│   └── v4_init/                               # v4.1 initialization from v3
│       ├── lailah_401434_v4_init.safetensors
│       ├── lailah_401434_v4_init_ema.safetensors
│       ├── lailah_401434_v4_init_ema_secondary.safetensors
│       └── lailah_401434_v4_config.json
│
├── samples/                                   # Generated samples per step
│   └── 20260127_074318_step_401434.png
│
└── logs/                                      # TensorBoard training logs
    └── run_20260126_220714/
```

---

## API Reference

### TinyFluxDeep

```python
class TinyFluxDeep(nn.Module):
    def __init__(self, config: Optional[TinyFluxConfig] = None):
        """Initialize model with config (uses defaults if None)."""
    
    def forward(
        self,
        hidden_states: torch.Tensor,        # [B, N, 16] image latents
        encoder_hidden_states: torch.Tensor, # [B, L, 768] T5 embeddings
        pooled_projections: torch.Tensor,    # [B, 768] CLIP pooled
        timestep: torch.Tensor,              # [B] timestep in [0, 1]
        img_ids: torch.Tensor,               # [N, 3] position IDs
        txt_ids: Optional[torch.Tensor] = None,
        guidance: Optional[torch.Tensor] = None,  # Legacy
        lune_features: Optional[torch.Tensor] = None,  # [B, 1280] SD1.5 features
        sol_stats: Optional[torch.Tensor] = None,      # [B, 3] attention stats
        sol_spatial: Optional[torch.Tensor] = None,    # [B, 8, 8] spatial importance
        expert_features: Optional[torch.Tensor] = None,  # Legacy API
        return_expert_pred: bool = False,
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, Dict]]:
        """
        Forward pass.
        
        Returns:
            output: [B, N, 16] predicted velocity
            expert_info: dict with predictions (if return_expert_pred=True)
        """
    
    def compute_loss(
        self,
        output: torch.Tensor,
        target: torch.Tensor,
        expert_info: Optional[Dict] = None,
        lune_features: Optional[torch.Tensor] = None,
        sol_stats: Optional[torch.Tensor] = None,
        sol_spatial: Optional[torch.Tensor] = None,
        lune_weight: float = 0.1,
        sol_weight: float = 0.05,
        use_huber: bool = True,
        huber_delta: float = 0.1,
        lune_distill_mode: str = "cosine",
        spatial_weighting: bool = True,
    ) -> Dict[str, torch.Tensor]:
        """Compute combined loss with distillation."""
    
    @staticmethod
    def create_img_ids(batch_size: int, height: int, width: int, device) -> torch.Tensor:
        """Create image position IDs for RoPE."""
    
    @staticmethod  
    def create_txt_ids(text_len: int, device) -> torch.Tensor:
        """Create text position IDs."""
    
    def count_parameters(self) -> Dict[str, int]:
        """Count parameters by component."""
```

### Converter Functions

```python
# High-level
def run(step, name="lailah", config=None, ...):
    """One-liner: download, convert, upload."""

def convert_checkpoint(step=None, input_path=None, config=None, ...) -> ConversionResult:
    """Convert checkpoint with full control."""

# Low-level
def analyze_checkpoint(state_dict) -> CheckpointInfo:
    """Analyze checkpoint version and contents."""

def convert_state_dict(state_dict, config=None) -> Tuple[Dict, Dict]:
    """Convert state dict, return (new_state, report)."""

def download_from_hf(step, repo_id, ...) -> Tuple[str, str]:
    """Download checkpoint from HuggingFace."""

# Config
class TinyFluxConfig:
    def to_dict(self) -> Dict
    def from_dict(cls, d) -> TinyFluxConfig
    def from_json(cls, path) -> TinyFluxConfig
    def save_json(self, path, metadata=None)
    def validate_checkpoint(self, state_dict) -> List[str]
```

---

## Samples

### Step 401434 (v3 weights)

**"subject, animal, cat, photograph of a tiger, natural habitat"**

![tiger](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/uJ9Ffh780iLgEIJhmafod.png)

**"subject, bird, blue beak, red eyes, green claws"**

![bird1](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/GRS5tyaFFa0HV2xSJCsin.png)

**"subject, bird, red haired bird in a tree"**

![bird2](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/rGourHokJsPtYNnoFi3Eq.png)

---

## Limitations

| Limitation | Details |
|------------|---------|
| **Resolution** | 512×512 only (64×64 latent space) |
| **Text encoder** | flan-t5-base (768 dim) vs Flux's T5-XXL (4096 dim) |
| **Attention heads** | 4 heads vs Flux's 24 - limits capacity |
| **Training data** | Teacher latents, not real images |
| **v4.1 status** | New architecture, training just starting |
| **Artifacts** | Expect imperfections - research model |

---

## Name

**Lailah** (לילה) — In Jewish tradition, the angel of the night who guards souls and teaches wisdom to the unborn. Chosen for this model's role as a smaller guardian exploring the same latent space as larger models, learning from their knowledge while finding its own path.

---

## Citation

```bibtex
@misc{tinyfluxdeep2026,
  title={TinyFlux-Deep: Compact Flow Matching with Dual Expert Distillation},
  author={AbstractPhil},
  year={2026},
  howpublished={\url{https://huggingface.co/AbstractPhil/tiny-flux-deep}},
  note={246M parameter text-to-image model with Lune trajectory guidance and Sol attention priors}
}
```

---

## Related Projects

| Project | Description |
|---------|-------------|
| [AbstractPhil/tiny-flux](https://huggingface.co/AbstractPhil/tiny-flux) | Original TinyFlux (10.7M params) |
| [AbstractPhil/flux-schnell-teacher-latents](https://huggingface.co/datasets/AbstractPhil/flux-schnell-teacher-latents) | Training dataset |
| [AbstractPhil/imagenet-synthetic](https://huggingface.co/datasets/AbstractPhil/imagenet-synthetic) | ImageNet-style synthetic data |
| [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) | Teacher model |

---

## License

MIT License - free for research and commercial use.

---

**Status**: v4.1 architecture complete. Converting v3 checkpoints and resuming training with dual expert distillation.