nbagel commited on Dec 5, 2025

Commit

4dec1ca

verified ·

1 Parent(s): 2581971

Initial upload: Paris MoE inference code and weights

Browse files

Files changed (41) hide show

.gitattributes +2 -0
.ipynb_checkpoints/instructions-checkpoint.txt +1 -0
.ipynb_checkpoints/test_inference-checkpoint.png +3 -0
.ipynb_checkpoints/test_int8-checkpoint.png +3 -0
README.md +154 -0
__pycache__/generate.cpython-312.pyc +0 -0
benchmark.py +440 -0
benchmark_results.md +41 -0
generate.py +747 -0
instructions.txt +1 -0
quantize.py +435 -0
requirements.txt +7 -0
src/__init__.py +1 -0
src/__pycache__/config.cpython-312.pyc +0 -0
src/__pycache__/models.cpython-312.pyc +0 -0
src/__pycache__/schedules.cpython-312.pyc +0 -0
src/__pycache__/vae_utils.cpython-312.pyc +0 -0
src/config.py +199 -0
src/models.py +1913 -0
src/schedules.py +166 -0
src/vae_utils.py +186 -0
weights/bf16/config.pt +3 -0
weights/bf16/expert_0.safetensors +3 -0
weights/bf16/expert_1.safetensors +3 -0
weights/bf16/expert_2.safetensors +3 -0
weights/bf16/expert_3.safetensors +3 -0
weights/bf16/expert_4.safetensors +3 -0
weights/bf16/expert_5.safetensors +3 -0
weights/bf16/expert_6.safetensors +3 -0
weights/bf16/expert_7.safetensors +3 -0
weights/bf16/router.safetensors +3 -0
weights/bf16/router_config.pt +3 -0
weights/int8/expert_0.safetensors +3 -0
weights/int8/expert_1.safetensors +3 -0
weights/int8/expert_2.safetensors +3 -0
weights/int8/expert_3.safetensors +3 -0
weights/int8/expert_4.safetensors +3 -0
weights/int8/expert_5.safetensors +3 -0
weights/int8/expert_6.safetensors +3 -0
weights/int8/expert_7.safetensors +3 -0
weights/int8/router.safetensors +3 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+.ipynb_checkpoints/test_inference-checkpoint.png filter=lfs diff=lfs merge=lfs -text
+.ipynb_checkpoints/test_int8-checkpoint.png filter=lfs diff=lfs merge=lfs -text

.ipynb_checkpoints/instructions-checkpoint.txt ADDED Viewed

	@@ -0,0 +1 @@

+ What we're now is we're going to prepare an inference folder and we're going to make an inference repository for our Paris model. This will include, we will just stick with int8 and bfloat16 and mixed int8, bfloat16 for now. And this repository will include efficient methods how to run the code. It will include quantization code that can accept either PT or save tensors or Float32 save tensors or Float32 PT. It will include a lot of different methods that can accept either PT or Float32 PT. We need to make a visualizer next which outputs a little pretty ASCII chart. We should output the ASCII chart right on the terminal every time we run the inference via this tool. Let's just say we're running the int8 inference of the mixed int8 model. By the way, we're also going to put the weights that we quantized inside this inference folder because we're going to publish this on HuggingFace. have just again, the beef flow 16 and intake weights. we might already be done this by the way. But again, I wanted to do that when we have to keep some kind of track and output a chart in the terminal, like as a little terminal visualization in ASCII. MAKE SURE WE'RE DOING ROUTING PROPERLY. Top 2 etc. Again, just to recap, we're going to make a folder that's just called inference. In this folder, we're going to put the quantized weights that we already made, because we already made them before in the last session. So the bfloat16 and the int8 weights. And we're going to put one Python file for the inference code, and it's going to have all the flags, and it's also going to have a visualized flag. And the visualized flag is actually a lot more than that, because it keeps track of which expert is being used during each inference step, and that shows like a little pretty chart. So if we're generating with 30 steps, which is going to show which experts got to use the most and the least out of eight of them. And so we want to have this in the inference code. Make sure to read files in full before like a pass inference code that we already wrote. Try to list like the most recent files that we made for that. And we also want to have the quantization code to just be an all in one utility with a very nice terminal interface as well, because we want the quantization code to be able to handle float 16 bfloat 16 float 32 weights in both safe tensors and in dot pt format. So that needs to be very smart and tested that it actually works. And also, yeah, make a read me in this folder for the Paris model, because we're going to publish this on hogging face as the inference repository. And then also read all the MD files that we have written here in full because after we do all of this and after we test that it works and it differences fine. We're going to we're going to start to play around with network inference. So that's going to be the fun next step after. So again, make a 20 point to do this for this and please make sure to include at least four or five sentences per point. So the to do list is going to be very long, naturally and very detailed. But I believe we're going to do an excellent, excellent job here.

.ipynb_checkpoints/test_inference-checkpoint.png ADDED Viewed

Git LFS Details

SHA256: 4dba193213b76355dcae180f5631d852e440201f4c8a01cc8c2671fd96394aef
Pointer size: 131 Bytes
Size of remote file: 365 kB

.ipynb_checkpoints/test_int8-checkpoint.png ADDED Viewed

Git LFS Details

SHA256: 772d2190b8cbf97e519337bc959254abc4be981a46795e762fda5c8a9efecd9b
Pointer size: 131 Bytes
Size of remote file: 136 kB

README.md ADDED Viewed

	@@ -0,0 +1,154 @@

+# 🥖 Baguette - Paris MoE Text-to-Image
+A ~5 billion parameter Mixture-of-Experts diffusion model with 8 specialized experts.
+## ⚡ Quick Start
+```bash
+# Install dependencies
+pip install uv && uv pip install torch torchvision safetensors transformers diffusers accelerate tqdm
+# Generate 4 cat images
+python generate.py --prompt "a cute cat" --num_samples 4
+```
+That's it! Images saved to `output_bf16.png`.
+---
+## 🎨 Examples
+```bash
+# Simple generation
+python generate.py --prompt "sunset over mountains"
+# More samples, see expert routing
+python generate.py --prompt "abstract art" --num_samples 16 --visualize
+# Faster with fewer steps
+python generate.py --prompt "a dog" --num_steps 15
+# Lower memory (offload 4 experts to CPU)
+python generate.py --prompt "portrait" --offload 4
+# INT8 weights (smaller, slightly lower quality)
+python generate.py --prompt "forest" --precision int8
+```
+---
+## 📋 All Options
+| Flag | Default | Description |
+|------|---------|-------------|
+| `--prompt` | "a cute cat" | What to generate |
+| `--num_samples` | 16 | Number of images |
+| `--num_steps` | 30 | Sampling steps (20-50 recommended) |
+| `--cfg_scale` | 7.5 | Guidance strength (5-10 recommended) |
+| `--precision` | bf16 | `bf16` (best) or `int8` (smaller) |
+| `--topk` | 2 | Experts per sample (1 or 2) |
+| `--offload` | 0 | Experts to keep on CPU (0-7) |
+| `--visualize` | off | Show expert routing stats |
+| `--output` | auto | Output filename |
+| `--seed` | 999 | Random seed |
+---
+## 🔍 Expert Visualization
+Use `--visualize` to see which experts the router selects:
+```
+╭──────────────────────────────────────────────────╮
+│           ⚡ EXPERT USAGE DISTRIBUTION            │
+├──────────────────────────────────────────────────┤
+│ → E4  │████████████████████████████│ 40.6% │
+│   E2  │██████████████████████████  │ 36.7% │
+│   E6  │██████████               │ 14.8% │
+│   E1  │███                      │  5.5% │
+│   E5  │█                        │  2.3% │
+│   E0  │                         │  0.0% │
+│   E3  │                         │  0.0% │
+│   E7  │                         │  0.0% │
+├──────────────────────────────────────────────────┤
+│  Active: 5/8 experts   Calls: 128               │
+╰──────────────────────────────────────────────────╯
+╭──────────────────────────────────────────────────╮
+│            📈 ROUTING TIMELINE                   │
+├──────────────────────────────────────────────────┤
+│ Step  0  1  2  3  4  5  6  7  8  9 10 11 ...    │
+│ ────────────────────────────────────────────     │
+│  E0   ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·        │
+│  E2   ·  ·  ·  ·  ·  ·  ●  ●  ●  ●  ●  ●        │
+│  E4   ·  ·  ●  ●  ●  ●  ·  ·  ·  ·  ·  ·        │
+│  E6   ●  ●  ·  ·  ·  ·  ·  ·  ·  ·  ·  ·        │
+├──────────────────────────────────────────────────┤
+│  Routing changes:   2/11 steps (18%)            │
+╰──────────────────────────────────────────────────╯
+```
+---
+## 💾 Memory & Speed
+| Config | GPU Memory | Speed |
+|--------|-----------|-------|
+| BF16 (all on GPU) | ~25 GB | ~3 img/s |
+| BF16 + offload 4 | ~14 GB | ~1 img/s |
+| INT8 (all on GPU) | ~12 GB | ~2 img/s |
+| INT8 + offload 4 | ~8 GB | ~0.5 img/s |
+---
+## 🏗️ Architecture
+```
+┌─────────────────────────────────────────┐
+│            Paris MoE Model              │
+├─────────────────────────────────────────┤
+│  Router: DiT-B/2 (129M params)          │
+│    ↓ selects top-K experts              │
+│  Experts: 8× DiT-XL/2 (606M each)       │
+│    ↓ predicts velocity                  │
+│  VAE: Stable Diffusion VAE              │
+��    ↓ decodes to pixels                  │
+│  Output: 256×256 RGB                    │
+└─────────────────────────────────────────┘
+```
+- **Total Parameters**: ~5 Billion
+- **Latent Space**: 32×32×4
+- **Text Encoder**: CLIP ViT-L/14
+---
+## 📁 Files
+```
+├── generate.py      # Main generation script
+├── benchmark.py     # Performance testing
+├── quantize.py      # Weight conversion tool
+├── src/             # Model code
+└── weights/
+    ├── bf16/        # BFloat16 weights (9.3 GB)
+    └── int8/        # INT8 weights (4.8 GB)
+```
+---
+## 🔧 Convert Your Own Weights
+```bash
+# From PyTorch .pt to BF16 safetensors
+python quantize.py --input /path/to/weights --output ./weights/bf16 --format bf16
+# From BF16 to INT8
+python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
+```
+---
+## 📜 License
+Apache 2.0

__pycache__/generate.cpython-312.pyc ADDED Viewed

Binary file (37.7 kB). View file

benchmark.py ADDED Viewed

	@@ -0,0 +1,440 @@

+#!/usr/bin/env python3
+"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                                                                              ║
+║   📊 Paris MoE - Comprehensive Benchmarking Utility 📊                       ║
+║                                                                              ║
+║   Measures performance across precision modes, batch sizes, and configs.     ║
+║   Outputs results as both terminal display and Markdown file.                ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+Usage:
+    python benchmark.py                          # Run all benchmarks
+    python benchmark.py --quick                  # Quick benchmark (fewer configs)
+    python benchmark.py --precision bf16         # Benchmark specific precision
+    python benchmark.py --output results.md      # Save results to file
+"""
+import argparse
+import sys
+import os
+import time
+import gc
+from pathlib import Path
+from datetime import datetime
+from dataclasses import dataclass
+from typing import List, Dict, Optional
+SCRIPT_DIR = Path(__file__).parent.absolute()
+SRC_DIR = SCRIPT_DIR / "src"
+sys.path.insert(0, str(SRC_DIR))
+import torch
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              DATA STRUCTURES
+# ═══════════════════════════════════════════════════════════════════════════════
+@dataclass
+class BenchmarkResult:
+    """Single benchmark result."""
+    precision: str
+    num_samples: int
+    num_steps: int
+    topk: int
+    offload: int
+    load_time: float  # Model loading time (seconds)
+    gen_time: float   # Generation time (seconds)
+    decode_time: float  # VAE decoding time (seconds)
+    peak_memory_gb: float  # Peak GPU memory usage
+    @property
+    def total_time(self) -> float:
+        return self.gen_time + self.decode_time
+    @property
+    def throughput(self) -> float:
+        """Images per second (generation only)."""
+        return self.num_samples / self.gen_time if self.gen_time > 0 else 0
+    @property
+    def time_per_step(self) -> float:
+        """Seconds per sampling step."""
+        return self.gen_time / self.num_steps if self.num_steps > 0 else 0
+    @property
+    def time_per_image(self) -> float:
+        """Seconds per image (generation only)."""
+        return self.gen_time / self.num_samples if self.num_samples > 0 else 0
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              BENCHMARK RUNNER
+# ═══════════════════════════════════════════════════════════════════════════════
+def get_gpu_memory_gb() -> float:
+    """Get current GPU memory usage in GB."""
+    if torch.cuda.is_available():
+        return torch.cuda.max_memory_allocated() / (1024 ** 3)
+    return 0.0
+def reset_gpu_memory():
+    """Reset GPU memory tracking."""
+    if torch.cuda.is_available():
+        torch.cuda.reset_peak_memory_stats()
+        torch.cuda.empty_cache()
+        gc.collect()
+def run_single_benchmark(precision: str, num_samples: int, num_steps: int,
+                          topk: int, offload: int, device: str = 'cuda') -> BenchmarkResult:
+    """Run a single benchmark configuration."""
+    from generate import load_sampler
+    reset_gpu_memory()
+    # Load model
+    start_load = time.time()
+    sampler = load_sampler(precision=precision, device=device, offload=offload)
+    load_time = time.time() - start_load
+    # Set seed for reproducibility
+    torch.manual_seed(42)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed(42)
+    # Warmup run
+    _ = sampler.sample(
+        num_samples=1,
+        text_prompts=["warmup"],
+        cfg_scale=7.5,
+        num_steps=2,
+        use_bf16=(precision == 'bf16'),
+        topk=topk
+    )
+    reset_gpu_memory()
+    torch.cuda.synchronize()
+    # Timed generation
+    start_gen = time.time()
+    latents = sampler.sample(
+        num_samples=num_samples,
+        text_prompts=["a cute cat"],
+        cfg_scale=7.5,
+        num_steps=num_steps,
+        use_bf16=(precision == 'bf16'),
+        topk=topk
+    )
+    torch.cuda.synchronize()
+    gen_time = time.time() - start_gen
+    # Timed decoding
+    start_decode = time.time()
+    images = sampler.vae_manager.decode(latents)
+    torch.cuda.synchronize()
+    decode_time = time.time() - start_decode
+    peak_memory = get_gpu_memory_gb()
+    # Cleanup
+    del sampler, latents, images
+    gc.collect()
+    torch.cuda.empty_cache()
+    return BenchmarkResult(
+        precision=precision,
+        num_samples=num_samples,
+        num_steps=num_steps,
+        topk=topk,
+        offload=offload,
+        load_time=load_time,
+        gen_time=gen_time,
+        decode_time=decode_time,
+        peak_memory_gb=peak_memory
+    )
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              OUTPUT FORMATTERS
+# ═══════════════════════════════════════════════════════════════════════════════
+def format_terminal_results(results: List[BenchmarkResult], gpu_name: str) -> str:
+    """Format results for terminal display."""
+    lines = []
+    lines.append("""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                     📊 PARIS MoE BENCHMARK RESULTS 📊                        ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+    lines.append(f"  GPU: {gpu_name}")
+    lines.append(f"  Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+    lines.append("")
+    # Group by precision
+    precisions = sorted(set(r.precision for r in results))
+    for precision in precisions:
+        prec_results = [r for r in results if r.precision == precision]
+        lines.append(f"┌{'─'*78}┐")
+        lines.append(f"│  {precision.upper()} Precision{' '*65}│")
+        lines.append(f"├{'─'*78}┤")
+        lines.append(f"│ {'Samples':>8} │ {'Steps':>6} │ {'TopK':>5} │ {'Offload':>7} │ "
+                    f"{'Gen(s)':>8} │ {'Img/s':>6} │ {'s/step':>6} │ {'Mem(GB)':>8} │")
+        lines.append(f"├{'─'*78}┤")
+        for r in prec_results:
+            lines.append(
+                f"│ {r.num_samples:>8} │ {r.num_steps:>6} │ {r.topk:>5} │ {r.offload:>7} │ "
+                f"{r.gen_time:>8.2f} │ {r.throughput:>6.2f} │ {r.time_per_step:>6.3f} │ "
+                f"{r.peak_memory_gb:>8.2f} │"
+            )
+        lines.append(f"└{'─'*78}┘")
+        lines.append("")
+    # Summary
+    if results:
+        fastest = min(results, key=lambda r: r.time_per_image)
+        most_efficient = min(results, key=lambda r: r.peak_memory_gb)
+        lines.append("┌─────────────────────────────────────────────────────────────────┐")
+        lines.append("│                          📈 SUMMARY                             │")
+        lines.append("├─────────────────────────────────────────────────────────────────┤")
+        lines.append(f"│  🏆 Fastest:         {fastest.precision.upper():>6} @ {fastest.throughput:.2f} img/s              │")
+        lines.append(f"│  💾 Most Efficient:  {most_efficient.precision.upper():>6} @ {most_efficient.peak_memory_gb:.1f} GB peak          │")
+        lines.append("└─────────────────────────────────────────────────────────────────┘")
+    return "\n".join(lines)
+def format_markdown_results(results: List[BenchmarkResult], gpu_name: str) -> str:
+    """Format results as Markdown."""
+    lines = []
+    lines.append("# 📊 Paris MoE Benchmark Results")
+    lines.append("")
+    lines.append(f"**GPU:** {gpu_name}")
+    lines.append(f"**Date:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
+    lines.append("")
+    lines.append("## 🏗️ Model Architecture")
+    lines.append("")
+    lines.append("| Component | Details |")
+    lines.append("|-----------|---------|")
+    lines.append("| Experts | 8× DiT-XL/2 (606M params each) |")
+    lines.append("| Router | DiT-B/2 (129M params) |")
+    lines.append("| Total | ~5 Billion parameters |")
+    lines.append("| VAE | SD-VAE (stabilityai/sd-vae-ft-mse) |")
+    lines.append("| Text Encoder | CLIP ViT-L/14 |")
+    lines.append("")
+    # Group by precision
+    precisions = sorted(set(r.precision for r in results))
+    for precision in precisions:
+        prec_results = [r for r in results if r.precision == precision]
+        lines.append(f"## {precision.upper()} Precision")
+        lines.append("")
+        lines.append("| Samples | Steps | TopK | Offload | Gen Time (s) | Throughput (img/s) | Time/Step (s) | Peak Memory (GB) |")
+        lines.append("|---------|-------|------|---------|--------------|-------------------|---------------|------------------|")
+        for r in prec_results:
+            lines.append(
+                f"| {r.num_samples} | {r.num_steps} | {r.topk} | {r.offload} | "
+                f"{r.gen_time:.2f} | {r.throughput:.2f} | {r.time_per_step:.3f} | {r.peak_memory_gb:.2f} |"
+            )
+        lines.append("")
+    # Summary
+    if results:
+        lines.append("## 📈 Summary")
+        lines.append("")
+        fastest = min(results, key=lambda r: r.time_per_image)
+        most_efficient = min(results, key=lambda r: r.peak_memory_gb)
+        lines.append(f"- **🏆 Fastest Configuration:** {fastest.precision.upper()}, "
+                    f"{fastest.num_samples} samples @ {fastest.throughput:.2f} img/s")
+        lines.append(f"- **💾 Most Memory Efficient:** {most_efficient.precision.upper()} "
+                    f"with offload={most_efficient.offload} @ {most_efficient.peak_memory_gb:.1f} GB peak")
+        lines.append("")
+        # Recommendations
+        lines.append("## 🎯 Recommendations")
+        lines.append("")
+        lines.append("| Use Case | Precision | Offload | Expected Performance |")
+        lines.append("|----------|-----------|---------|---------------------|")
+        bf16_results = [r for r in results if r.precision == 'bf16' and r.offload == 0]
+        if bf16_results:
+            r = bf16_results[0]
+            lines.append(f"| **Production (Quality)** | BF16 | 0 | {r.throughput:.2f} img/s, {r.peak_memory_gb:.1f} GB |")
+        int8_results = [r for r in results if r.precision == 'int8' and r.offload == 0]
+        if int8_results:
+            r = int8_results[0]
+            lines.append(f"| **Balanced** | INT8 | 0 | {r.throughput:.2f} img/s, {r.peak_memory_gb:.1f} GB |")
+        offload_results = [r for r in results if r.offload > 0]
+        if offload_results:
+            r = min(offload_results, key=lambda x: x.peak_memory_gb)
+            lines.append(f"| **Low VRAM** | {r.precision.upper()} | {r.offload} | {r.throughput:.2f} img/s, {r.peak_memory_gb:.1f} GB |")
+    lines.append("")
+    lines.append("---")
+    lines.append("*Generated by Paris MoE Benchmark Utility*")
+    return "\n".join(lines)
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              MAIN
+# ═══════════════════════════════════════════════════════════════════════════════
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="📊 Paris MoE - Benchmark Utility",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  python benchmark.py                      # Full benchmark suite
+  python benchmark.py --quick              # Quick benchmark
+  python benchmark.py --precision bf16     # BF16 only
+  python benchmark.py --output results.md  # Save to file
+        """
+    )
+    parser.add_argument("--quick", action="store_true",
+                        help="Run quick benchmark with fewer configurations")
+    parser.add_argument("--precision", type=str, default=None,
+                        choices=["bf16", "int8", "mixed"],
+                        help="Benchmark specific precision only")
+    parser.add_argument("--output", "-o", type=str, default=None,
+                        help="Output Markdown file path")
+    parser.add_argument("--samples", type=int, default=None,
+                        help="Override number of samples")
+    parser.add_argument("--steps", type=int, default=None,
+                        help="Override number of steps")
+    return parser.parse_args()
+def get_benchmark_configs(args) -> List[Dict]:
+    """Get list of benchmark configurations to run."""
+    configs = []
+    if args.quick:
+        # Quick benchmark: minimal configs
+        precisions = [args.precision] if args.precision else ['bf16', 'int8']
+        samples = args.samples or 4
+        steps = args.steps or 10
+        for precision in precisions:
+            configs.append({
+                'precision': precision,
+                'num_samples': samples,
+                'num_steps': steps,
+                'topk': 1,
+                'offload': 0
+            })
+    else:
+        # Full benchmark suite
+        precisions = [args.precision] if args.precision else ['bf16', 'int8']
+        samples_list = [args.samples] if args.samples else [4, 16]
+        steps_list = [args.steps] if args.steps else [20, 30]
+        topk_list = [1, 2]
+        offload_list = [0, 4]
+        for precision in precisions:
+            for samples in samples_list:
+                for steps in steps_list:
+                    for topk in topk_list:
+                        for offload in offload_list:
+                            configs.append({
+                                'precision': precision,
+                                'num_samples': samples,
+                                'num_steps': steps,
+                                'topk': topk,
+                                'offload': offload
+                            })
+    return configs
+def main():
+    args = parse_args()
+    print("""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                                                                              ║
+║   📊 Paris MoE - Comprehensive Benchmarking Utility 📊                       ║
+║                                                                              ║
+║   Measuring performance across precision modes, batch sizes, and configs.    ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    if device != "cuda":
+        print("⚠️  Warning: Running on CPU. Benchmarks will be slow.")
+    gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
+    print(f"🖥️  Device: {gpu_name}")
+    configs = get_benchmark_configs(args)
+    print(f"📋 Running {len(configs)} benchmark configurations...\n")
+    results = []
+    for i, config in enumerate(configs):
+        print(f"[{i+1}/{len(configs)}] {config['precision'].upper()} | "
+              f"{config['num_samples']} samples | {config['num_steps']} steps | "
+              f"Top-{config['topk']} | Offload {config['offload']}")
+        try:
+            result = run_single_benchmark(
+                precision=config['precision'],
+                num_samples=config['num_samples'],
+                num_steps=config['num_steps'],
+                topk=config['topk'],
+                offload=config['offload'],
+                device=device
+            )
+            results.append(result)
+            print(f"    ✅ {result.gen_time:.2f}s, {result.throughput:.2f} img/s, "
+                  f"{result.peak_memory_gb:.1f} GB peak")
+        except Exception as e:
+            print(f"    ❌ Failed: {e}")
+        print()
+    if not results:
+        print("❌ No successful benchmarks!")
+        return 1
+    # Print terminal results
+    terminal_output = format_terminal_results(results, gpu_name)
+    print(terminal_output)
+    # Save Markdown if requested
+    if args.output:
+        md_output = format_markdown_results(results, gpu_name)
+        with open(args.output, 'w') as f:
+            f.write(md_output)
+        print(f"\n✅ Results saved to: {args.output}")
+    return 0
+if __name__ == "__main__":
+    exit(main())

benchmark_results.md ADDED Viewed

	@@ -0,0 +1,41 @@

+# 📊 Paris MoE Benchmark Results
+**GPU:** NVIDIA RTX 6000 Ada Generation
+**Date:** 2025-12-05 16:35:39
+## 🏗️ Model Architecture
+| Component | Details |
+|-----------|---------|
+| Experts | 8× DiT-XL/2 (606M params each) |
+| Router | DiT-B/2 (129M params) |
+| Total | ~5 Billion parameters |
+| VAE | SD-VAE (stabilityai/sd-vae-ft-mse) |
+| Text Encoder | CLIP ViT-L/14 |
+## BF16 Precision
+| Samples | Steps | TopK | Offload | Gen Time (s) | Throughput (img/s) | Time/Step (s) | Peak Memory (GB) |
+|---------|-------|------|---------|--------------|-------------------|---------------|------------------|
+| 4 | 10 | 1 | 0 | 1.49 | 2.68 | 0.149 | 10.79 |
+## INT8 Precision
+| Samples | Steps | TopK | Offload | Gen Time (s) | Throughput (img/s) | Time/Step (s) | Peak Memory (GB) |
+|---------|-------|------|---------|--------------|-------------------|---------------|------------------|
+| 4 | 10 | 1 | 0 | 2.12 | 1.89 | 0.212 | 20.17 |
+## 📈 Summary
+- **🏆 Fastest Configuration:** BF16, 4 samples @ 2.68 img/s
+- **💾 Most Memory Efficient:** BF16 with offload=0 @ 10.8 GB peak
+## 🎯 Recommendations
+| Use Case | Precision | Offload | Expected Performance |
+|----------|-----------|---------|---------------------|
+| **Production (Quality)** | BF16 | 0 | 2.68 img/s, 10.8 GB |
+| **Balanced** | INT8 | 0 | 1.89 img/s, 20.2 GB |
+---
+*Generated by Paris MoE Benchmark Utility*

generate.py ADDED Viewed

	@@ -0,0 +1,747 @@

+#!/usr/bin/env python3
+"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                                                                              ║
+║   🎨 Paris MoE - Unified Image Generation Script 🎨                         ║
+║                                                                              ║
+║   Mixture-of-Experts Diffusion Model (8× DiT-XL/2 + DiT-B/2 Router)         ║
+║   ~5 Billion Parameters Total                                                ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+Supports multiple precision modes:
+  - bf16:  Best quality, 9.3GB total (~1.2GB per expert)
+  - int8:  Good quality, 4.8GB total (~580MB per expert), 15x compression
+  - mixed: Router in bf16, experts in int8 (balanced)
+Memory Offloading:
+  - --offload N: Keep N experts in CPU memory, move to GPU only during computation
+  - Experts are moved to GPU → compute → moved back to CPU (memory offloading)
+  - All computation happens on GPU, only storage is on CPU
+Usage:
+    python generate.py --prompt "a cute cat" --precision bf16
+    python generate.py --prompt "a sunset over mountains" --precision int8 --visualize
+    python generate.py --prompt "abstract art" --precision mixed --num_samples 4 --topk 2
+"""
+import argparse
+import sys
+import os
+import time
+from pathlib import Path
+# Add src to path for model imports
+SCRIPT_DIR = Path(__file__).parent.absolute()
+SRC_DIR = SCRIPT_DIR / "src"
+sys.path.insert(0, str(SRC_DIR))
+import torch
+import torch.nn.functional as F
+from tqdm import tqdm
+from torchvision.utils import make_grid, save_image
+from safetensors.torch import load_file
+from safetensors import safe_open
+from transformers import CLIPTextModel, CLIPTokenizer
+from collections import defaultdict
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              WEIGHT PATHS
+# ═══════════════════════════════════════════════════════════════════════════════
+WEIGHTS_DIR = SCRIPT_DIR / "weights"
+BF16_DIR = WEIGHTS_DIR / "bf16"
+INT8_DIR = WEIGHTS_DIR / "int8"
+# ═══════════════════════════════════════════════════════════════════════════════
+#                           ASCII VISUALIZATION
+# ═══════════════════════════════════════════════════════════════════════════════
+class ExpertTracker:
+    """Tracks which experts are used during generation for visualization."""
+    def __init__(self, num_experts: int = 8):
+        self.num_experts = num_experts
+        self.usage_counts = defaultdict(int)
+        self.per_step_primary = []
+        self.total_calls = 0
+    def record(self, expert_ids: torch.Tensor, step: int, weights: torch.Tensor = None):
+        """Record expert usage for a batch at a given step."""
+        step_counts = defaultdict(float)
+        if weights is not None:
+            for eid, w in zip(expert_ids.flatten().tolist(), weights.flatten().tolist()):
+                self.usage_counts[eid] += 1
+                step_counts[eid] += w
+                self.total_calls += 1
+        else:
+            for eid in expert_ids.tolist():
+                self.usage_counts[eid] += 1
+                step_counts[eid] += 1.0
+                self.total_calls += 1
+        if step_counts:
+            self.per_step_primary.append(max(step_counts, key=step_counts.get))
+    def get_usage_chart(self) -> str:
+        """Chart 1: Expert usage ranked by frequency."""
+        if self.total_calls == 0:
+            return ""
+        # Sort experts by usage
+        sorted_experts = sorted(range(8), key=lambda e: self.usage_counts.get(e, 0), reverse=True)
+        max_count = max(self.usage_counts.values()) if self.usage_counts else 1
+        unique = sum(1 for e in range(8) if self.usage_counts.get(e, 0) > 0)
+        lines = [
+            "",
+            "╭────────────────��─────────────────────────────────╮",
+            "│           ⚡ EXPERT USAGE DISTRIBUTION            │",
+            "├──────────────────────────────────────────────────┤",
+        ]
+        bars = ['▏', '▎', '▍', '▌', '▋', '▊', '▉', '█']
+        for eid in sorted_experts:
+            count = self.usage_counts.get(eid, 0)
+            pct = 100 * count / self.total_calls if self.total_calls > 0 else 0
+            # Create gradient bar
+            bar_width = 28
+            fill = (count / max_count) * bar_width if max_count > 0 else 0
+            full_blocks = int(fill)
+            partial = int((fill - full_blocks) * 8)
+            bar = '█' * full_blocks
+            if partial > 0 and full_blocks < bar_width:
+                bar += bars[partial - 1]
+            bar = bar.ljust(bar_width, ' ')
+            marker = "→" if count == max_count and count > 0 else " "
+            lines.append(f"│ {marker} E{eid}  │{bar}│ {pct:5.1f}% │")
+        lines.extend([
+            "├──────────────────────────────────────────────────┤",
+            f"│  Active: {unique}/8 experts   Calls: {self.total_calls:<13} │",
+            "╰──────────────────────────────────────────────────╯",
+        ])
+        return "\n".join(lines)
+    def get_timeline(self) -> str:
+        """Chart 2: Visual timeline of expert selection per step."""
+        if not self.per_step_primary:
+            return ""
+        num_steps = len(self.per_step_primary)
+        show_steps = min(20, num_steps)
+        # Count transitions
+        transitions = sum(1 for i in range(1, num_steps)
+                         if self.per_step_primary[i] != self.per_step_primary[i-1])
+        lines = [
+            "",
+            "╭──────────────────────────────────────────────────╮",
+            "│            📈 ROUTING TIMELINE                   │",
+            "├──────────────────────────────────────────────────┤",
+        ]
+        # Compact step numbers
+        step_row = "│ Step "
+        for i in range(show_steps):
+            step_row += f"{i:2d} "
+        if num_steps > 20:
+            step_row = step_row[:48] + "..│"
+        else:
+            step_row = step_row[:48].ljust(48) + " │"
+        lines.append(step_row)
+        lines.append("│ " + "───" * 16 + " │")
+        # Show each expert's timeline
+        symbols = ['○', '●']
+        for eid in range(self.num_experts):
+            row = f"│  E{eid}  "
+            for step in range(show_steps):
+                if self.per_step_primary[step] == eid:
+                    row += " ● "
+                else:
+                    row += " · "
+            if num_steps > 20:
+                row = row[:48] + "..│"
+            else:
+                row = row[:48].ljust(48) + " │"
+            lines.append(row)
+        lines.extend([
+            "├──────────────────────────────────────────────────┤",
+            f"│  Routing changes: {transitions:>3}/{num_steps-1:<3} steps ({100*transitions/(num_steps-1):.0f}%)       │",
+            "╰──────────────────────────────────────────────────╯",
+        ])
+        return "\n".join(lines)
+# ═══════════════════════════════════════════════════════════════════════════════
+#                           INT8 DEQUANTIZATION
+# ═══════════════════════════════════════════════════════════════════════════════
+def dequantize_tensor(int8_tensor: torch.Tensor, t_min: float, t_max: float) -> torch.Tensor:
+    """Dequantize INT8 tensor back to float32."""
+    if t_min == t_max:
+        return torch.full_like(int8_tensor, t_min, dtype=torch.float32)
+    normalized = (int8_tensor.float() + 128) / 255.0
+    return normalized * (t_max - t_min) + t_min
+def load_int8_state_dict(safetensors_path: Path) -> dict:
+    """Load and dequantize INT8 safetensors to float32 state_dict."""
+    state_dict = {}
+    with safe_open(str(safetensors_path), framework="pt", device="cpu") as f:
+        keys = list(f.keys())
+        # Find quantized tensors (those with _min/_max companions)
+        quantized_keys = set()
+        for key in keys:
+            if key.endswith('._min'):
+                base_key = key[:-5]
+                quantized_keys.add(base_key)
+        # Load and dequantize
+        for key in keys:
+            # Skip metadata and quantization params
+            if key.endswith('._min') or key.endswith('._max'):
+                continue
+            if key == '_config_json':
+                continue
+            tensor = f.get_tensor(key)
+            if key in quantized_keys:
+                # Dequantize INT8 tensor
+                t_min = f.get_tensor(f"{key}._min").item()
+                t_max = f.get_tensor(f"{key}._max").item()
+                tensor = dequantize_tensor(tensor, t_min, t_max)
+            state_dict[key] = tensor
+    return state_dict
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              MODEL CREATION
+# ═══════════════════════════════════════════════════════════════════════════════
+def create_expert(config, expert_id: int = 0):
+    """Create a DiT expert model."""
+    from models import DiTExpert
+    return DiTExpert(config)
+def create_router(config):
+    """Create a DiT router model."""
+    from models import DiTRouter
+    return DiTRouter(config)
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              SAMPLER CLASS
+# ═══════════════════════════════════════════════════════════════════════════════
+class ParisSampler:
+    """Unified sampler for Paris MoE model with expert tracking."""
+    def __init__(self, experts: dict, router, vae_manager, config, device='cuda',
+                 offloaded_experts: set = None):
+        self.experts = experts
+        self.router = router
+        self.vae_manager = vae_manager
+        self.config = config
+        self.device = device
+        self.tracker = None
+        self.offloaded_experts = offloaded_experts or set()  # Which experts are on CPU
+        # Set models to eval mode
+        for expert in self.experts.values():
+            expert.eval()
+        if self.router is not None:
+            self.router.eval()
+        # Precompute null embeddings for CFG
+        self._precompute_null_embeddings()
+    def _precompute_null_embeddings(self):
+        """Precompute null embeddings for classifier-free guidance."""
+        try:
+            text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
+            tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
+            text_encoder = text_encoder.to(self.device)
+            text_encoder.eval()
+            with torch.no_grad():
+                null_tokens = tokenizer(
+                    [""],
+                    max_length=77,
+                    padding='max_length',
+                    truncation=True,
+                    return_tensors='pt'
+                )
+                self.null_text_embeds = text_encoder(null_tokens.input_ids.to(self.device)).last_hidden_state
+                self.null_attention_mask = null_tokens.attention_mask.to(self.device)
+            del text_encoder, tokenizer
+            torch.cuda.empty_cache()
+        except Exception as e:
+            print(f"Warning: Could not precompute null text embeddings: {e}")
+            self.null_text_embeds = None
+            self.null_attention_mask = None
+    def _encode_text_prompts(self, text_prompts: list):
+        """Encode text prompts using CLIP."""
+        text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
+        tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
+        text_encoder = text_encoder.to(self.device)
+        text_encoder.eval()
+        tokenizer_output = tokenizer(
+            text_prompts,
+            max_length=77,
+            padding='max_length',
+            truncation=True,
+            return_tensors='pt'
+        )
+        tokens = tokenizer_output.input_ids.to(self.device)
+        attention_mask = tokenizer_output.attention_mask.to(self.device)
+        with torch.no_grad():
+            text_embeds = text_encoder(tokens).last_hidden_state
+        del text_encoder, tokenizer
+        torch.cuda.empty_cache()
+        return text_embeds, attention_mask
+    def _move_expert_to_gpu(self, expert_id: int):
+        """Move an offloaded expert to GPU for computation."""
+        if expert_id in self.offloaded_experts:
+            self.experts[expert_id] = self.experts[expert_id].to(self.device)
+            torch.cuda.synchronize()  # Ensure transfer is complete
+    def _move_expert_to_cpu(self, expert_id: int):
+        """Move expert back to CPU after computation (memory offloading)."""
+        if expert_id in self.offloaded_experts:
+            self.experts[expert_id] = self.experts[expert_id].cpu()
+            torch.cuda.empty_cache()  # Free GPU memory immediately
+    def _run_expert_with_cfg(self, expert_id: int, samples: torch.Tensor, t: torch.Tensor,
+                              text_embeds: torch.Tensor, attention_mask: torch.Tensor,
+                              null_embeds: torch.Tensor, null_mask: torch.Tensor,
+                              cfg_scale: float) -> torch.Tensor:
+        """Run expert inference with optional CFG, handling memory offloading."""
+        # Move to GPU if offloaded
+        self._move_expert_to_gpu(expert_id)
+        expert = self.experts[expert_id]
+        try:
+            if cfg_scale != 1.0:
+                v_cond = expert(samples, t, text_embeds, attention_mask)
+                v_uncond = expert(samples, t, null_embeds, null_mask)
+                v_pred = v_uncond + cfg_scale * (v_cond - v_uncond)
+            else:
+                v_pred = expert(samples, t, text_embeds, attention_mask)
+            return v_pred
+        finally:
+            # Move back to CPU if it was offloaded (memory offloading)
+            self._move_expert_to_cpu(expert_id)
+    def sample(self, num_samples: int, text_prompts: list, cfg_scale: float = 7.5,
+               num_steps: int = 30, use_bf16: bool = True, track_experts: bool = False,
+               topk: int = 1):
+        """
+        Generate samples using expert routing.
+        Args:
+            num_samples: Number of images to generate
+            text_prompts: List of text prompts
+            cfg_scale: Classifier-free guidance scale
+            num_steps: Number of sampling steps
+            use_bf16: Use bfloat16 precision
+            track_experts: Track and visualize expert usage
+            topk: Number of experts to use per sample (1=top-1, 2=top-2, etc.)
+        """
+        # Initialize tracker if requested
+        if track_experts:
+            self.tracker = ExpertTracker(num_experts=8)
+        else:
+            self.tracker = None
+        text_embeds, attention_mask = self._encode_text_prompts(text_prompts)
+        latent_size = self.config.image_size
+        channels = 4
+        dtype = torch.bfloat16 if use_bf16 else torch.float32
+        # Start with random noise
+        samples = torch.randn(
+            num_samples, channels, latent_size, latent_size,
+            device=self.device, dtype=dtype
+        )
+        # Convert text embeds to appropriate dtype
+        text_embeds = text_embeds.to(dtype)
+        if self.null_text_embeds is not None:
+            null_text_embeds = self.null_text_embeds.to(dtype)
+            null_attention_mask = self.null_attention_mask
+        dt = 1.0 / num_steps
+        autocast_ctx = torch.amp.autocast(device_type='cuda', dtype=dtype) if use_bf16 else torch.no_grad()
+        with torch.no_grad(), autocast_ctx:
+            for i in tqdm(range(num_steps), desc="🎨 Generating"):
+                t = torch.ones(num_samples, device=self.device) * (1.0 - i * dt)
+                # Expand text embeddings if needed
+                batch_text_embeds = text_embeds.expand(num_samples, -1, -1) if text_embeds.shape[0] == 1 else text_embeds[:num_samples]
+                batch_attention_mask = attention_mask.expand(num_samples, -1) if attention_mask.shape[0] == 1 else attention_mask[:num_samples]
+                # Get router predictions (router expects float32)
+                with torch.amp.autocast(device_type='cuda', enabled=False):
+                    router_logits = self.router(samples.float(), t.float())
+                expert_probs = F.softmax(router_logits, dim=1)
+                if topk == 1:
+                    # Top-1 routing
+                    expert_choices = torch.argmax(expert_probs, dim=1)
+                    # Track expert usage
+                    if self.tracker is not None:
+                        self.tracker.record(expert_choices, i)
+                    # Predict velocity for each sample using selected expert
+                    v_pred = torch.zeros_like(samples)
+                    for expert_id in range(8):
+                        mask = (expert_choices == expert_id)
+                        if mask.any():
+                            mask_size = mask.sum().item()
+                            null_embeds = null_text_embeds.expand(mask_size, -1, -1)
+                            null_mask = null_attention_mask.expand(mask_size, -1)
+                            v_batch = self._run_expert_with_cfg(
+                                expert_id,
+                                samples[mask], t[mask],
+                                batch_text_embeds[mask], batch_attention_mask[mask],
+                                null_embeds, null_mask,
+                                cfg_scale
+                            )
+                            v_pred[mask] = v_batch
+                else:
+                    # Top-K routing with weighted ensemble
+                    topk_probs, topk_indices = torch.topk(expert_probs, k=min(topk, 8), dim=1)
+                    topk_probs = topk_probs / topk_probs.sum(dim=1, keepdim=True)  # Renormalize
+                    # Track expert usage
+                    if self.tracker is not None:
+                        self.tracker.record(topk_indices, i, topk_probs)
+                    v_pred = torch.zeros_like(samples)
+                    # Process each sample
+                    for sample_idx in range(num_samples):
+                        v_sample = torch.zeros(channels, latent_size, latent_size,
+                                              device=self.device, dtype=dtype)
+                        for k_idx in range(topk_indices.shape[1]):
+                            expert_id = topk_indices[sample_idx, k_idx].item()
+                            weight = topk_probs[sample_idx, k_idx].item()
+                            null_embeds = null_text_embeds
+                            null_mask = null_attention_mask
+                            v_expert = self._run_expert_with_cfg(
+                                expert_id,
+                                samples[sample_idx:sample_idx+1],
+                                t[sample_idx:sample_idx+1],
+                                batch_text_embeds[sample_idx:sample_idx+1],
+                                batch_attention_mask[sample_idx:sample_idx+1],
+                                null_embeds, null_mask,
+                                cfg_scale
+                            )
+                            v_sample += weight * v_expert.squeeze(0)
+                        v_pred[sample_idx] = v_sample
+                # Euler integration step
+                samples = samples - v_pred * dt
+        return samples.float()
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              MODEL LOADING
+# ═══════════════════════════════════════════════════════════════════════════════
+def load_sampler(precision: str = 'bf16', device: str = 'cuda', offload: int = 0):
+    """
+    Load Paris MoE sampler with specified precision.
+    Args:
+        precision: Weight precision ('bf16', 'int8', 'mixed')
+        device: Compute device ('cuda' or 'cpu')
+        offload: Number of experts to keep in CPU memory (0-7)
+                 These experts will be moved to GPU only during computation.
+    """
+    from vae_utils import VAEManager
+    # Determine weight directories based on precision
+    if precision == 'bf16':
+        expert_dir = BF16_DIR
+        router_dir = BF16_DIR
+        use_int8_experts = False
+    elif precision == 'int8':
+        expert_dir = INT8_DIR
+        router_dir = BF16_DIR  # Router always from bf16
+        use_int8_experts = True
+    elif precision == 'mixed':
+        expert_dir = INT8_DIR
+        router_dir = BF16_DIR
+        use_int8_experts = True
+    else:
+        raise ValueError(f"Unknown precision: {precision}. Use 'bf16', 'int8', or 'mixed'.")
+    # Load config
+    config_path = BF16_DIR / 'config.pt'
+    config_data = torch.load(config_path, map_location='cpu', weights_only=False)
+    config = config_data['config']
+    # Load router config
+    router_config_path = BF16_DIR / 'router_config.pt'
+    router_config_data = torch.load(router_config_path, map_location='cpu', weights_only=False)
+    router_config = router_config_data['config']
+    # Update config with router params
+    config.router_architecture = router_config.router_architecture
+    config.router_params = router_config.router_params
+    # Load router (always on GPU, bf16/float32)
+    print("📡 Loading router...")
+    router = create_router(config).to(device)
+    router_weights = load_file(str(router_dir / 'router.safetensors'))
+    router_weights = {k: v.float() for k, v in router_weights.items()}
+    router.load_state_dict(router_weights)
+    router.eval()
+    # Determine which experts to offload
+    # Offload the LAST N experts (highest IDs)
+    offloaded_experts = set(range(8 - offload, 8)) if offload > 0 else set()
+    # Load experts
+    experts = {}
+    for i in range(8):
+        print(f"🧠 Loading expert {i}...", end="")
+        expert = create_expert(config, expert_id=i)
+        if use_int8_experts:
+            expert_weights = load_int8_state_dict(expert_dir / f'expert_{i}.safetensors')
+        else:
+            expert_weights = load_file(str(expert_dir / f'expert_{i}.safetensors'))
+        expert.load_state_dict(expert_weights)
+        expert.eval()
+        # Convert to bf16 if using bf16 precision
+        if precision == 'bf16':
+            expert = expert.to(torch.bfloat16)
+        # Decide where to place the expert
+        if i in offloaded_experts:
+            expert = expert.cpu()  # Keep in CPU memory
+            print(f" 💾 (CPU memory, GPU compute)")
+        else:
+            expert = expert.to(device)  # Keep on GPU
+            print(f" 🎮 (GPU)")
+        experts[i] = expert
+    # Load VAE
+    print("🖼️  Loading VAE...")
+    vae_manager = VAEManager(device=device)
+    return ParisSampler(experts, router, vae_manager, config, device, offloaded_experts)
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              MAIN ENTRYPOINT
+# ═══════════════════════════════════════════════════════════════════════════════
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="🎨 Paris MoE - Image Generation",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  python generate.py --prompt "a cute cat playing piano"
+  python generate.py --prompt "sunset over mountains" --precision int8 --visualize
+  python generate.py --prompt "abstract art" --num_samples 4 --cfg_scale 10 --topk 2
+  python generate.py --prompt "portrait" --offload 4  # Offload 4 experts to CPU memory
+        """
+    )
+    parser.add_argument("--prompt", type=str, default="a cute cat",
+                        help="Text prompt for generation")
+    parser.add_argument("--num_samples", type=int, default=16,
+                        help="Number of images to generate (default: 16)")
+    parser.add_argument("--cfg_scale", type=float, default=7.5,
+                        help="Classifier-free guidance scale (default: 7.5)")
+    parser.add_argument("--num_steps", type=int, default=30,
+                        help="Number of sampling steps (default: 30)")
+    parser.add_argument("--seed", type=int, default=999,
+                        help="Random seed for reproducibility")
+    parser.add_argument("--output", type=str, default=None,
+                        help="Output image path (default: output_<precision>.png)")
+    parser.add_argument("--precision", type=str, default="bf16",
+                        choices=["bf16", "int8", "mixed"],
+                        help="Weight precision: bf16, int8, or mixed (default: bf16)")
+    parser.add_argument("--offload", type=int, default=0,
+                        help="Number of experts to keep in CPU memory (0-7). Computation still on GPU.")
+    parser.add_argument("--topk", type=int, default=2,
+                        help="Top-K expert routing (1=top-1, 2=top-2 ensemble, etc.) [default: 2]")
+    parser.add_argument("--visualize", action="store_true",
+                        help="Show expert usage visualization")
+    parser.add_argument("--no-save", action="store_true",
+                        help="Don't save output image (for testing)")
+    return parser.parse_args()
+def print_header():
+    """Print beautiful ASCII header."""
+    print("""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                                                                              ║
+║   ██████╗  █████╗ ██████╗ ██╗███████╗    ███╗   ███╗ ██████╗ ███████╗       ║
+║   ██╔══██╗██╔══██╗██╔══██╗██║██╔════╝    ████╗ ████║██╔═══██╗██╔════╝       ║
+║   ██████╔╝███████║██████╔╝██║███████╗    ██╔████╔██║██║   ██║█████╗         ║
+║   ██╔═══╝ ██╔══██║██╔══██╗██║╚════██║    ██║╚██╔╝██║██║   ██║██╔══╝         ║
+║   ██║     ██║  ██║██║  ██║██║███████║    ██║ ╚═╝ ██║╚██████╔╝███████╗       ║
+║   ╚═╝     ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝╚══════╝    ╚═╝     ╚═╝ ╚═════╝ ╚══════╝       ║
+║                                                                              ║
+║   🎨 Mixture-of-Experts Text-to-Image Diffusion Model                        ║
+║   📊 8× DiT-XL/2 Experts + DiT-B/2 Router (~5B Parameters)                   ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+def print_config(args):
+    """Print configuration summary."""
+    offload_str = f"{args.offload} experts (CPU mem, GPU compute)" if args.offload > 0 else "None"
+    topk_str = f"Top-{args.topk}" if args.topk > 1 else "Top-1"
+    print(f"""
+┌──────────────────────────────────────────────────────────────────────────────┐
+│  📋 Configuration                                                            │
+├──────────────────────────────────────────────────────────────────────────────┤
+│  Prompt:      {args.prompt[:50]:<50}│
+│  Samples:     {args.num_samples:<50}│
+│  Steps:       {args.num_steps:<50}│
+│  CFG Scale:   {args.cfg_scale:<50}│
+│  Precision:   {args.precision.upper():<50}│
+│  Routing:     {topk_str:<50}│
+│  Seed:        {args.seed:<50}│
+│  Offload:     {offload_str:<50}│
+└──────────────────────────────────────────────────────────────────────────────┘
+    """)
+def main():
+    args = parse_args()
+    # Print header
+    print_header()
+    print_config(args)
+    # Set seed
+    torch.manual_seed(args.seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed(args.seed)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"🖥️  Using device: {device}")
+    # Load sampler
+    print(f"\n📦 Loading {args.precision.upper()} weights...")
+    start_load = time.time()
+    sampler = load_sampler(
+        precision=args.precision,
+        device=device,
+        offload=args.offload
+    )
+    load_time = time.time() - start_load
+    print(f"⏱️  Model loaded in {load_time:.1f}s")
+    # Generate samples
+    print(f"\n🎨 Generating {args.num_samples} images...")
+    start_gen = time.time()
+    latents = sampler.sample(
+        num_samples=args.num_samples,
+        text_prompts=[args.prompt],
+        cfg_scale=args.cfg_scale,
+        num_steps=args.num_steps,
+        use_bf16=(args.precision == 'bf16'),
+        track_experts=args.visualize,
+        topk=args.topk
+    )
+    gen_time = time.time() - start_gen
+    # Show visualization if requested
+    if args.visualize and sampler.tracker is not None:
+        print(sampler.tracker.get_usage_chart())
+        print(sampler.tracker.get_timeline())
+    # Decode latents
+    print("\n🖼️  Decoding latents...")
+    start_decode = time.time()
+    images = sampler.vae_manager.decode(latents)
+    images = (images + 1.0) / 2.0
+    images = torch.clamp(images, 0, 1)
+    decode_time = time.time() - start_decode
+    # Save output
+    if not args.no_save:
+        output_path = args.output or f"output_{args.precision}.png"
+        nrow = 4 if args.num_samples >= 4 else args.num_samples
+        grid = make_grid(images.cpu(), nrow=nrow, normalize=False, padding=2)
+        save_image(grid, output_path)
+        print(f"\n✅ Saved to: {output_path}")
+    # Print timing summary
+    total_time = load_time + gen_time + decode_time
+    throughput = args.num_samples / gen_time
+    print(f"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                              ⏱️  Timing Summary ⏱️                             ║
+╠══════════════════════════════════════════════════════════════════════════════╣
+║  Model loading:    {load_time:>6.1f}s                                                ║
+║  Generation:       {gen_time:>6.1f}s ({throughput:.2f} img/s, {gen_time/args.num_steps:.2f}s/step)                    ║
+║  VAE decoding:     {decode_time:>6.1f}s                                                ║
+║  ──────────────────────────────                                              ║
+║  Total:            {total_time:>6.1f}s                                                ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+    print("🎉 Done!")
+if __name__ == "__main__":
+    main()

instructions.txt ADDED Viewed

	@@ -0,0 +1 @@

quantize.py ADDED Viewed

	@@ -0,0 +1,435 @@

+#!/usr/bin/env python3
+"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                                                                              ║
+║   🔧 Paris MoE - Weight Quantization Utility 🔧                              ║
+║                                                                              ║
+║   Converts weights between formats:                                          ║
+║   • Input:  .pt (PyTorch) or .safetensors (F32 or BF16)                     ║
+║   • Output: BF16 or INT8 safetensors                                         ║
+║                                                                              ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+Usage:
+    # Convert original .pt files to BF16 safetensors
+    python quantize.py --input /path/to/weights/ --output ./weights/bf16 --format bf16
+    # Convert to INT8 safetensors
+    python quantize.py --input /path/to/weights/ --output ./weights/int8 --format int8
+    # Convert from existing safetensors (bf16 -> int8)
+    python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
+Input Formats Supported:
+    - PyTorch .pt files (original training checkpoints)
+    - SafeTensors .safetensors files (F32 or BF16)
+Output Formats:
+    - bf16: BFloat16 safetensors (best quality, ~1.2GB per expert)
+    - int8: INT8 quantized safetensors (~580MB per expert)
+"""
+import argparse
+import os
+import gc
+from pathlib import Path
+from typing import Dict, Optional, Tuple
+import json
+import torch
+from safetensors.torch import save_file, load_file
+from safetensors import safe_open
+from tqdm import tqdm
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              FILE DETECTION
+# ═══════════════════════════════════════════════════════════════════════════════
+def detect_input_format(input_dir: Path) -> Tuple[str, Dict[str, Path]]:
+    """
+    Detect input format and locate weight files.
+    Returns:
+        format: 'pt' or 'safetensors'
+        files: Dict mapping 'expert_0'..'expert_7', 'router' to file paths
+    """
+    files = {}
+    # Check for PyTorch .pt files (original format)
+    pt_patterns = [
+        # Pattern 1: Full training checkpoint names
+        ("dit_xl2_multi_expert_pretrained_text_new_dataset_expert_{}_best.pt", "expert_{}"),
+        ("laion_router_preclustered_dit_berthead_b2_improved_router_best.pt", "router"),
+        # Pattern 2: Simple names
+        ("expert_{}_best.pt", "expert_{}"),
+        ("expert_{}.pt", "expert_{}"),
+        ("router_best.pt", "router"),
+        ("router.pt", "router"),
+    ]
+    # Check for SafeTensors files
+    st_patterns = [
+        ("expert_{}.safetensors", "expert_{}"),
+        ("router.safetensors", "router"),
+    ]
+    # Try PyTorch patterns first
+    for pattern, key_pattern in pt_patterns:
+        if "{}" in pattern:
+            # Expert pattern
+            for i in range(8):
+                filename = pattern.format(i)
+                filepath = input_dir / filename
+                if filepath.exists():
+                    key = key_pattern.format(i)
+                    files[key] = filepath
+        else:
+            # Router pattern
+            filepath = input_dir / pattern
+            if filepath.exists():
+                files[key_pattern] = filepath
+    if len(files) >= 8:  # At least 8 experts found
+        return 'pt', files
+    # Try SafeTensors patterns
+    files = {}
+    for pattern, key_pattern in st_patterns:
+        if "{}" in pattern:
+            for i in range(8):
+                filename = pattern.format(i)
+                filepath = input_dir / filename
+                if filepath.exists():
+                    key = key_pattern.format(i)
+                    files[key] = filepath
+        else:
+            filepath = input_dir / pattern
+            if filepath.exists():
+                files[key_pattern] = filepath
+    if len(files) >= 8:
+        return 'safetensors', files
+    # List what we found
+    print(f"Found files in {input_dir}:")
+    for f in sorted(input_dir.glob("*")):
+        print(f"  {f.name}")
+    raise ValueError(f"Could not find weight files in {input_dir}")
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              LOADING UTILITIES
+# ═══════════════════════════════════════════════════════════════════════════════
+def load_pt_expert(filepath: Path, expert_id: int) -> Tuple[dict, Optional[object]]:
+    """
+    Load expert weights from PyTorch checkpoint.
+    Returns:
+        state_dict: Model weights
+        config: Config object if available
+    """
+    print(f"  Loading {filepath.name}...")
+    ckpt = torch.load(filepath, map_location='cpu', weights_only=False)
+    # Try EMA weights first (preferred for inference)
+    ema_key = f'expert_{expert_id}_ema_state_dict'
+    regular_key = f'expert_{expert_id}_state_dict'
+    if ema_key in ckpt:
+        state_dict = ckpt[ema_key]
+        print(f"    Using EMA weights")
+    elif regular_key in ckpt:
+        state_dict = ckpt[regular_key]
+        print(f"    Using regular weights (no EMA)")
+    else:
+        # Try to find any state dict key
+        for k in ckpt.keys():
+            if 'state_dict' in k and 'optimizer' not in k:
+                state_dict = ckpt[k]
+                print(f"    Using key: {k}")
+                break
+        else:
+            raise KeyError(f"No state dict found in {filepath}")
+    config = ckpt.get('config', None)
+    return state_dict, config
+def load_pt_router(filepath: Path) -> Tuple[dict, Optional[object]]:
+    """Load router weights from PyTorch checkpoint."""
+    print(f"  Loading {filepath.name}...")
+    ckpt = torch.load(filepath, map_location='cpu', weights_only=False)
+    if 'router_state_dict' in ckpt:
+        state_dict = ckpt['router_state_dict']
+    else:
+        raise KeyError(f"router_state_dict not found in {filepath}")
+    config = ckpt.get('config', None)
+    return state_dict, config
+def load_safetensors_weights(filepath: Path) -> dict:
+    """Load weights from SafeTensors file."""
+    print(f"  Loading {filepath.name}...")
+    return load_file(str(filepath))
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              QUANTIZATION
+# ═══════════════════════════════════════════════════════════════════════════════
+def convert_to_bf16(state_dict: dict) -> dict:
+    """Convert all floating point tensors to bfloat16."""
+    bf16_state = {}
+    for k, v in state_dict.items():
+        if isinstance(v, torch.Tensor) and v.is_floating_point():
+            bf16_state[k] = v.to(torch.bfloat16)
+        else:
+            bf16_state[k] = v
+    return bf16_state
+def is_layernorm_key(key: str) -> bool:
+    """Check if a key belongs to a LayerNorm layer."""
+    ln_patterns = ['norm', 'layernorm', 'layer_norm', 'ln_', 'scale_shift_table']
+    key_lower = key.lower()
+    return any(p in key_lower for p in ln_patterns)
+def quantize_tensor_int8(tensor: torch.Tensor) -> Tuple[torch.Tensor, float, float]:
+    """
+    Quantize a tensor to INT8 with min/max scaling.
+    Formula: int8 = round((x - min) / (max - min) * 255) - 128
+    """
+    if tensor.numel() == 0:
+        return tensor.to(torch.int8), 0.0, 0.0
+    t_float = tensor.float()
+    t_min = t_float.min().item()
+    t_max = t_float.max().item()
+    if t_min == t_max:
+        return torch.zeros_like(tensor, dtype=torch.int8), t_min, t_max
+    # Quantize: map [min, max] to [-128, 127]
+    normalized = (t_float - t_min) / (t_max - t_min)
+    int8_tensor = (normalized * 255 - 128).round().clamp(-128, 127).to(torch.int8)
+    return int8_tensor, t_min, t_max
+def convert_to_int8(state_dict: dict) -> dict:
+    """
+    Convert state dict to INT8 quantized format.
+    LayerNorm and small tensors are kept in float32.
+    Quantization parameters (_min, _max) are stored alongside.
+    """
+    quantized = {}
+    stats = {'float32': 0, 'int8': 0}
+    for key, tensor in state_dict.items():
+        if not isinstance(tensor, torch.Tensor):
+            continue
+        # Skip LayerNorm layers - keep as float32
+        if is_layernorm_key(key):
+            quantized[key] = tensor.float()
+            stats['float32'] += tensor.numel()
+        # Only quantize weight tensors with enough elements
+        elif tensor.numel() >= 16 and tensor.dtype in [torch.float32, torch.float16, torch.bfloat16]:
+            int8_tensor, t_min, t_max = quantize_tensor_int8(tensor)
+            quantized[key] = int8_tensor
+            quantized[f"{key}._min"] = torch.tensor([t_min], dtype=torch.float32)
+            quantized[f"{key}._max"] = torch.tensor([t_max], dtype=torch.float32)
+            stats['int8'] += tensor.numel()
+        else:
+            # Keep small tensors as float32
+            quantized[key] = tensor.float()
+            stats['float32'] += tensor.numel()
+    return quantized, stats
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              MAIN CONVERSION
+# ═══════════════════════════════════════════════════════════════════════════════
+def convert_weights(input_dir: Path, output_dir: Path, output_format: str):
+    """
+    Convert weights to specified format.
+    Args:
+        input_dir: Directory containing input weights
+        output_dir: Directory to write output weights
+        output_format: 'bf16' or 'int8'
+    """
+    print(f"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                    🔧 Paris MoE Weight Conversion 🔧                         ║
+╠══════════════════════════════════════════════════════════════════════════════╣
+║  Input:  {str(input_dir):<60} ║
+║  Output: {str(output_dir):<60} ║
+║  Format: {output_format.upper():<60} ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+    # Detect input format
+    input_format, files = detect_input_format(input_dir)
+    print(f"📂 Detected input format: {input_format}")
+    print(f"📁 Found {len(files)} weight files")
+    # Create output directory
+    output_dir.mkdir(parents=True, exist_ok=True)
+    # Track sizes
+    sizes = {'input': 0, 'output': 0}
+    expert_config = None
+    router_config = None
+    # Process experts
+    print("\n🧠 Converting experts...")
+    for i in range(8):
+        key = f"expert_{i}"
+        if key not in files:
+            print(f"  ⚠️  {key} not found, skipping")
+            continue
+        filepath = files[key]
+        sizes['input'] += filepath.stat().st_size
+        # Load weights
+        if input_format == 'pt':
+            state_dict, config = load_pt_expert(filepath, i)
+            if config is not None and expert_config is None:
+                expert_config = config
+        else:
+            state_dict = load_safetensors_weights(filepath)
+        # Convert
+        if output_format == 'bf16':
+            converted = convert_to_bf16(state_dict)
+        else:
+            converted, stats = convert_to_int8(state_dict)
+            print(f"    INT8: {stats['int8']:,} params, Float32: {stats['float32']:,} params")
+        # Save
+        output_path = output_dir / f"expert_{i}.safetensors"
+        save_file(converted, str(output_path))
+        sizes['output'] += output_path.stat().st_size
+        print(f"  ✅ Saved: {output_path.name} ({output_path.stat().st_size / 1e6:.1f} MB)")
+        # Clean up
+        del state_dict, converted
+        gc.collect()
+    # Process router
+    if 'router' in files:
+        print("\n📡 Converting router...")
+        filepath = files['router']
+        sizes['input'] += filepath.stat().st_size
+        if input_format == 'pt':
+            state_dict, config = load_pt_router(filepath)
+            if config is not None:
+                router_config = config
+        else:
+            state_dict = load_safetensors_weights(filepath)
+        # Router always kept in bf16/float32 for stability
+        converted = convert_to_bf16(state_dict)
+        output_path = output_dir / "router.safetensors"
+        save_file(converted, str(output_path))
+        sizes['output'] += output_path.stat().st_size
+        print(f"  ✅ Saved: {output_path.name} ({output_path.stat().st_size / 1e6:.1f} MB)")
+        del state_dict, converted
+        gc.collect()
+    # Save configs if from .pt files
+    if expert_config is not None:
+        config_path = output_dir / "config.pt"
+        torch.save({'config': expert_config}, config_path)
+        print(f"  ✅ Saved: config.pt")
+    if router_config is not None:
+        config_path = output_dir / "router_config.pt"
+        torch.save({'config': router_config}, config_path)
+        print(f"  ✅ Saved: router_config.pt")
+    # Summary
+    compression = sizes['input'] / sizes['output'] if sizes['output'] > 0 else 1
+    print(f"""
+╔══════════════════════════════════════════════════════════════════════════════╗
+║                           📊 Conversion Summary 📊                           ║
+╠══════════════════════════════════════════════════════════════════════════════╣
+║  Input size:   {sizes['input']/1e9:>8.2f} GB                                           ║
+║  Output size:  {sizes['output']/1e9:>8.2f} GB                                           ║
+║  Compression:  {compression:>8.1f}x                                             ║
+╠══════════════════════════════════════════════════════════════════════════════╣
+║  ✅ Conversion complete!                                                      ║
+╚══════════════════════════════════════════════════════════════════════════════╝
+    """)
+    # List output files
+    print("📁 Output files:")
+    for f in sorted(output_dir.glob("*")):
+        print(f"  {f.name}: {f.stat().st_size/1e6:.1f} MB")
+# ═══════════════════════════════════════════════════════════════════════════════
+#                              CLI
+# ═══════════════════════════════════════════════════════════════════════════════
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="🔧 Paris MoE - Weight Quantization Utility",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Convert original .pt files to BF16
+  python quantize.py --input /path/to/weights --output ./weights/bf16 --format bf16
+  # Convert to INT8 from .pt files
+  python quantize.py --input /path/to/weights --output ./weights/int8 --format int8
+  # Convert from BF16 safetensors to INT8
+  python quantize.py --input ./weights/bf16 --output ./weights/int8 --format int8
+        """
+    )
+    parser.add_argument("--input", "-i", type=str, required=True,
+                        help="Input directory containing weight files")
+    parser.add_argument("--output", "-o", type=str, required=True,
+                        help="Output directory for converted weights")
+    parser.add_argument("--format", "-f", type=str, required=True,
+                        choices=["bf16", "int8"],
+                        help="Output format: bf16 or int8")
+    return parser.parse_args()
+def main():
+    args = parse_args()
+    input_dir = Path(args.input)
+    output_dir = Path(args.output)
+    if not input_dir.exists():
+        print(f"❌ Error: Input directory does not exist: {input_dir}")
+        return 1
+    convert_weights(input_dir, output_dir, args.format)
+    return 0
+if __name__ == "__main__":
+    exit(main())

requirements.txt ADDED Viewed

	@@ -0,0 +1,7 @@

+torch>=2.0
+torchvision
+safetensors
+transformers
+diffusers
+accelerate
+tqdm

src/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


1	+ # Paris MoE Inference - Source modules

src/__pycache__/config.cpython-312.pyc ADDED Viewed

Binary file (7.8 kB). View file

src/__pycache__/models.cpython-312.pyc ADDED Viewed

Binary file (90 kB). View file

src/__pycache__/schedules.cpython-312.pyc ADDED Viewed

Binary file (7.41 kB). View file

src/__pycache__/vae_utils.cpython-312.pyc ADDED Viewed

Binary file (8.65 kB). View file

src/config.py ADDED Viewed

	@@ -0,0 +1,199 @@

+# src/config.py
+import yaml
+from typing import Dict, Any, Optional
+from dataclasses import dataclass
+@dataclass
+class Config:
+    """Single config class - no inheritance needed"""
+    # Experiment
+    experiment_name: str
+    seed: int = 42
+    # Dataset
+    dataset_name: str = "cifar10"
+    image_size: int = 64
+    num_channels: Optional[int] = None  # If None, auto-determined based on dataset/latents
+    data_path: str = "./data"
+    download: bool = True
+    use_latents: bool = False  # Whether to use VAE latents instead of raw images
+    latent_data_path: Optional[str] = None  # Path to latent dataset JSON
+    split_strategy: str = "global" # "global" or "per_cluster"
+    preclustered_data_path: Optional[str] = None  # Path to pre-clustered data
+    train_ratio: float = 0.95  # Train/val split ratio
+    # Clustering (None for monolithic)
+    clustering_method: Optional[str] = None  # "manual", "kmeans",  <- note that we dont support dino as an on-the-fly clustering method
+    num_clusters: int = 1
+    manual_mapping: Optional[Dict[int, int]] = None
+    # Model
+    num_experts: int = 1  # 1 = monolithic, >1 = DDM
+    expert_architecture: str = "unet"  # "unet", "dit", "simple_cnn"
+    router_architecture: str = "none"  # "vit", "cnn", "dit", "none"
+    router_pretrained: bool = True
+    clip_tokenizer_name: str = "openai/clip-vit-large-patch14"
+    # Training
+    batch_size: int = 32
+    num_epochs: int = 20
+    learning_rate: float = 1e-4
+    optimizer: str = "adamw"
+    mixed_precision: bool = True
+    num_gpus: int = 1
+    distributed: bool = False
+    train_router_jointly: bool = False
+    weight_decay: float = 0
+    use_lr_scheduler: bool = True
+    warmup_steps: int = 0  # Learning rate warmup steps
+    warmup_factor: float = 0.1  # Learning rate warmup factor
+    grad_accum_steps: int = 1
+    use_amp: bool = True
+    imagenet_pretrain_checkpoint: Optional[str] = None
+    # Cluster imbalance handling
+    use_class_weights: bool = False  # Enable class weighting for imbalanced clusters
+    weight_smoothing: float = 0.0  # Weight smoothing factor (0.0-1.0)
+    # New dataset training options
+    new_dataset_learning_rate: Optional[float] = None
+    reset_optimizer: bool = True
+    reset_scheduler: bool = True
+    reset_epoch: bool = True
+    reset_ema: bool = False
+    # Decentralized training
+    expert_parallel: bool = False
+    target_expert_id: int = 0
+    target_gpu_id: int = 0
+    # FID evaluation
+    compute_fid: bool = False
+    fid_every: int = 5000
+    fid_num_samples: int = 5000
+    fid_batch_size: int = 50
+    # EMA parameters
+    use_ema: bool = True
+    ema_decay: float = 0.9999
+    ema_update_every: int = 1
+    # Heterogeneous objectives
+    expert_objectives: Optional[Dict[int, str]] = None  # {expert_id: "ddpm"|"fm"|"rf"}
+    default_objective: str = "fm"  # Default if expert_objectives not specified
+    # Schedule configuration (NEW)
+    schedule_type: str = "linear_interp"  # Default for backward compatibility
+    expert_schedule_types: Optional[Dict[int, str]] = None  # Per-expert schedules for Strategy B
+    # Consistency loss (NEW)
+    use_consistency_loss: bool = False
+    consistency_loss_weight: float = 0.1
+    # Model parameters (flexible dicts)
+    expert_params: Dict[str, Any] = None
+    router_params: Dict[str, Any] = None
+    video_config: Dict[str, Any] = None  # Video-specific parameters (temporal_frames, latent_height, etc.)
+    # Inference
+    sampling_strategy: str = "top1"  # "top1", "topk", "full", "monolithic"
+    num_inference_steps: int = 50
+    # Diffusion settings
+    beta_start: float = 0.0001
+    beta_end: float = 0.02
+    beta_schedule: str = "linear"
+    max_text_length: int = 77
+    # Paths
+    checkpoint_dir: str = "./outputs/checkpoints"
+    log_dir: str = "./outputs/logs"
+    def __post_init__(self) -> None:
+        # Set defaults for missing fields
+        if self.expert_params is None:
+            self.expert_params = {}
+        if self.router_params is None:
+            self.router_params = {}
+        if self.video_config is None:
+            self.video_config = {}
+        # Auto-determine num_channels if not explicitly set
+        if self.num_channels is None:
+            if self.use_latents:
+                self.num_channels = 4  # VAE latent channels
+            elif self.dataset_name in ["mnist", "fashionmnist"]:
+                self.num_channels = 1
+            else:
+                self.num_channels = 3
+        # Initialize and validate expert_objectives
+        valid_objectives = {"ddpm", "fm", "rf"}
+        # Validate default_objective
+        if self.default_objective not in valid_objectives:
+            raise ValueError(f"default_objective must be one of {valid_objectives}, got {self.default_objective}")
+        # Initialize expert_objectives if None
+        if self.expert_objectives is None:
+            self.expert_objectives = {i: self.default_objective for i in range(self.num_experts)}
+        else:
+            # Validate all objective types
+            for expert_id, obj_type in self.expert_objectives.items():
+                if obj_type not in valid_objectives:
+                    raise ValueError(f"Expert {expert_id} has invalid objective '{obj_type}'. Must be one of {valid_objectives}")
+            # Ensure all expert IDs have objectives assigned
+            for expert_id in range(self.num_experts):
+                if expert_id not in self.expert_objectives:
+                    self.expert_objectives[expert_id] = self.default_objective
+        # Validate schedule types (NEW)
+        valid_schedules = {"cosine", "linear_beta", "linear_interp"}
+        # Validate default schedule_type
+        if self.schedule_type not in valid_schedules:
+            raise ValueError(f"schedule_type must be one of {valid_schedules}, got {self.schedule_type}")
+        # Validate expert_schedule_types if provided
+        if self.expert_schedule_types is not None:
+            for expert_id, sched_type in self.expert_schedule_types.items():
+                if sched_type not in valid_schedules:
+                    raise ValueError(f"Expert {expert_id} has invalid schedule '{sched_type}'. Must be one of {valid_schedules}")
+    @classmethod
+    def from_yaml(cls, config_path: str) -> 'Config':
+        with open(config_path, 'r') as f:
+            config_dict = yaml.safe_load(f)
+        # Set defaults for missing fields
+        config_dict.setdefault('expert_params', {})
+        config_dict.setdefault('router_params', {})
+        # If num_experts is not specified, default to num_clusters (or 1 if num_clusters is not set)
+        if 'num_experts' not in config_dict:
+            num_clusters = config_dict.get('num_clusters', 1)
+            config_dict['num_experts'] = max(1, num_clusters)
+        return cls(**config_dict)
+    @property
+    def is_monolithic(self) -> bool:
+        return self.num_experts == 1
+    @property
+    def num_classes(self) -> int:
+        dataset_classes = {
+            "mnist": 10, "fashionmnist": 10,
+            "cifar10": 10, "cifar100": 100,
+            "celeba": 0,  # No class conditioning
+            "butterfly": 1,  # Single class for butterflies
+            "laion": 0  # No class conditioning for LAION
+        }
+        return dataset_classes.get(self.dataset_name, 10)
+def load_config(config_path: str) -> Config:
+    """Simple config loader"""
+    return Config.from_yaml(config_path)

src/models.py ADDED Viewed

	@@ -0,0 +1,1913 @@

+# src/models.py
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from diffusers import UNet2DModel
+from transformers import ViTForImageClassification, ViTConfig
+import math
+from typing import Optional, List
+import numpy as np
+# =============================================================================
+# TIME EMBEDDING (shared utility)
+# =============================================================================
+class TimeEmbedding(nn.Module):
+    def __init__(self, dim: int) -> None:
+        super().__init__()
+        self.dim = dim
+    def forward(self, t: torch.Tensor) -> torch.Tensor:
+        device = t.device
+        half_dim = self.dim // 2
+        embeddings = math.log(10000) / (half_dim - 1)
+        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
+        embeddings = t[:, None] * embeddings[None, :]
+        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
+        return embeddings
+class DiTTimestepEmbedder(nn.Module):
+    def __init__(self, hidden_size, freq_dim=128, max_period=10000):
+        super().__init__()
+        self.freq_dim = freq_dim
+        self.max_period = max_period
+        self.mlp = nn.Sequential(
+            nn.Linear(2*freq_dim, hidden_size, bias=True),
+            nn.SiLU(),
+            nn.Linear(hidden_size, hidden_size, bias=True),
+        )
+    def forward(self, t):  # t: [B] integers (float tensor ok)
+        # standard "timestep_embedding" (like ADM/DiT)
+        half = self.freq_dim
+        device = t.device
+        # positions in radians
+        freqs = torch.exp(
+            -torch.arange(half, device=device).float() * np.log(self.max_period) / half
+        )
+        args = t.float()[:, None] * freqs[None]  # [B, half]
+        emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)  # [B, 2*half]
+        return self.mlp(emb)
+# =============================================================================
+# OUTPUT CONVERTER (for heterogeneous objectives)
+# =============================================================================
+class OutputConverter(nn.Module):
+    def __init__(self, schedule_type: str = 'linear_interp', use_latents: bool = False, derivative_eps: float = 1e-4):
+        super().__init__()
+        from schedules import NoiseSchedule
+        self.schedule = NoiseSchedule(schedule_type)
+        self.schedule_type = schedule_type
+        self.use_latents = use_latents
+        self.derivative_eps = derivative_eps  # For finite difference derivatives
+        # Set clamping range based on data type
+        # VAE latents have larger range than pixel-space images
+        self.clamp_range = 20.0 if use_latents else 5.0
+    def _get_schedule_with_derivatives(self, t: torch.Tensor):
+        """
+        Compute schedule coefficients and their derivatives.
+        Essential for correct velocity computation with any schedule.
+        """
+        # Get coefficients at current time
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        # Compute derivatives using finite differences
+        h = torch.full_like(t, self.derivative_eps)
+        t_plus = (t + h).clamp(0.0, 1.0)
+        t_minus = (t - h).clamp(0.0, 1.0)
+        alpha_plus, sigma_plus = self.schedule.get_schedule(t_plus)
+        alpha_minus, sigma_minus = self.schedule.get_schedule(t_minus)
+        # Derivatives
+        dt = (t_plus - t_minus).clamp(min=1e-6)
+        d_alpha_dt = (alpha_plus - alpha_minus) / dt
+        d_sigma_dt = (sigma_plus - sigma_minus) / dt
+        return alpha_t, sigma_t, d_alpha_dt, d_sigma_dt
+    def epsilon_to_velocity(self, epsilon_pred: torch.Tensor, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
+        """
+        Correct ε→v conversion for ANY schedule using proper derivatives.
+        From ODE: dx_t/dt = d(alpha_t)/dt * x_0 + d(sigma_t)/dt * ε
+        This is the TRUE velocity for the schedule!
+        """
+        # Get schedule coefficients AND their derivatives
+        alpha_t, sigma_t, d_alpha_dt, d_sigma_dt = self._get_schedule_with_derivatives(t)
+        # Reshape for broadcasting
+        alpha_t = alpha_t.view(-1, 1, 1, 1)
+        sigma_t = sigma_t.view(-1, 1, 1, 1)
+        d_alpha_dt = d_alpha_dt.view(-1, 1, 1, 1)
+        d_sigma_dt = d_sigma_dt.view(-1, 1, 1, 1)
+        # Numerical stability: handle small alpha_t
+        alpha_safe = torch.clamp(alpha_t, min=0.01)
+        # Step 1: Recover x_0 using Tweedie's formula
+        x_0_pred = (x_t - sigma_t * epsilon_pred) / alpha_safe
+        # Step 2: Clamp x_0 to reasonable range (prevents blow-up)
+        # Use adaptive clamping: larger range for VAE latents, tighter for pixel space
+        x_0_pred = torch.clamp(x_0_pred, -self.clamp_range, self.clamp_range)
+        # Step 3: Compute velocity based on schedule type
+        if self.schedule_type == 'linear_interp':
+            # For linear interpolation: x_t = (1-t)*x_0 + t*ε
+            # Velocity is simply: v = ε - x_0
+            v = epsilon_pred - x_0_pred
+        else:
+            # For cosine and other schedules: use proper derivatives
+            # v = d(alpha_t)/dt * x_0 + d(sigma_t)/dt * ε
+            v = d_alpha_dt * x_0_pred + d_sigma_dt * epsilon_pred
+            # Adaptive velocity scaling for cosine schedule
+            # Derivatives vary dramatically with timestep - need adaptive dampening
+            if self.schedule_type == 'cosine':
+                t_val = t[0].item() if t.numel() > 0 else 0.5
+                if t_val > 0.85:
+                    # Very high noise: derivatives are large, need dampening
+                    scale = 0.88
+                elif t_val > 0.6:
+                    # Medium-high noise: moderate dampening
+                    scale = 0.93
+                else:
+                    # Low to medium noise: slight dampening
+                    scale = 0.96
+                v = v * scale
+                # Per-channel bias correction to prevent color drift
+                # The model has inherent channel bias that gets amplified by integration
+                # Remove per-channel mean to prevent accumulation
+                # Only apply to color channels (1,2,3), preserve luminance channel (0)
+                for c in range(1, 4):
+                    v[:, c] = v[:, c] - v[:, c].mean()
+        return v
+    def convert(self, prediction: torch.Tensor, objective_type: str, x_t: torch.Tensor, t: torch.Tensor):
+        """
+        Convert any prediction to velocity space.
+        Args:
+            prediction: expert output
+            objective_type: 'ddpm' | 'fm' | 'rf'
+            x_t: current noisy state
+            t: current timesteps
+        Returns:
+            v: velocity representation
+        """
+        if objective_type == "ddpm":
+            # Proper ε→v conversion for unified integration
+            return self.epsilon_to_velocity(prediction, x_t, t)
+        elif objective_type in ["fm", "rf"]:
+            return prediction  # Already velocity
+        else:
+            raise ValueError(f"Unknown objective type: {objective_type}")
+# =============================================================================
+# EXPERT MODELS
+# =============================================================================
+class UNetExpert(nn.Module):
+    """UNet expert using diffusers"""
+    def __init__(self, config) -> None:
+        super().__init__()
+        # Default UNet params
+        default_params = {
+            "sample_size": config.image_size,
+            "in_channels": config.num_channels,
+            "out_channels": config.num_channels,
+            "layers_per_block": 2,
+            "block_out_channels": [64, 128, 256, 256],
+            "attention_head_dim": 8,
+        }
+        # Override with config params
+        params = {**default_params, **config.expert_params}
+        # Store objective type for heterogeneous training (and remove from params)
+        self.objective_type = params.pop("objective_type", "fm")
+        # Store and initialize schedule (NEW)
+        schedule_type = params.pop("schedule_type", "linear_interp")
+        from schedules import NoiseSchedule
+        self.schedule = NoiseSchedule(schedule_type)
+        self.unet = UNet2DModel(**params)
+    def forward(self, xt: torch.Tensor, t: torch.Tensor, **kwargs) -> torch.Tensor:
+        # Scale timesteps for diffusers (expects 0-1000)
+        # t_scaled = (t * 1000).long()
+        t_scaled = (t * 999).round().long().clamp(0, 999)
+        return self.unet(xt, t_scaled).sample
+    def compute_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Unified loss computation based on objective type"""
+        if self.objective_type == "ddpm":
+            return self.ddpm_loss(x0)
+        elif self.objective_type == "fm":
+            return self.flow_matching_loss(x0)
+        elif self.objective_type == "rf":
+            return self.rectified_flow_loss(x0)
+        else:
+            raise ValueError(f"Unknown objective type: {self.objective_type}")
+    def ddpm_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """DDPM: predict noise ε"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        t = torch.rand(batch_size, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        pred_eps = self.forward(xt, t)
+        return F.mse_loss(pred_eps, noise)
+    def rectified_flow_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Rectified Flow: predict velocity v = x_1 - x_0"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        t = torch.rand(batch_size, device=device)
+        x1 = torch.randn_like(x0)
+        xt = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * x1
+        pred_v = self.forward(xt, t)
+        true_v = x1 - x0
+        return F.mse_loss(pred_v, true_v)
+    def flow_matching_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Flow matching loss for training"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        # Sample random timesteps
+        t = torch.rand(batch_size, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        # Add noise
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        # Predict velocity
+        pred_v = self.forward(xt, t)
+        # True velocity for flow matching
+        # true_v = x0 - xt
+        true_v = noise - x0
+        return F.mse_loss(pred_v, true_v)
+class SimpleCNNExpert(nn.Module):
+    """Simple CNN expert for fast training"""
+    def __init__(self, config) -> None:
+        super().__init__()
+        # Default params
+        default_params = {
+            "hidden_dims": [64, 128, 256],
+            "time_dim": 64,
+        }
+        params = {**default_params, **config.expert_params}
+        # Store objective type for heterogeneous training
+        self.objective_type = params.get("objective_type", "fm")
+        # Store and initialize schedule (NEW)
+        schedule_type = params.get("schedule_type", "linear_interp")
+        from schedules import NoiseSchedule
+        self.schedule = NoiseSchedule(schedule_type)
+        self.time_embedding = TimeEmbedding(params["time_dim"])
+        self.target_size = config.image_size
+        # Simple encoder-decoder
+        self.encoder = self._build_encoder(config.num_channels, params["hidden_dims"])
+        self.decoder = self._build_decoder(params["hidden_dims"], config.num_channels)
+        # Time conditioning
+        self.time_mlp = nn.Sequential(
+            nn.Linear(params["time_dim"], params["hidden_dims"][-1]),
+            nn.SiLU(),
+            nn.Linear(params["hidden_dims"][-1], params["hidden_dims"][-1])
+        )
+    def _build_encoder(self, in_channels: int, hidden_dims: List[int]) -> nn.Sequential:
+        layers = []
+        prev_dim = in_channels
+        for dim in hidden_dims:
+            layers.extend([
+                nn.Conv2d(prev_dim, dim, 3, padding=1),
+                nn.GroupNorm(8, dim),
+                nn.SiLU(),
+                nn.Conv2d(dim, dim, 3, padding=1),
+                nn.GroupNorm(8, dim),
+                nn.SiLU(),
+                nn.MaxPool2d(2)
+            ])
+            prev_dim = dim
+        return nn.Sequential(*layers)
+    def _build_decoder(self, hidden_dims: List[int], out_channels: int) -> nn.Sequential:
+        layers = []
+        reversed_dims = list(reversed(hidden_dims))
+        for i, dim in enumerate(reversed_dims[:-1]):
+            next_dim = reversed_dims[i + 1]
+            layers.extend([
+                nn.ConvTranspose2d(dim, next_dim, 4, stride=2, padding=1),
+                nn.GroupNorm(8, next_dim),
+                nn.SiLU(),
+                nn.Conv2d(next_dim, next_dim, 3, padding=1),
+                nn.GroupNorm(8, next_dim),
+                nn.SiLU(),
+            ])
+        # Final layer
+        layers.append(nn.Conv2d(reversed_dims[-1], out_channels, 3, padding=1))
+        return nn.Sequential(*layers)
+    def forward(self, xt: torch.Tensor, t: torch.Tensor, **kwargs) -> torch.Tensor:
+        # Time embedding
+        time_emb = self.time_embedding(t)
+        time_features = self.time_mlp(time_emb)
+        # Encode
+        encoded = self.encoder(xt)
+        # Add time conditioning
+        time_features = time_features.view(time_features.shape[0], -1, 1, 1)
+        time_features = time_features.expand(-1, -1, encoded.shape[2], encoded.shape[3])
+        conditioned = encoded + time_features
+        # Decode
+        output = self.decoder(conditioned)
+        # Ensure output matches target size
+        output = F.interpolate(output, size=xt.shape[-2:], mode='bilinear', align_corners=False)
+        return output
+    def compute_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Unified loss computation based on objective type"""
+        if self.objective_type == "ddpm":
+            return self.ddpm_loss(x0)
+        elif self.objective_type == "fm":
+            return self.flow_matching_loss(x0)
+        elif self.objective_type == "rf":
+            return self.rectified_flow_loss(x0)
+        else:
+            raise ValueError(f"Unknown objective type: {self.objective_type}")
+    def ddpm_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """DDPM: predict noise ε"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        t = torch.rand(batch_size, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        pred_eps = self.forward(xt, t)
+        # Ensure pred_eps matches noise shape
+        if pred_eps.shape != noise.shape:
+            pred_eps = F.interpolate(pred_eps, size=noise.shape[-2:], mode='bilinear', align_corners=False)
+        return F.mse_loss(pred_eps, noise)
+    def rectified_flow_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Rectified Flow: predict velocity v = x_1 - x_0"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        t = torch.rand(batch_size, device=device)
+        x1 = torch.randn_like(x0)
+        xt = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * x1
+        pred_v = self.forward(xt, t)
+        true_v = x1 - x0
+        # Ensure pred_v matches true_v shape
+        if pred_v.shape != true_v.shape:
+            pred_v = F.interpolate(pred_v, size=true_v.shape[-2:], mode='bilinear', align_corners=False)
+        return F.mse_loss(pred_v, true_v)
+    def flow_matching_loss(self, x0: torch.Tensor) -> torch.Tensor:
+        """Flow matching loss"""
+        batch_size = x0.shape[0]
+        device = x0.device
+        t = torch.rand(batch_size, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        pred_v = self.forward(xt, t)
+        # true_v = x0 - xt
+        true_v = noise - x0
+        # Ensure pred_v matches true_v shape
+        if pred_v.shape != true_v.shape:
+            pred_v = F.interpolate(pred_v, size=true_v.shape[-2:], mode='bilinear', align_corners=False)
+        return F.mse_loss(pred_v, true_v)
+# Helper function from original DiT
+def modulate(x, shift, scale):
+    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
+# Fixed sin-cos position embedding from original
+def get_2d_sincos_pos_embed(embed_dim, grid_size):
+    grid_h = np.arange(grid_size, dtype=np.float32)
+    grid_w = np.arange(grid_size, dtype=np.float32)
+    grid = np.meshgrid(grid_w, grid_h)
+    grid = np.stack(grid, axis=0)
+    grid = grid.reshape([2, 1, grid_size, grid_size])
+    assert embed_dim % 2 == 0
+    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])
+    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])
+    emb = np.concatenate([emb_h, emb_w], axis=1)
+    return emb
+def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
+    assert embed_dim % 2 == 0
+    omega = np.arange(embed_dim // 2, dtype=np.float64)
+    omega /= embed_dim / 2.
+    omega = 1. / 10000**omega
+    pos = pos.reshape(-1)
+    out = np.einsum('m,d->md', pos, omega)
+    emb_sin = np.sin(out)
+    emb_cos = np.cos(out)
+    emb = np.concatenate([emb_sin, emb_cos], axis=1)
+    return emb
+# Timestep Embedder
+class TimestepEmbedder(nn.Module):
+    def __init__(self, hidden_size: int, frequency_embedding_size: int = 256):
+        super().__init__()
+        self.frequency_embedding_size = frequency_embedding_size
+        self.mlp = nn.Sequential(
+            nn.Linear(frequency_embedding_size, hidden_size, bias=True),
+            nn.SiLU(),
+            nn.Linear(hidden_size, hidden_size, bias=True),
+        )
+    @staticmethod
+    def timestep_embedding(t, dim, max_period=10000):
+        half = dim // 2
+        freqs = torch.exp(-math.log(max_period) * torch.arange(0, half, dtype=torch.float32, device=t.device) / half)
+        args = t[:, None].float() * freqs[None]
+        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+        if dim % 2:
+            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
+        return embedding
+    def forward(self, t: torch.Tensor) -> torch.Tensor:
+        t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
+        return self.mlp(t_freq)
+# DiTBlock with proper AdaLN-Zero
+class DiTBlock(nn.Module):
+    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: float = 4.0, use_text: bool = False, use_adaln_single: bool = False):
+        super().__init__()
+        self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=0.1, batch_first=True)
+        self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        mlp_hidden_dim = int(hidden_size * mlp_ratio)
+        self.mlp = nn.Sequential(
+            nn.Linear(hidden_size, mlp_hidden_dim),
+            nn.GELU(approximate="tanh"),  # Match original
+            nn.Linear(mlp_hidden_dim, hidden_size),
+        )
+        # AdaLN modulation - either per-block MLP or AdaLN-Single embeddings
+        self.use_adaln_single = use_adaln_single
+        if use_adaln_single:
+            # AdaLN-Single: use learnable per-block embeddings instead of MLP
+            self.scale_shift_table = nn.Parameter(torch.randn(6, hidden_size) / hidden_size ** 0.5)
+            self.adaLN_modulation = None  # No MLP needed
+        else:
+            # Original AdaLN with per-block MLP
+            self.adaLN_modulation = nn.Sequential(
+                nn.SiLU(),
+                nn.Linear(hidden_size, 6 * hidden_size, bias=True)
+            )
+            self.scale_shift_table = None
+        # Optional text cross-attention
+        self.use_text = use_text
+        if use_text:
+            # Note: PixArt uses xformers which may handle unnormalized queries differently
+            # We add a simple norm for stability with PyTorch's MultiheadAttention
+            self.norm_cross = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+            self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=0.1, batch_first=True)
+    def forward(self, x: torch.Tensor, c: torch.Tensor, text_emb: Optional[torch.Tensor] = None,
+                attention_mask: Optional[torch.Tensor] = None):
+        # Get modulation parameters
+        if self.use_adaln_single:
+            # AdaLN-Single: combine global time embedding with per-block parameters
+            # c should be pre-computed from global t_block with shape [B, 6*hidden_size]
+            B = x.shape[0]
+            # Chunk and squeeze to get [B, hidden_size] tensors for compatibility with PyTorch's MultiheadAttention
+            temp = (self.scale_shift_table[None] + c.reshape(B, 6, -1)).chunk(6, dim=1)
+            shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = [t.squeeze(1) for t in temp]
+        else:
+            # Original AdaLN: compute modulation from per-block MLP
+            # Also squeeze after chunk to get [B, hidden_size] for consistency
+            temp = self.adaLN_modulation(c).chunk(6, dim=1)
+            shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = [t.squeeze(1) for t in temp]
+        # Self-attention with modulation
+        # Both paths now use modulate function for consistency
+        x_norm = modulate(self.norm1(x), shift_msa, scale_msa)
+        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
+        x = x + gate_msa.unsqueeze(1) * attn_out
+        # Optional cross-attention
+        if self.use_text and text_emb is not None:
+            if text_emb.dim() == 2:
+                text_emb = text_emb.unsqueeze(1)
+            # Convert attention mask to key_padding_mask format (True = ignore)
+            # attention_mask: shape [B, T]; either bool (True=keep) or 0/1 numeric (1=keep)
+            key_padding_mask = None
+            if attention_mask is not None:
+                if attention_mask.dtype is not torch.bool:
+                    # Convert 0/1 (or >=1) to bool keep-mask first
+                    keep_mask = attention_mask > 0
+                else:
+                    keep_mask = attention_mask
+                # key_padding_mask semantics: True = ignore, False = keep
+                key_padding_mask = ~keep_mask  # logical NOT, not arithmetic subtraction
+            # Normalize queries for stability (PixArt uses xformers which may differ)
+            x_norm = self.norm_cross(x)
+            cross_out, _ = self.cross_attn(x_norm, text_emb, text_emb, key_padding_mask=key_padding_mask)
+            x = x + cross_out
+        # MLP with modulation
+        # Both paths now use modulate function for consistency
+        x_norm = modulate(self.norm2(x), shift_mlp, scale_mlp)
+        mlp_out = self.mlp(x_norm)
+        x = x + gate_mlp.unsqueeze(1) * mlp_out
+        return x
+# FinalLayer with AdaLN modulation
+class FinalLayer(nn.Module):
+    def __init__(self, hidden_size: int, patch_size: int, out_channels: int):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
+        self.adaLN_modulation = nn.Sequential(
+            nn.SiLU(),
+            nn.Linear(hidden_size, 2 * hidden_size, bias=True)
+        )
+    def forward(self, x: torch.Tensor, c: torch.Tensor):
+        shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
+        x = modulate(self.norm_final(x), shift, scale)
+        x = self.linear(x)
+        return x
+# T2IFinalLayer with AdaLN-Single for parameter efficiency
+class T2IFinalLayer(nn.Module):
+    def __init__(self, hidden_size: int, patch_size: int, out_channels: int):
+        super().__init__()
+        self.norm_final = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
+        self.linear = nn.Linear(hidden_size, patch_size * patch_size * out_channels, bias=True)
+        # AdaLN-Single: use learnable embeddings instead of MLP
+        self.scale_shift_table = nn.Parameter(torch.randn(2, hidden_size) / hidden_size ** 0.5)
+        self.hidden_size = hidden_size
+    def forward(self, x: torch.Tensor, t: torch.Tensor):
+        # t should be the original time embedding with shape [B, hidden_size]
+        # Following PixArt implementation exactly
+        shift, scale = (self.scale_shift_table[None] + t[:, None]).chunk(2, dim=1)
+        # shift and scale are [B, 1, hidden_size], use t2i_modulate style
+        x = self.norm_final(x) * (1 + scale) + shift
+        x = self.linear(x)
+        return x
+# DiTExpert
+class DiTExpert(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        default_params = {
+            "hidden_size": 768,
+            "num_layers": 12,
+            "num_heads": 12,
+            "patch_size": 2,
+            "in_channels": 4,
+            "out_channels": 4,
+            "use_text_conditioning": False,
+            "use_class_conditioning": False,
+            "num_classes": 1000,  # ImageNet classes
+            "mlp_ratio": 4.0,
+            "text_embed_dim": 768,
+            "use_dit_time_embed": False,
+        }
+        params = {**default_params, **config.expert_params}
+        self.patch_size = params["patch_size"]
+        self.in_channels = params["in_channels"]
+        self.out_channels = params["out_channels"]
+        self.hidden_size = params["hidden_size"]
+        self.num_heads = params["num_heads"]
+        self.use_text = params.get("use_text_conditioning", False)
+        self.use_class = params.get("use_class_conditioning", False)
+        self.cfg_dropout_prob = params.get("cfg_dropout_prob", 0.1)  # 10% dropout for CFG
+        self.text_embed_dim = params.get("text_embed_dim", 768)
+        self.use_adaln_single = params.get("use_adaln_single", False)  # AdaLN-Single for parameter efficiency
+        self.depth = params["num_layers"]
+        # Store objective type for heterogeneous training
+        self.objective_type = params.get("objective_type", "fm")
+        # Store and initialize schedule (NEW)
+        schedule_type = params.get("schedule_type", "linear_interp")
+        from schedules import NoiseSchedule
+        self.schedule = NoiseSchedule(schedule_type)
+        # Validation: cannot use both text and class conditioning simultaneously
+        assert not (self.use_text and self.use_class), "Cannot use both text and class conditioning simultaneously"
+        # Patch embedding
+        self.patch_embed = nn.Conv2d(self.in_channels, self.hidden_size,
+                                     kernel_size=self.patch_size, stride=self.patch_size)
+        # Fixed sin-cos positional embedding
+        latent_size = getattr(config, 'image_size', 32)
+        self.num_patches = (latent_size // self.patch_size) ** 2
+        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, self.hidden_size), requires_grad=False)
+        # Time embedding
+        self.use_dit_time_embed = params.get("use_dit_time_embed", False)
+        if self.use_dit_time_embed:
+            self.time_embed = DiTTimestepEmbedder(self.hidden_size)
+        else:
+            self.time_embed = TimestepEmbedder(self.hidden_size)
+        # Global time block for AdaLN-Single
+        if self.use_adaln_single:
+            self.t_block = nn.Sequential(
+                nn.SiLU(),
+                nn.Linear(self.hidden_size, 6 * self.hidden_size, bias=True)
+            )
+        # Optional text conditioning
+        if self.use_text:
+            self.text_proj = nn.Linear(self.text_embed_dim, self.hidden_size)
+            self.text_norm = nn.LayerNorm(self.hidden_size, elementwise_affine=False, eps=1e-6)
+            # Note: null text embedding will be provided by empty string encoding from CLIP
+            # This is handled in the training loop, not as a learnable parameter
+        # Optional class conditioning (ImageNet style)
+        if self.use_class:
+            # Add 1 extra embedding for null/unconditional class
+            self.class_embed = nn.Embedding(params["num_classes"] + 1, self.hidden_size)
+            self.null_class_id = params["num_classes"]  # Use last index as null class
+        # Transformer blocks
+        self.layers = nn.ModuleList([
+            DiTBlock(self.hidden_size, self.num_heads, params.get("mlp_ratio", 4.0),
+                    self.use_text, use_adaln_single=self.use_adaln_single)
+            for _ in range(self.depth)
+        ])
+        # Final layer with modulation
+        if self.use_adaln_single:
+            self.final_layer = T2IFinalLayer(self.hidden_size, self.patch_size, self.out_channels)
+        else:
+            self.final_layer = FinalLayer(self.hidden_size, self.patch_size, self.out_channels)
+        # Initialize weights
+        self.initialize_weights()
+    def initialize_weights(self):
+        # Initialize transformer layers
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                torch.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+        # Initialize positional embedding with sin-cos
+        grid_size = int(self.num_patches ** 0.5)
+        pos_embed = get_2d_sincos_pos_embed(self.pos_embed.shape[-1], grid_size)
+        self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
+        # Initialize patch_embed like nn.Linear
+        w = self.patch_embed.weight.data
+        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+        if self.patch_embed.bias is not None:
+            nn.init.constant_(self.patch_embed.bias, 0)
+        # Initialize timestep embedding MLP
+        nn.init.normal_(self.time_embed.mlp[0].weight, std=0.02)
+        nn.init.normal_(self.time_embed.mlp[2].weight, std=0.02)
+        # Zero-out adaLN modulation layers in DiT blocks (from DiT paper)
+        for block in self.layers:
+            if block.adaLN_modulation is not None:
+                # Original AdaLN mode
+                nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
+                nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
+            # AdaLN-Single mode: scale_shift_table is already initialized with randn/sqrt(hidden_size)
+            # Zero-out cross-attention output projection (from PixArt-Alpha)
+            if self.use_text and hasattr(block, 'cross_attn'):
+                nn.init.constant_(block.cross_attn.out_proj.weight, 0)
+                nn.init.constant_(block.cross_attn.out_proj.bias, 0)
+        # Initialize text projection layer (analogous to PixArt's caption embedding)
+        if self.use_text and hasattr(self, 'text_proj'):
+            nn.init.normal_(self.text_proj.weight, std=0.02)
+            if self.text_proj.bias is not None:
+                nn.init.constant_(self.text_proj.bias, 0)
+        # Initialize class embedding layer (similar to DiT paper)
+        if self.use_class and hasattr(self, 'class_embed'):
+            nn.init.normal_(self.class_embed.weight, std=0.02)
+        # Initialize global t_block for AdaLN-Single
+        if self.use_adaln_single and hasattr(self, 't_block'):
+            nn.init.normal_(self.t_block[1].weight, std=0.02)
+            # Zero-out t_block initially for stability
+            nn.init.constant_(self.t_block[1].bias, 0)
+        # Zero-out output layers
+        if hasattr(self.final_layer, 'adaLN_modulation') and self.final_layer.adaLN_modulation is not None:
+            # Original FinalLayer
+            nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
+            nn.init.constant_(self.final_layer.adaLN_modulation[-1].bias, 0)
+        # T2IFinalLayer scale_shift_table is already initialized with randn/sqrt(hidden_size)
+        nn.init.constant_(self.final_layer.linear.weight, 0)
+        nn.init.constant_(self.final_layer.linear.bias, 0)
+    def forward(self, xt: torch.Tensor, t: torch.Tensor, text_embeds: Optional[torch.Tensor] = None,
+                attention_mask: Optional[torch.Tensor] = None, class_labels: Optional[torch.Tensor] = None, **kwargs) -> torch.Tensor:
+        B, C, H, W = xt.shape
+        # Handle timestep scaling - DiT expects timesteps in [0, 999] range
+        # If t is normalized (in [0, 1]), scale it to [0, 999]
+        if t.max() <= 1.0 and t.min() >= 0.0:
+            # Normalized timesteps, scale to DiT range
+            t = t * 999.0
+        # Ensure t is in correct range for DiT
+        t = t.clamp(0, 999)
+        # Patchify
+        x = self.patch_embed(xt)  # [B, hidden_size, H//p, W//p]
+        x = x.flatten(2).transpose(1, 2)  # [B, num_patches, hidden_size]
+        x = x + self.pos_embed  # Add positional embedding
+        # Prepare conditioning
+        time_emb = self.time_embed(t)  # [B, hidden_size]
+        # Add class conditioning to time embedding if using class conditioning
+        if self.use_class and class_labels is not None:
+            class_emb = self.class_embed(class_labels)  # [B, hidden_size]
+            time_emb = time_emb + class_emb  # Additive combination following DiT paper
+        # Process conditioning based on AdaLN mode
+        if self.use_adaln_single:
+            # AdaLN-Single: compute global modulation once
+            c = self.t_block(time_emb)  # [B, 6*hidden_size]
+        else:
+            # Original AdaLN: pass time embedding to each block
+            c = time_emb
+        # Prepare text tokens for cross-attention (not fused with time)
+        text_tokens = None
+        if self.use_text and text_embeds is not None:
+            if text_embeds.dim() == 3:
+                text_tokens = self.text_proj(text_embeds)  # [B, T, hidden_size]
+                text_tokens = self.text_norm(text_tokens)
+            else:
+                text_tokens = self.text_proj(text_embeds).unsqueeze(1)  # [B, 1, hidden_size]
+                text_tokens = self.text_norm(text_tokens)
+            if attention_mask is not None:
+                # cast to bool, clamp shapes to text_tokens length
+                attention_mask = attention_mask[:, :text_tokens.shape[1]].to(torch.bool)
+                # safety: avoid all-false rows (would yield NaNs in softmax)
+                all_false = attention_mask.sum(dim=1) == 0
+                if all_false.any():
+                    attention_mask[all_false, 0] = True
+        # Apply transformer blocks
+        for layer in self.layers:
+            x = layer(x, c, text_tokens, attention_mask)
+        # Final projection
+        if self.use_adaln_single:
+            # T2IFinalLayer expects original time embedding, not global modulation
+            x = self.final_layer(x, time_emb)  # [B, num_patches, patch_size^2 * out_channels]
+        else:
+            # Original FinalLayer expects conditioning
+            x = self.final_layer(x, c)  # [B, num_patches, patch_size^2 * out_channels]
+        # Unpatchify
+        patch_h = patch_w = int(self.num_patches ** 0.5)
+        x = x.view(B, patch_h, patch_w, self.patch_size, self.patch_size, self.out_channels)
+        x = x.permute(0, 5, 1, 3, 2, 4).contiguous()
+        x = x.view(B, self.out_channels, H, W)
+        return x
+    def compute_loss(self, x0: torch.Tensor, text_embeds: Optional[torch.Tensor] = None,
+                     attention_mask: Optional[torch.Tensor] = None, class_labels: Optional[torch.Tensor] = None,
+                     null_text_embeds: Optional[torch.Tensor] = None, null_attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+        """Unified loss computation based on objective type"""
+        if self.objective_type == "ddpm":
+            return self.ddpm_loss(x0, text_embeds, attention_mask, class_labels, null_text_embeds, null_attention_mask)
+        elif self.objective_type == "fm":
+            return self.flow_matching_loss(x0, text_embeds, attention_mask, class_labels, null_text_embeds, null_attention_mask)
+        elif self.objective_type == "rf":
+            return self.rectified_flow_loss(x0, text_embeds, attention_mask, class_labels, null_text_embeds, null_attention_mask)
+        else:
+            raise ValueError(f"Unknown objective type: {self.objective_type}")
+    def ddpm_loss(self, x0: torch.Tensor, text_embeds: Optional[torch.Tensor] = None,
+                  attention_mask: Optional[torch.Tensor] = None, class_labels: Optional[torch.Tensor] = None,
+                  null_text_embeds: Optional[torch.Tensor] = None, null_attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+        """DDPM: predict noise ε"""
+        B = x0.shape[0]
+        device = x0.device
+        # Sample time uniformly
+        t = torch.rand(B, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        # Apply CFG dropout during training
+        if self.training and self.cfg_dropout_prob > 0:
+            if self.use_text and text_embeds is not None:
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)  # True = keep text
+                if null_text_embeds is not None:
+                    # Use provided null text embeddings (from empty string CLIP encoding)
+                    if null_text_embeds.shape[0] == 1:
+                        null_text_embeds = null_text_embeds.expand(B, -1, -1)
+                    # Replace dropped samples with null text embeddings
+                    dropped = ~keep
+                    if dropped.any():
+                        text_embeds = text_embeds.clone()
+                        text_embeds[dropped] = null_text_embeds[dropped]
+                        # Use provided null attention mask or create default for empty string
+                        if attention_mask is not None:
+                            attention_mask = attention_mask.clone()
+                            if null_attention_mask is not None:
+                                if null_attention_mask.shape[0] == 1:
+                                    null_attention_mask = null_attention_mask.expand(B, -1)
+                                attention_mask[dropped] = null_attention_mask[dropped]
+                            else:
+                                attention_mask[dropped] = 0
+                                attention_mask[dropped, 0] = 1
+                else:
+                    # Fallback to old zeroing approach if null_text_embeds not provided
+                    if text_embeds.dim() == 3:   # [B, T, D]
+                        text_embeds = text_embeds * keep[:, None, None].to(text_embeds.dtype)
+                    else:                        # [B, D]
+                        text_embeds = text_embeds * keep[:, None].to(text_embeds.dtype)
+                    if attention_mask is not None:
+                        attention_mask = attention_mask.clone()
+                        dropped = ~keep
+                        if dropped.any():
+                            attention_mask[dropped, 0] = 1
+            elif self.use_class and class_labels is not None:
+                # Apply CFG dropout to class labels using null class embedding
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)
+                null_class = torch.full_like(class_labels, self.null_class_id)
+                class_labels = torch.where(keep, class_labels, null_class)
+        # Predict noise
+        pred_eps = self.forward(xt, t, text_embeds, attention_mask, class_labels)
+        return F.mse_loss(pred_eps, noise)
+    def rectified_flow_loss(self, x0: torch.Tensor, text_embeds: Optional[torch.Tensor] = None,
+                            attention_mask: Optional[torch.Tensor] = None, class_labels: Optional[torch.Tensor] = None,
+                            null_text_embeds: Optional[torch.Tensor] = None, null_attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+        """Rectified Flow: predict velocity v = x_1 - x_0 (straight paths)"""
+        B = x0.shape[0]
+        device = x0.device
+        # Sample time uniformly
+        t = torch.rand(B, device=device)
+        # Straight-line interpolation
+        x1 = torch.randn_like(x0)  # Gaussian noise as x_1
+        xt = (1 - t).view(-1, 1, 1, 1) * x0 + t.view(-1, 1, 1, 1) * x1
+        # Apply CFG dropout during training
+        if self.training and self.cfg_dropout_prob > 0:
+            if self.use_text and text_embeds is not None:
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)  # True = keep text
+                if null_text_embeds is not None:
+                    # Use provided null text embeddings (from empty string CLIP encoding)
+                    if null_text_embeds.shape[0] == 1:
+                        null_text_embeds = null_text_embeds.expand(B, -1, -1)
+                    # Replace dropped samples with null text embeddings
+                    dropped = ~keep
+                    if dropped.any():
+                        text_embeds = text_embeds.clone()
+                        text_embeds[dropped] = null_text_embeds[dropped]
+                        # Use provided null attention mask or create default for empty string
+                        if attention_mask is not None:
+                            attention_mask = attention_mask.clone()
+                            if null_attention_mask is not None:
+                                if null_attention_mask.shape[0] == 1:
+                                    null_attention_mask = null_attention_mask.expand(B, -1)
+                                attention_mask[dropped] = null_attention_mask[dropped]
+                            else:
+                                attention_mask[dropped] = 0
+                                attention_mask[dropped, 0] = 1
+                else:
+                    # Fallback to old zeroing approach if null_text_embeds not provided
+                    if text_embeds.dim() == 3:   # [B, T, D]
+                        text_embeds = text_embeds * keep[:, None, None].to(text_embeds.dtype)
+                    else:                        # [B, D]
+                        text_embeds = text_embeds * keep[:, None].to(text_embeds.dtype)
+                    if attention_mask is not None:
+                        attention_mask = attention_mask.clone()
+                        dropped = ~keep
+                        if dropped.any():
+                            attention_mask[dropped, 0] = 1
+            elif self.use_class and class_labels is not None:
+                # Apply CFG dropout to class labels using null class embedding
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)
+                null_class = torch.full_like(class_labels, self.null_class_id)
+                class_labels = torch.where(keep, class_labels, null_class)
+        # Predict velocity (x_1 - x_0)
+        pred_v = self.forward(xt, t, text_embeds, attention_mask, class_labels)
+        true_v = x1 - x0
+        return F.mse_loss(pred_v, true_v)
+    def flow_matching_loss(self, x0: torch.Tensor, text_embeds: Optional[torch.Tensor] = None,
+                           attention_mask: Optional[torch.Tensor] = None, class_labels: Optional[torch.Tensor] = None,
+                           null_text_embeds: Optional[torch.Tensor] = None, null_attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+        """Flow matching loss for latent space training with CFG dropout."""
+        B = x0.shape[0]
+        device = x0.device
+        # Sample time uniformly
+        t = torch.rand(B, device=device)
+        # Use proper schedule (NEW)
+        alpha_t, sigma_t = self.schedule.get_schedule(t)
+        noise = torch.randn_like(x0)
+        xt = alpha_t.view(-1, 1, 1, 1) * x0 + sigma_t.view(-1, 1, 1, 1) * noise
+        # Apply CFG dropout during training
+        if self.training and self.cfg_dropout_prob > 0:
+            if self.use_text and text_embeds is not None:
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)  # True = keep text
+                if null_text_embeds is not None:
+                    # Use provided null text embeddings (from empty string CLIP encoding)
+                    # Ensure null_text_embeds matches the batch size
+                    if null_text_embeds.shape[0] == 1:
+                        null_text_embeds = null_text_embeds.expand(B, -1, -1)
+                    # Replace dropped samples with null text embeddings
+                    dropped = ~keep
+                    if dropped.any():
+                        text_embeds = text_embeds.clone()
+                        text_embeds[dropped] = null_text_embeds[dropped]
+                        # Use provided null attention mask or create default for empty string
+                        if attention_mask is not None:
+                            attention_mask = attention_mask.clone()
+                            if null_attention_mask is not None:
+                                # Ensure null_attention_mask matches batch size
+                                if null_attention_mask.shape[0] == 1:
+                                    null_attention_mask = null_attention_mask.expand(B, -1)
+                                attention_mask[dropped] = null_attention_mask[dropped]
+                            else:
+                                # Default: For null text (empty string), typically only the first token is valid
+                                attention_mask[dropped] = 0
+                                attention_mask[dropped, 0] = 1  # Keep only first token for empty string
+                else:
+                    # Fallback to old zeroing approach if null_text_embeds not provided
+                    if text_embeds.dim() == 3:   # [B, T, D]
+                        text_embeds = text_embeds * keep[:, None, None].to(text_embeds.dtype)
+                    else:                        # [B, D]
+                        text_embeds = text_embeds * keep[:, None].to(text_embeds.dtype)
+                    # Handle attention mask for fallback approach
+                    if attention_mask is not None:
+                        attention_mask = attention_mask.clone()
+                        dropped = ~keep
+                        if dropped.any():
+                            attention_mask[dropped, 0] = 1
+            elif self.use_class and class_labels is not None:
+                # Apply CFG dropout to class labels using null class embedding
+                keep = (torch.rand(B, device=device) > self.cfg_dropout_prob)  # True = keep class
+                # Use the dedicated null class embedding for unconditional generation
+                null_class = torch.full_like(class_labels, self.null_class_id)
+                class_labels = torch.where(keep, class_labels, null_class)
+        # Predict velocity
+        pred_v = self.forward(xt, t, text_embeds, attention_mask, class_labels)
+        true_v = noise - x0
+        return F.mse_loss(pred_v, true_v)
+# =============================================================================
+# ROUTER MODELS
+# =============================================================================
+class ViTRouter(nn.Module):
+    """ViT-based router for cluster classification"""
+    def __init__(self, config) -> None:
+        super().__init__()
+        # Default params
+        default_params = {
+            "hidden_size": 384,
+            "num_layers": 6,
+            "num_heads": 6,
+            "patch_size": 8,
+            "use_dit_time_embed": False,  # Whether to use DiT-style time embedding
+        }
+        params = {**default_params, **config.router_params}
+        if config.router_pretrained:
+            # Use pretrained ViT and adapt
+            self.vit = ViTForImageClassification.from_pretrained(
+                "google/vit-base-patch16-224"
+            )
+            self._adapt_pretrained(config, params)
+        else:
+            # Build from scratch
+            vit_config = ViTConfig(
+                image_size=config.image_size,
+                patch_size=params["patch_size"],
+                num_channels=config.num_channels,
+                hidden_size=params["hidden_size"],
+                num_hidden_layers=params["num_layers"],
+                num_attention_heads=params["num_heads"],
+                num_labels=config.num_clusters
+            )
+            self.vit = ViTForImageClassification(vit_config)
+        # Time conditioning - support both embedding styles
+        self.use_dit_time_embed = params.get("use_dit_time_embed", False)
+        if self.use_dit_time_embed:
+            # Use DiT-style timestep embedding for consistency
+            self.time_embedding = DiTTimestepEmbedder(params["hidden_size"])
+        else:
+            # Original simple time embedding
+            self.time_embedding = nn.Sequential(
+                nn.Linear(1, params["hidden_size"]),
+                nn.SiLU(),
+                nn.Linear(params["hidden_size"], params["hidden_size"])
+            )
+        # Combined classifier
+        self.classifier = nn.Sequential(
+            nn.Linear(params["hidden_size"] * 2, params["hidden_size"]),
+            nn.ReLU(),
+            nn.Dropout(0.1),
+            nn.Linear(params["hidden_size"], config.num_clusters)
+        )
+    def _adapt_pretrained(self, config, params) -> ViTForImageClassification:
+        """Adapt pretrained ViT for our task"""
+        # Modify patch embeddings if needed
+        if config.image_size != 224 or config.num_channels != 3:
+            self.vit.vit.embeddings.patch_embeddings.projection = nn.Conv2d(
+                config.num_channels,
+                self.vit.config.hidden_size,
+                kernel_size=params["patch_size"],
+                stride=params["patch_size"]
+            )
+    def forward(self, xt: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
+        # Process image through ViT
+        vit_outputs = self.vit.vit(xt)
+        image_features = vit_outputs.last_hidden_state[:, 0]  # CLS token
+        # Time conditioning
+        if self.use_dit_time_embed:
+            # DiT embedder expects raw timesteps
+            time_features = self.time_embedding(t)
+        else:
+            # Original embedding needs unsqueeze
+            time_features = self.time_embedding(t.unsqueeze(-1))
+        # Combine and classify
+        combined = torch.cat([image_features, time_features], dim=1)
+        return self.classifier(combined)
+class CNNRouter(nn.Module):
+    """Simple CNN router for cluster classification"""
+    def __init__(self, config) -> None:
+        super().__init__()
+        # Default params
+        default_params = {
+            "hidden_dims": [64, 128, 256],
+            "use_dit_time_embed": False,  # Whether to use DiT-style time embedding
+        }
+        params = {**default_params, **config.router_params}
+        # CNN backbone
+        self.backbone = self._build_cnn(config.num_channels, params["hidden_dims"])
+        # Time embedding - support both styles
+        self.use_dit_time_embed = params.get("use_dit_time_embed", False)
+        if self.use_dit_time_embed:
+            # Use DiT-style timestep embedding, output to 128 dims for CNN
+            self.time_embedding = DiTTimestepEmbedder(128)
+        else:
+            # Original simple time embedding
+            self.time_embedding = nn.Sequential(
+                nn.Linear(1, 128),
+                nn.SiLU(),
+                nn.Linear(128, 128)
+            )
+        # Classifier
+        self.classifier = nn.Sequential(
+            nn.Linear(params["hidden_dims"][-1] + 128, 256),
+            nn.ReLU(),
+            nn.Dropout(0.1),
+            nn.Linear(256, config.num_clusters)
+        )
+    def _build_cnn(self, in_channels: int, hidden_dims: List[int]) -> nn.Sequential:
+        layers = []
+        prev_dim = in_channels
+        for dim in hidden_dims:
+            layers.extend([
+                nn.Conv2d(prev_dim, dim, 3, padding=1),
+                nn.ReLU(),
+                nn.Conv2d(dim, dim, 3, padding=1),
+                nn.ReLU(),
+                nn.MaxPool2d(2)
+            ])
+            prev_dim = dim
+        layers.append(nn.AdaptiveAvgPool2d(1))
+        layers.append(nn.Flatten())
+        return nn.Sequential(*layers)
+    def forward(self, xt: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
+        # CNN features
+        img_features = self.backbone(xt)
+        # Time features
+        if self.use_dit_time_embed:
+            # DiT embedder expects raw timesteps
+            time_features = self.time_embedding(t)
+        else:
+            # Original embedding needs unsqueeze
+            time_features = self.time_embedding(t.unsqueeze(-1))
+        # Combine and classify
+        combined = torch.cat([img_features, time_features], dim=1)
+        return self.classifier(combined)
+class DiTRouter(nn.Module):
+    """DiT B/2 router for cluster classification"""
+    def __init__(self, config):
+        super().__init__()
+        # DiT B/2 specifications
+        default_params = {
+            "hidden_size": 768,      # DiT-B uses 768
+            "num_layers": 12,        # DiT-B uses 12 layers
+            "num_heads": 12,         # DiT-B uses 12 heads
+            "patch_size": 2,         # For latent space (32x32 -> 16x16 patches)
+            "in_channels": 4,        # VAE latent channels
+            "mlp_ratio": 4.0,
+            "use_dit_time_embed": False,  # Whether to use DiT-style time embedding
+        }
+        params = {**default_params, **config.router_params}
+        self.patch_size = params["patch_size"]
+        self.in_channels = params["in_channels"]
+        self.hidden_size = params["hidden_size"]
+        self.num_heads = params["num_heads"]
+        self.num_clusters = config.num_clusters
+        # Patch embedding (same as expert)
+        self.patch_embed = nn.Conv2d(
+            self.in_channels, self.hidden_size,
+            kernel_size=self.patch_size, stride=self.patch_size
+        )
+        # Calculate number of patches
+        latent_size = getattr(config, 'image_size', 32)  # Assuming 256/8=32 for VAE
+        self.num_patches = (latent_size // self.patch_size) ** 2
+        # Fixed sin-cos positional embedding (same as expert)
+        self.pos_embed = nn.Parameter(
+            torch.zeros(1, self.num_patches, self.hidden_size),
+            requires_grad=False
+        )
+        # CLS token (KEY ADDITION from paper)
+        self.cls_token = nn.Parameter(torch.zeros(1, 1, self.hidden_size))
+        # Time embedding - match expert's choice
+        self.use_dit_time_embed = params.get("use_dit_time_embed", False)
+        if self.use_dit_time_embed:
+            self.time_embed = DiTTimestepEmbedder(self.hidden_size)
+        else:
+            self.time_embed = TimestepEmbedder(self.hidden_size)
+        # DiT blocks with AdaLN (reuse DiTBlock from expert)
+        # Note: Router doesn't need text conditioning
+        self.layers = nn.ModuleList([
+            DiTBlock(self.hidden_size, self.num_heads, params["mlp_ratio"], use_text=False)
+            for _ in range(params["num_layers"])
+        ])
+        # Final layer norm
+        self.norm_final = nn.LayerNorm(self.hidden_size, elementwise_affine=False, eps=1e-6)
+        # Linear classifier on CLS token (as specified in paper)
+        # self.head = nn.Linear(self.hidden_size, self.num_clusters)
+        self.head = nn.Sequential(
+            nn.Linear(self.hidden_size, self.hidden_size),
+            nn.GELU(),
+            nn.LayerNorm(self.hidden_size),
+            nn.Dropout(0.1),
+            nn.Linear(self.hidden_size, self.num_clusters)
+        )
+        # Initialize weights
+        self.initialize_weights()
+    def initialize_weights(self):
+        # Initialize transformer layers
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                torch.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+        # Initialize CLS token
+        nn.init.normal_(self.cls_token, std=0.02)
+        # Initialize positional embedding with sin-cos (same as expert)
+        grid_size = int(self.num_patches ** 0.5)
+        pos_embed = get_2d_sincos_pos_embed(self.pos_embed.shape[-1], grid_size)
+        self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
+        # Initialize patch_embed like nn.Linear
+        w = self.patch_embed.weight.data
+        nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+        if self.patch_embed.bias is not None:
+            nn.init.constant_(self.patch_embed.bias, 0)
+        # Initialize timestep embedding MLP
+        if hasattr(self.time_embed, 'mlp'):
+            nn.init.normal_(self.time_embed.mlp[0].weight, std=0.02)
+            nn.init.normal_(self.time_embed.mlp[2].weight, std=0.02)
+        # Zero-out adaLN modulation in blocks (following expert initialization)
+        for block in self.layers:
+            nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
+            nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
+        # # Initialize classification head (simpler version for classification head)
+        # nn.init.constant_(self.head.weight, 0)
+        # nn.init.constant_(self.head.bias, 0)
+        # Initialize classification head (Sequential)
+        # Initialize intermediate layers normally, zero-out final layer
+        nn.init.normal_(self.head[0].weight, std=0.02)  # First linear layer
+        if self.head[0].bias is not None:
+            nn.init.constant_(self.head[0].bias, 0)
+        # Zero-out final classification layer (following DiT paper)
+        nn.init.constant_(self.head[-1].weight, 0)      # Last linear layer
+        if self.head[-1].bias is not None:
+            nn.init.constant_(self.head[-1].bias, 0)
+    def forward(self, xt: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
+        B, C, H, W = xt.shape
+        # Match expert's timestep interpretation
+        if t.max() <= 1.0 and t.min() >= 0.0:
+            t = t * 999.0
+        t = t.clamp(0, 999)
+        # Patchify
+        x = self.patch_embed(xt)  # [B, hidden_size, H//p, W//p]
+        x = x.flatten(2).transpose(1, 2)  # [B, num_patches, hidden_size]
+        # Add positional embedding
+        x = x + self.pos_embed
+        # Prepend CLS token
+        cls_tokens = self.cls_token.expand(B, -1, -1)  # [B, 1, hidden_size]
+        x = torch.cat([cls_tokens, x], dim=1)  # [B, 1 + num_patches, hidden_size]
+        # Time conditioning
+        c = self.time_embed(t)  # [B, hidden_size]
+        # Apply DiT blocks with AdaLN modulation
+        for layer in self.layers:
+            x = layer(x, c, text_emb=None)
+        # Extract CLS token and apply final norm
+        cls_output = x[:, 0]  # [B, hidden_size]
+        cls_output = self.norm_final(cls_output)
+        # Linear classification head
+        logits = self.head(cls_output)  # [B, num_clusters]
+        return logits
+# =============================================================================
+# DETERMINISTIC ROUTER (for controlled experiments)
+# =============================================================================
+class DeterministicTimestepRouter(nn.Module):
+    """
+    Deterministic router that assigns experts based on timestep.
+    Useful for controlled experiments where you want to test specific routing strategies,
+    such as: "high noise → DDPM expert, low noise → FM expert"
+    Args:
+        config: Config object with router_params containing:
+            - timestep_threshold: t value to switch experts (default: 0.5)
+            - high_noise_expert: Expert ID for t > threshold (default: 0, typically DDPM)
+            - low_noise_expert: Expert ID for t <= threshold (default: 1, typically FM)
+    Example config:
+        router_architecture: "deterministic_timestep"
+        router_params:
+            timestep_threshold: 0.5
+            high_noise_expert: 0  # DDPM for high noise
+            low_noise_expert: 1   # FM for low noise
+    """
+    def __init__(self, config):
+        super().__init__()
+        self.num_experts = config.num_experts
+        self.threshold = config.router_params.get('timestep_threshold', 0.5)
+        self.high_noise_expert = config.router_params.get('high_noise_expert', 0)
+        self.low_noise_expert = config.router_params.get('low_noise_expert', 1)
+        # Validate expert IDs
+        assert 0 <= self.high_noise_expert < self.num_experts, \
+            f"high_noise_expert {self.high_noise_expert} out of range [0, {self.num_experts})"
+        assert 0 <= self.low_noise_expert < self.num_experts, \
+            f"low_noise_expert {self.low_noise_expert} out of range [0, {self.num_experts})"
+        # Validate threshold
+        assert 0.0 <= self.threshold <= 1.0, \
+            f"timestep_threshold {self.threshold} must be in [0, 1]"
+        # This router has no trainable parameters
+        # Register threshold as buffer (not trained, but saved with model)
+        self.register_buffer('_threshold', torch.tensor(self.threshold))
+        print(f"DeterministicTimestepRouter initialized:")
+        print(f"  Threshold: {self.threshold}")
+        print(f"  High noise (t > {self.threshold}) → Expert {self.high_noise_expert}")
+        print(f"  Low noise (t <= {self.threshold}) → Expert {self.low_noise_expert}")
+    def forward(self, x: torch.Tensor, t: torch.Tensor, **kwargs) -> torch.Tensor:
+        """
+        Returns one-hot routing probabilities based on timestep.
+        Args:
+            x: Input tensor (unused, but kept for API compatibility with other routers)
+            t: Timesteps, shape (B,)
+        Returns:
+            routing_probs: Shape (B, num_experts), one-hot encoded
+        """
+        B = t.shape[0]
+        device = t.device
+        # Initialize routing probabilities (all zeros)
+        routing_probs = torch.zeros(B, self.num_experts, device=device)
+        # High noise (t > threshold) → high_noise_expert
+        # Low noise (t <= threshold) → low_noise_expert
+        high_noise_mask = t > self.threshold
+        routing_probs[high_noise_mask, self.high_noise_expert] = 1.0
+        routing_probs[~high_noise_mask, self.low_noise_expert] = 1.0
+        return routing_probs
+    def train(self, mode: bool = True):
+        """Override train() - this router is never trained, always in eval mode"""
+        return super(DeterministicTimestepRouter, self).train(False)
+# =============================================================================
+# ADAPTIVE VIDEO ROUTER (for Video DDM)
+# =============================================================================
+class AdaptiveVideoRouter(nn.Module):
+    """
+    Time-adaptive router for video DDM.
+    Key innovation: Learns optimal weighting of information sources
+    at each noise level, solving the "motion invisible at t=1" problem.
+    Information availability is time-dependent:
+        t ~ 1.0: Only text/first_frame informative → Route on conditioning
+        t ~ 0.5: Structure emerging → Latent becomes useful
+        t ~ 0.1: Near clean → Full information available
+    Expected learned behavior:
+        | Noise Level | Text | Frame | Latent | Behavior                    |
+        |-------------|------|-------|--------|-----------------------------|
+        | t ~ 1.0     | ~0.7 | ~0.2  | ~0.1   | Routes on text semantics    |
+        | t ~ 0.5     | ~0.4 | ~0.3  | ~0.3   | Balanced; emerging structure|
+        | t ~ 0.1     | ~0.2 | ~0.2  | ~0.6   | Trusts latent; fine-grained |
+    Enhancements:
+        - Masked mean pooling for text (handles variable-length prompts)
+        - Temporal-aware latent encoder (captures motion patterns)
+        - Temperature scaling for inference control
+    """
+    def __init__(self, config):
+        super().__init__()
+        # Default params
+        default_params = {
+            "hidden_dim": 512,
+            "text_embed_dim": 768,      # CLIP-L text embedding dimension
+            "frame_embed_dim": 768,     # DINOv2-B (base) feature dimension
+            "latent_channels": 16,      # VAE latent channels (CogVideoX uses 16)
+            "latent_conv_dim": 64,      # Intermediate conv channels for latent encoder
+            "dropout": 0.1,
+            "temporal_pool_mode": "attention",  # "attention", "avg", or "max"
+            "normalize_inputs": True,   # L2-normalize text/frame inputs (match clustering)
+        }
+        params = {**default_params, **getattr(config, 'router_params', {})}
+        self.hidden_dim = params["hidden_dim"]
+        self.num_experts = getattr(config, 'num_experts', config.num_clusters)
+        self.latent_channels = params["latent_channels"]
+        self.latent_conv_dim = params["latent_conv_dim"]
+        self.temporal_pool_mode = params["temporal_pool_mode"]
+        self.normalize_inputs = params.get("normalize_inputs", True)
+        # === Information Source Encoders ===
+        # Text pathway (always available, primary signal at high t)
+        self.text_encoder = nn.Sequential(
+            nn.Linear(params["text_embed_dim"], self.hidden_dim),
+            nn.LayerNorm(self.hidden_dim),
+            nn.GELU(),
+            nn.Linear(self.hidden_dim, self.hidden_dim)
+        )
+        # First frame pathway (available for I2V tasks)
+        # Uses DINOv2 features extracted from the first frame
+        self.frame_encoder = nn.Sequential(
+            nn.Linear(params["frame_embed_dim"], self.hidden_dim),
+            nn.LayerNorm(self.hidden_dim),
+            nn.GELU(),
+            nn.Linear(self.hidden_dim, self.hidden_dim)
+        )
+        # === Temporal-Aware Latent Encoder ===
+        # Captures both spatial content and temporal motion patterns
+        # Spatial feature extraction (per-frame)
+        self.spatial_conv = nn.Sequential(
+            nn.Conv3d(params["latent_channels"], params["latent_conv_dim"],
+                     kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # Spatial only
+            nn.GroupNorm(8, params["latent_conv_dim"]),
+            nn.GELU(),
+        )
+        # Temporal feature extraction (motion patterns)
+        self.temporal_conv = nn.Sequential(
+            nn.Conv3d(params["latent_conv_dim"], params["latent_conv_dim"],
+                     kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # Temporal only
+            nn.GroupNorm(8, params["latent_conv_dim"]),
+            nn.GELU(),
+        )
+        # Combined spatio-temporal processing
+        self.st_conv = nn.Sequential(
+            nn.Conv3d(params["latent_conv_dim"], params["latent_conv_dim"],
+                     kernel_size=3, padding=1),  # Full 3D
+            nn.GroupNorm(8, params["latent_conv_dim"]),
+            nn.GELU(),
+        )
+        # Spatial pooling (keep temporal dimension)
+        self.spatial_pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # [B, C, T, 1, 1]
+        # Temporal attention pooling (learns which frames matter for routing)
+        if self.temporal_pool_mode == "attention":
+            self.temporal_attn = nn.Sequential(
+                nn.Linear(params["latent_conv_dim"], params["latent_conv_dim"] // 4),
+                nn.Tanh(),
+                nn.Linear(params["latent_conv_dim"] // 4, 1),
+            )
+        # Motion feature extractor (frame differences)
+        self.motion_encoder = nn.Sequential(
+            nn.Linear(params["latent_conv_dim"], params["latent_conv_dim"]),
+            nn.GELU(),
+            nn.Linear(params["latent_conv_dim"], self.hidden_dim // 2),
+        )
+        # Content feature projector
+        self.content_proj = nn.Linear(params["latent_conv_dim"], self.hidden_dim // 2)
+        # Final latent projection (combines content + motion)
+        self.latent_proj = nn.Sequential(
+            nn.Linear(self.hidden_dim, self.hidden_dim),
+            nn.LayerNorm(self.hidden_dim),
+        )
+        # === Time-Dependent Weighting ===
+        # Time embedding using existing infrastructure
+        self.time_embed = TimestepEmbedder(self.hidden_dim)
+        self.time_mlp = nn.Sequential(
+            nn.Linear(self.hidden_dim, self.hidden_dim),
+            nn.GELU(),
+            nn.Linear(self.hidden_dim, self.hidden_dim)
+        )
+        # Learns adaptive weighting: at high t → trust text; at low t → trust latent
+        self.source_weighting = nn.Sequential(
+            nn.Linear(self.hidden_dim, 128),
+            nn.GELU(),
+            nn.Linear(128, 3),  # [text, frame, latent] weights
+            nn.Softmax(dim=-1)
+        )
+        # === Routing Head ===
+        self.router_head = nn.Sequential(
+            nn.Linear(self.hidden_dim, self.hidden_dim),
+            nn.GELU(),
+            nn.LayerNorm(self.hidden_dim),
+            nn.Dropout(params["dropout"]),
+            nn.Linear(self.hidden_dim, self.num_experts)
+        )
+        # Initialize weights
+        self.initialize_weights()
+    def initialize_weights(self):
+        """Initialize weights following DiT conventions."""
+        def _basic_init(module):
+            if isinstance(module, nn.Linear):
+                torch.nn.init.xavier_uniform_(module.weight)
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+            elif isinstance(module, nn.Conv3d):
+                # Flatten spatial dims for xavier init
+                w = module.weight.data
+                nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
+                if module.bias is not None:
+                    nn.init.constant_(module.bias, 0)
+        self.apply(_basic_init)
+        # Initialize timestep embedding MLP (following DiT)
+        if hasattr(self.time_embed, 'mlp'):
+            nn.init.normal_(self.time_embed.mlp[0].weight, std=0.02)
+            nn.init.normal_(self.time_embed.mlp[2].weight, std=0.02)
+        # Small non-zero initialization for final routing layer
+        # (pure zeros cause uniform outputs that break temperature scaling)
+        nn.init.normal_(self.router_head[-1].weight, std=0.01)
+        nn.init.constant_(self.router_head[-1].bias, 0)
+        # Initialize source weighting to start roughly uniform
+        # The softmax will make [0, 0, 0] → [0.33, 0.33, 0.33]
+        nn.init.constant_(self.source_weighting[-2].weight, 0)
+        nn.init.constant_(self.source_weighting[-2].bias, 0)
+        # Initialize temporal attention to uniform attention
+        if self.temporal_pool_mode == "attention":
+            nn.init.constant_(self.temporal_attn[-1].weight, 0)
+            nn.init.constant_(self.temporal_attn[-1].bias, 0)
+    def _masked_mean_pool(self, embeddings: torch.Tensor, attention_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
+        """
+        Compute masked mean pooling over sequence dimension.
+        Args:
+            embeddings: [B, seq_len, embed_dim] - Token embeddings
+            attention_mask: [B, seq_len] - 1 for real tokens, 0 for padding
+        Returns:
+            pooled: [B, embed_dim] - Pooled representation
+        """
+        if attention_mask is None:
+            # No mask provided, use simple mean
+            return embeddings.mean(dim=1)
+        # Expand mask for broadcasting: [B, seq_len] -> [B, seq_len, 1]
+        mask = attention_mask.unsqueeze(-1).to(embeddings.dtype)
+        # Masked sum
+        masked_sum = (embeddings * mask).sum(dim=1)  # [B, embed_dim]
+        # Count of valid tokens (avoid division by zero)
+        token_counts = mask.sum(dim=1).clamp(min=1.0)  # [B, 1]
+        return masked_sum / token_counts
+    def _encode_latent_temporal(self, x_t: torch.Tensor) -> torch.Tensor:
+        """
+        Encode video latent with temporal awareness.
+        Extracts both:
+        - Content features: What is in the video (spatial)
+        - Motion features: How things move (temporal differences)
+        Args:
+            x_t: [B, C, T, H, W] - Noisy video latent
+        Returns:
+            latent_feat: [B, hidden_dim] - Combined latent features
+        """
+        B, C, T, H, W = x_t.shape
+        # 1. Spatial feature extraction
+        spatial_feat = self.spatial_conv(x_t)  # [B, conv_dim, T, H, W]
+        # 2. Temporal feature extraction (captures local motion)
+        temporal_feat = self.temporal_conv(spatial_feat)  # [B, conv_dim, T, H, W]
+        # 3. Combined spatio-temporal processing
+        st_feat = self.st_conv(temporal_feat)  # [B, conv_dim, T, H, W]
+        # 4. Pool spatially, keep temporal: [B, conv_dim, T, 1, 1] -> [B, T, conv_dim]
+        pooled = self.spatial_pool(st_feat).squeeze(-1).squeeze(-1)  # [B, conv_dim, T]
+        pooled = pooled.permute(0, 2, 1)  # [B, T, conv_dim]
+        # 5. Temporal pooling with optional attention
+        if self.temporal_pool_mode == "attention" and T > 1:
+            # Learn which frames matter for routing
+            attn_scores = self.temporal_attn(pooled).squeeze(-1)  # [B, T]
+            attn_weights = F.softmax(attn_scores, dim=-1)  # [B, T]
+            content_feat = (pooled * attn_weights.unsqueeze(-1)).sum(dim=1)  # [B, conv_dim]
+        elif self.temporal_pool_mode == "max":
+            content_feat = pooled.max(dim=1)[0]  # [B, conv_dim]
+        else:  # "avg"
+            content_feat = pooled.mean(dim=1)  # [B, conv_dim]
+        # 6. Extract motion features (frame differences)
+        if T > 1:
+            # Compute frame-to-frame differences
+            frame_diffs = pooled[:, 1:] - pooled[:, :-1]  # [B, T-1, conv_dim]
+            # Motion magnitude and direction encoding
+            motion_feat = self.motion_encoder(frame_diffs.mean(dim=1))  # [B, hidden_dim//2]
+        else:
+            # Single frame, no motion
+            motion_feat = torch.zeros(B, self.hidden_dim // 2, device=x_t.device)
+        # 7. Project content features
+        content_proj = self.content_proj(content_feat)  # [B, hidden_dim//2]
+        # 8. Combine content + motion
+        combined = torch.cat([content_proj, motion_feat], dim=-1)  # [B, hidden_dim]
+        latent_feat = self.latent_proj(combined)  # [B, hidden_dim]
+        return latent_feat
+    def forward(self, x_t: torch.Tensor, t: torch.Tensor,
+                text_embed: torch.Tensor,
+                first_frame_feat: Optional[torch.Tensor] = None,
+                attention_mask: Optional[torch.Tensor] = None,
+                temperature: float = 1.0) -> torch.Tensor:
+        """
+        Compute routing logits with time-adaptive information weighting.
+        Args:
+            x_t: Noisy video latent [B, C, T, H, W]
+            t: Noise level [B] in [0, 1] or [0, 999]
+            text_embed: CLIP text embedding [B, text_embed_dim] or [B, seq_len, text_embed_dim]
+            first_frame_feat: Optional DINOv2 features [B, frame_embed_dim]
+            attention_mask: Optional [B, seq_len] mask for text (1=valid, 0=padding)
+            temperature: Softmax temperature for sharper/softer routing (default: 1.0)
+        Returns:
+            logits: Expert selection logits [B, num_experts] (scaled by temperature)
+        """
+        B = x_t.shape[0]
+        device = x_t.device
+        # === Encode each information source ===
+        # Handle both pooled [B, D] and sequence [B, seq_len, D] text embeddings
+        if text_embed.dim() == 3:
+            # Use masked mean pooling for sequence embeddings
+            text_embed_pooled = self._masked_mean_pool(text_embed, attention_mask)
+        else:
+            # Already pooled
+            text_embed_pooled = text_embed
+        # L2-normalize inputs to match clustering preprocessing
+        if self.normalize_inputs:
+            text_embed_pooled = F.normalize(text_embed_pooled, p=2, dim=-1)
+        text_feat = self.text_encoder(text_embed_pooled)  # [B, hidden_dim]
+        # Frame features (optional for T2V, required for I2V)
+        if first_frame_feat is not None:
+            # L2-normalize to match clustering preprocessing
+            if self.normalize_inputs:
+                first_frame_feat = F.normalize(first_frame_feat, p=2, dim=-1)
+            frame_feat = self.frame_encoder(first_frame_feat)  # [B, hidden_dim]
+        else:
+            frame_feat = torch.zeros(B, self.hidden_dim, device=device)
+        # Latent features from noisy video (temporal-aware encoding)
+        latent_feat = self._encode_latent_temporal(x_t)  # [B, hidden_dim]
+        # === Time-dependent weighting ===
+        # Normalize timesteps to [0, 999] for TimestepEmbedder
+        if t.max() <= 1.0:
+            t_scaled = t * 999.0
+        else:
+            t_scaled = t
+        t_scaled = t_scaled.clamp(0, 999)
+        # Get time features
+        time_emb = self.time_embed(t_scaled)  # [B, hidden_dim]
+        time_feat = self.time_mlp(time_emb)   # [B, hidden_dim]
+        # Compute adaptive weights based on noise level
+        # Network learns: high t → high text weight; low t → high latent weight
+        weights = self.source_weighting(time_feat)  # [B, 3]
+        # === Adaptive combination ===
+        combined = (
+            weights[:, 0:1] * text_feat +    # Text contribution
+            weights[:, 1:2] * frame_feat +   # Frame contribution
+            weights[:, 2:3] * latent_feat    # Latent contribution
+        )
+        # Final routing decision (incorporate time context)
+        logits = self.router_head(combined + time_feat)
+        # Apply temperature scaling (lower temp = sharper routing)
+        if temperature != 1.0:
+            logits = logits / temperature
+        return logits
+    def get_source_weights(self, t: torch.Tensor) -> torch.Tensor:
+        """
+        Get the learned source weights for given timesteps.
+        Useful for debugging and visualization.
+        Args:
+            t: Noise levels [B] in [0, 1] or [0, 999]
+        Returns:
+            weights: Source weights [B, 3] for [text, frame, latent]
+        """
+        # Normalize timesteps
+        if t.max() <= 1.0:
+            t_scaled = t * 999.0
+        else:
+            t_scaled = t
+        t_scaled = t_scaled.clamp(0, 999)
+        time_emb = self.time_embed(t_scaled)
+        time_feat = self.time_mlp(time_emb)
+        weights = self.source_weighting(time_feat)
+        return weights
+# =============================================================================
+# MODEL FACTORY FUNCTIONS
+# =============================================================================
+def create_expert(config, expert_id: Optional[int] = None) -> nn.Module:
+    """
+    Factory function to create expert model
+    Args:
+        config: Config object
+        expert_id: Optional expert ID for per-expert schedule/objective configuration
+    """
+    # Make a copy of config to avoid modifying the original
+    import copy
+    config = copy.copy(config)
+    config.expert_params = config.expert_params.copy()
+    # Inject schedule_type into expert_params if not already present
+    if "schedule_type" not in config.expert_params:
+        # Check for per-expert schedule first (with backward compatibility)
+        if (hasattr(config, 'expert_schedule_types') and
+            config.expert_schedule_types and
+            expert_id is not None and
+            expert_id in config.expert_schedule_types):
+            config.expert_params["schedule_type"] = config.expert_schedule_types[expert_id]
+        else:
+            # Use default schedule_type (with fallback for old configs)
+            config.expert_params["schedule_type"] = getattr(config, 'schedule_type', 'linear_interp')
+    # Inject objective_type into expert_params if not already present
+    if "objective_type" not in config.expert_params:
+        # Check for per-expert objectives (with backward compatibility)
+        if (hasattr(config, 'expert_objectives') and
+            config.expert_objectives and
+            expert_id is not None and
+            expert_id in config.expert_objectives):
+            config.expert_params["objective_type"] = config.expert_objectives[expert_id]
+        else:
+            # Use default objective (with fallback for old configs)
+            config.expert_params["objective_type"] = getattr(config, 'default_objective', 'fm')
+    if config.expert_architecture == "unet":
+        return UNetExpert(config)
+    elif config.expert_architecture == "simple_cnn":
+        return SimpleCNNExpert(config)
+    elif config.expert_architecture == "dit":
+        return DiTExpert(config)
+    else:
+        raise ValueError(f"Unknown expert architecture: {config.expert_architecture}")
+def create_router(config) -> Optional[nn.Module]:
+    """Factory function to create router model"""
+    if config.router_architecture == "none" or config.is_monolithic:
+        return None
+    elif config.router_architecture == "deterministic_timestep":
+        return DeterministicTimestepRouter(config)
+    elif config.router_architecture == "vit":
+        return ViTRouter(config)
+    elif config.router_architecture == "cnn":
+        return CNNRouter(config)
+    elif config.router_architecture == "dit":
+        return DiTRouter(config)
+    elif config.router_architecture == "adaptive_video":
+        return AdaptiveVideoRouter(config)
+    else:
+        raise ValueError(f"Unknown router architecture: {config.router_architecture}")

src/schedules.py ADDED Viewed

	@@ -0,0 +1,166 @@

+# src/schedules.py
+"""
+Centralized noise schedule manager for diffusion models.
+Supports three schedules:
+1. 'cosine': Cosine schedule (Nichol & Dhariwal 2021)
+2. 'linear_beta': Linear beta schedule (Ho et al. 2020)
+3. 'linear_interp': Linear interpolation - Flow Matching default
+All schedules return (alpha_t, sigma_t) such that:
+    x_t = alpha_t * x_0 + sigma_t * epsilon
+    alpha_t^2 + sigma_t^2 = 1  (variance preserving)
+"""
+import torch
+import math
+from typing import Tuple
+class NoiseSchedule:
+    """
+    Centralized noise schedule manager.
+    Args:
+        schedule_type: One of ['cosine', 'linear_beta', 'linear_interp']
+    """
+    def __init__(self, schedule_type: str = 'linear_interp'):
+        assert schedule_type in ['cosine', 'linear_beta', 'linear_interp'], \
+            f"Unknown schedule: {schedule_type}. Must be one of ['cosine', 'linear_beta', 'linear_interp']"
+        self.schedule_type = schedule_type
+        # Linear beta schedule parameters (if used)
+        self.beta_min = 0.0001
+        self.beta_max = 0.02
+        self.num_timesteps = 1000  # T in discrete formulation
+        # Cosine schedule parameter
+        self.s = 0.008  # Small offset to prevent beta from being too small near t=0
+    def get_schedule(self, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Get (alpha_t, sigma_t) for given timesteps.
+        Args:
+            t: Tensor of timesteps in [0, 1], shape (B,)
+        Returns:
+            alpha_t: Shape (B,), coefficient for x_0
+            sigma_t: Shape (B,), coefficient for epsilon
+        """
+        if self.schedule_type == 'cosine':
+            return self._cosine_schedule(t)
+        elif self.schedule_type == 'linear_beta':
+            return self._linear_beta_schedule(t)
+        elif self.schedule_type == 'linear_interp':
+            return self._linear_interpolation(t)
+    def _cosine_schedule(self, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Cosine schedule: alpha_bar_t = f(t) / f(0)
+        where f(t) = cos²((t + s)/(1 + s) * π/2)
+        Reference: "Improved Denoising Diffusion Probabilistic Models"
+        (Nichol & Dhariwal, 2021)
+        This schedule provides better conditioning than linear beta schedule,
+        especially at very small and very large t values.
+        """
+        # Compute f(t) = cos²((t + s)/(1 + s) * π/2)
+        f_t = torch.cos(((t + self.s) / (1 + self.s)) * math.pi * 0.5) ** 2
+        # Compute f(0) for normalization to ensure alpha_bar_0 = 1
+        f_0 = math.cos((self.s / (1 + self.s)) * math.pi * 0.5) ** 2
+        # Normalize: alpha_bar_t = f(t) / f(0)
+        alpha_bar_t = f_t / f_0
+        # Clamp to ensure numerical stability
+        alpha_bar_t = torch.clamp(alpha_bar_t, min=1e-8, max=1.0)
+        # Compute coefficients
+        alpha_t = torch.sqrt(alpha_bar_t)
+        sigma_t = torch.sqrt(1 - alpha_bar_t)
+        return alpha_t, sigma_t
+    def _linear_beta_schedule(self, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Linear beta schedule: beta_t increases linearly from beta_min to beta_max
+        Reference: "Denoising Diffusion Probabilistic Models" (Ho et al., 2020)
+        For continuous time t ∈ [0,1]:
+        beta(t) = beta_min + t * (beta_max - beta_min)
+        alpha_bar(t) = exp(-0.5 * integral_0^t beta(s) ds)
+                     = exp(-0.5 * t * (beta_min + 0.5 * t * (beta_max - beta_min)))
+        """
+        # Compute alpha_bar(t) = exp(-0.5 * integral beta(s) ds)
+        # integral_0^t (beta_min + s * (beta_max - beta_min)) ds
+        #   = beta_min * t + 0.5 * t^2 * (beta_max - beta_min)
+        integral_beta = self.beta_min * t + 0.5 * t * t * (self.beta_max - self.beta_min)
+        alpha_bar_t = torch.exp(-0.5 * integral_beta * self.num_timesteps)
+        # Compute coefficients
+        alpha_t = torch.sqrt(alpha_bar_t)
+        sigma_t = torch.sqrt(1 - alpha_bar_t)
+        return alpha_t, sigma_t
+    def _linear_interpolation(self, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
+        """
+        Linear interpolation: x_t = (1-t) * x_0 + t * epsilon
+        This is the default for Flow Matching but NOT a proper DDPM schedule.
+        This is what the current implementation uses.
+        """
+        alpha_t = 1 - t
+        sigma_t = t
+        return alpha_t, sigma_t
+    def get_snr(self, t: torch.Tensor) -> torch.Tensor:
+        """
+        Compute signal-to-noise ratio (SNR) = alpha_t^2 / sigma_t^2
+        Useful for:
+        1. Time warping between different schedules
+        2. Analysis and visualization
+        Args:
+            t: Tensor of timesteps in [0, 1]
+        Returns:
+            snr: Signal-to-noise ratio at each timestep
+        """
+        alpha_t, sigma_t = self.get_schedule(t)
+        snr = (alpha_t ** 2) / (sigma_t ** 2 + 1e-8)
+        return snr
+    def alpha_to_time(self, alpha: torch.Tensor, num_steps: int = 100) -> torch.Tensor:
+        """
+        Inverse mapping: given alpha, find t
+        Used for inference when you want to specify noise levels directly.
+        Uses binary search since schedules are monotonic.
+        Args:
+            alpha: Desired alpha values
+            num_steps: Number of steps for binary search
+        Returns:
+            t: Corresponding timesteps
+        """
+        device = alpha.device
+        # Binary search for t
+        t_candidates = torch.linspace(0, 1, num_steps, device=device)
+        alpha_candidates, _ = self.get_schedule(t_candidates)
+        # Find closest match
+        distances = torch.abs(alpha_candidates.unsqueeze(0) - alpha.unsqueeze(1))
+        indices = torch.argmin(distances, dim=1)
+        t = t_candidates[indices]
+        return t

src/vae_utils.py ADDED Viewed

	@@ -0,0 +1,186 @@

+# src/vae_utils.py
+import torch
+import torch.nn.functional as F
+from diffusers import AutoencoderKL
+from typing import Optional
+import numpy as np
+class VAEManager:
+    """Utility class for VAE encoding/decoding operations"""
+    def __init__(self, model_name: str = "stabilityai/sd-vae-ft-mse", device: str = "cuda"):
+        self.device = device
+        self.model_name = model_name
+        self.vae = None
+        self._load_vae()
+    def _load_vae(self):
+        """Load VAE model"""
+        print(f"Loading VAE: {self.model_name}")
+        self.vae = AutoencoderKL.from_pretrained(self.model_name)
+        self.vae = self.vae.to(self.device)
+        self.vae.eval()
+        # Freeze VAE parameters
+        for param in self.vae.parameters():
+            param.requires_grad = False
+    def encode(self, images: torch.Tensor) -> torch.Tensor:
+        """
+        Encode images to latent space
+        Args:
+            images: Tensor of shape [B, 3, H, W] in range [-1, 1]
+        Returns:
+            latents: Tensor of shape [B, 4, H//8, W//8]
+        """
+        with torch.no_grad():
+            images = images.to(self.device)
+            latent_dist = self.vae.encode(images).latent_dist
+            latents = latent_dist.sample()
+            latents = latents * self.vae.config.scaling_factor
+        return latents
+    def decode(self, latents: torch.Tensor, upscale_factor: Optional[float] = None,
+               upscale_mode: str = 'bicubic') -> torch.Tensor:
+        """
+        Decode latents to images
+        Args:
+            latents: Tensor of shape [B, 4, H, W]
+            upscale_factor: Optional upscaling factor (e.g., 2.0 for 2x, 1.5 for 1.5x)
+                          If None, returns images at native resolution (H*8, W*8)
+            upscale_mode: Interpolation mode ('bicubic', 'bilinear', 'nearest')
+        Returns:
+            images: Tensor of shape [B, 3, H*8*upscale_factor, W*8*upscale_factor] in range [-1, 1]
+        """
+        with torch.no_grad():
+            latents = latents.to(self.device)
+            # Rescale latents
+            latents = latents / self.vae.config.scaling_factor
+            images = self.vae.decode(latents).sample
+            # Apply upscaling if requested
+            if upscale_factor is not None and upscale_factor != 1.0:
+                _, _, h, w = images.shape
+                new_h = int(h * upscale_factor)
+                new_w = int(w * upscale_factor)
+                images = F.interpolate(
+                    images,
+                    size=(new_h, new_w),
+                    mode=upscale_mode,
+                    align_corners=False if upscale_mode in ['bilinear', 'bicubic'] else None,
+                    antialias=True if upscale_mode in ['bilinear', 'bicubic'] else False
+                )
+        return images
+    def decode_to_pil(self, latents: torch.Tensor, upscale_factor: Optional[float] = None,
+                     upscale_mode: str = 'bicubic', target_size: Optional[tuple] = None):
+        """
+        Decode latents to PIL images
+        Args:
+            latents: Tensor of shape [B, 4, H, W]
+            upscale_factor: Optional upscaling factor (e.g., 2.0 for 2x)
+            upscale_mode: Interpolation mode ('bicubic', 'bilinear', 'nearest')
+            target_size: Optional target size as (height, width). Overrides upscale_factor if provided.
+        Returns:
+            pil_images: List of PIL images
+        """
+        from PIL import Image
+        # Decode to tensor
+        images = self.decode(latents, upscale_factor=upscale_factor, upscale_mode=upscale_mode)
+        # Apply target size if specified
+        if target_size is not None:
+            images = F.interpolate(
+                images,
+                size=target_size,
+                mode=upscale_mode,
+                align_corners=False if upscale_mode in ['bilinear', 'bicubic'] else None,
+                antialias=True if upscale_mode in ['bilinear', 'bicubic'] else False
+            )
+        # Convert to [0, 1] range
+        images = (images + 1.0) / 2.0
+        images = torch.clamp(images, 0, 1)
+        # Convert to PIL
+        pil_images = []
+        for i in range(images.shape[0]):
+            img_array = images[i].cpu().numpy().transpose(1, 2, 0)
+            img_array = (img_array * 255).astype(np.uint8)
+            pil_image = Image.fromarray(img_array)
+            pil_images.append(pil_image)
+        return pil_images
+    @property
+    def scaling_factor(self) -> float:
+        """Get VAE scaling factor"""
+        return self.vae.config.scaling_factor
+    @property
+    def latent_channels(self) -> int:
+        """Get number of latent channels"""
+        return 4  # Standard for Stable Diffusion VAE
+def create_vae_manager(model_name: str = "stabilityai/sd-vae-ft-mse", device: str = "cuda") -> VAEManager:
+    """Factory function to create VAE manager"""
+    return VAEManager(model_name, device)
+def save_images_from_latents(latents: torch.Tensor, save_dir: str, vae_manager: VAEManager, prefix: str = "sample"):
+    """
+    Save images from latents using VAE decoder
+    Args:
+        latents: Tensor of shape [B, 4, H, W]
+        save_dir: Directory to save images
+        vae_manager: VAE manager instance
+        prefix: Filename prefix
+    """
+    import os
+    os.makedirs(save_dir, exist_ok=True)
+    # Decode to PIL images
+    pil_images = vae_manager.decode_to_pil(latents)
+    # Save each image
+    for i, pil_image in enumerate(pil_images):
+        save_path = os.path.join(save_dir, f"{prefix}_{i:03d}.png")
+        pil_image.save(save_path)
+    print(f"Saved {len(pil_images)} images to {save_dir}")
+def create_image_grid(latents: torch.Tensor, vae_manager: VAEManager, nrow: int = 4) -> torch.Tensor:
+    """
+    Create image grid from latents
+    Args:
+        latents: Tensor of shape [B, 4, H, W]
+        vae_manager: VAE manager instance
+        nrow: Number of images per row
+    Returns:
+        grid: Image grid tensor
+    """
+    import torchvision.utils as vutils
+    # Decode latents
+    images = vae_manager.decode(latents)
+    # Convert to [0, 1] range
+    images = (images + 1.0) / 2.0
+    images = torch.clamp(images, 0, 1)
+    # Create grid
+    grid = vutils.make_grid(images, nrow=nrow, padding=2)
+    return grid

weights/bf16/config.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bf54162afaf045deefb715e9834ed60948d7494354e866e70e76ddaebe575a78
+size 2908

weights/bf16/expert_0.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4a069731935a6285a64e2379c554371997ff32ad1f6c956422cfb83a8086549d
+size 1211979376

weights/bf16/expert_1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5a5d45e5b96ce31cc3c2c9d8f903fb75c7d0b757be96212ec345ee0e78037d48
+size 1211979376

weights/bf16/expert_2.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9fa3505dfa75f4b82894064cc3c3b70aa6f409796dc7cda8bc14ce3572268a44
+size 1211979376

weights/bf16/expert_3.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c9f53a42c3690ff8e27187a6c42770c888a1ce2fca8c132e181433870a6b4797
+size 1211979376

weights/bf16/expert_4.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:58a367b34eb486789f9e8709384ad45d69768ac302a896fac85bd512134cdb3b
+size 1211979376

weights/bf16/expert_5.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:37c9ce6fa79faa97a029de00fcdedc7e96dbc5de36deabc953ad2ee95c2ab0ad
+size 1211979376

weights/bf16/expert_6.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1fb83aad644a4fe22cd661cc4bedd49c73815dfc91bf81caf6a89dc21f1f90b3
+size 1211979376

weights/bf16/expert_7.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c1d37b9b495d74121080237dbed32a5042ecbd7ed8ed619519cc2946f26e199b
+size 1211979376

weights/bf16/router.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ff8aaa22f59e382227b3b9fe6527010a6929e8b0b7c4322213b392a0ca03a1bf
+size 258286840

weights/bf16/router_config.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e951c1c39ad5401b33bb3147f62803d76303a7b7ca0e457e4cc0aaf1e585bb5
+size 2744

weights/int8/expert_0.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4a0942e1503de55b07393582bb231fc0c8358cb8f03b329c3e282f8c4a8b861c
+size 606080694

weights/int8/expert_1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e6a9ad48d7b84574a122f52ef6d619cb8c5d9f3766c1a55af5f0b5d463fbd109
+size 606080672

weights/int8/expert_2.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:baa7887b97f60db4532682be3701f6e9fc9a9dec1446af00ff3f1515055f888e
+size 606080694

weights/int8/expert_3.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c36196779aba27cef9ea66f5775a9ba43886a7811904fb66a9b23b0095800da9
+size 606080694

weights/int8/expert_4.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d255e3a6d89f0bfbff7461dff0fb27fa206a4b9e98a83be87121327d9cac56f7
+size 606080694

weights/int8/expert_5.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5cabe2f9deba779a5ea14a4d2e038a9aefd0ed5a4d2cd1b7776cd10939ffb21
+size 606080694

weights/int8/expert_6.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f0470248ca040b321a5e72ce73f05c32c9d0cbe8515115021fb0f6065cc3599d
+size 606080694

weights/int8/expert_7.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6f0047faeb6e7d0e5acc0abcbdede83789a5fec6e6d93ef3c4d5903785dd4660
+size 606080694

weights/int8/router.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2a52cce497dd02a88804bc81669eba0ab4957dd2b3c54b8de781dabb5a8c15b2
+size 256740839