krystv
/

PMA-VAE

krystv committed 27 days ago

Commit

d46a7a4

verified ·

1 Parent(s): bc982c7

Upload PMA_VAE_Colab_Training.ipynb with huggingface_hub

PMA_VAE_Colab_Training.ipynb +218 -197

@@ -17,7 +17,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# 🎨 PMA-VAE: Parallel Mobile Artistic Variational Autoencoder\n",
     "\n",
     "**A novel attention-free architecture for image generation, super-resolution, artifact removal, and artistic style transfer.**\n",
     "\n",
@@ -31,17 +31,17 @@
     "\n",
     "## Architecture\n",
     "```\n",
-    "Image → PixelUnshuffle stem → MobileConv stages → Parallel 2D Mamba blocks\n",
-    "  → Multi-scale latent (z_base H/16, z_detail H/8, z_style global)\n",
-    "  → Light parallel decoder with FiLM style modulation → Reconstructed image\n",
     "```\n",
     "\n",
     "## Key Design Decisions\n",
-    "- **Parallel scan SSM** (Blelloch algorithm) — pure PyTorch, no CUDA kernels needed\n",
-    "- **Cross-scan 2D** (VMamba-style) — 4 directional scans for global context without attention\n",
-    "- **PixelShuffle upsampling** — efficient sub-pixel convolution for mobile\n",
-    "- **Taming-transformers loss recipe** — adaptive discriminator weight balancing\n",
-    "- **Progressive resolution training** — start small, scale up\n",
     "\n",
     "---\n",
     "**Trainable on free Colab T4 GPU (15GB VRAM) in ~2-4 hours for meaningful results.**"
@@ -68,7 +68,7 @@
     "    print(f'GPU: {torch.cuda.get_device_name(0)}')\n",
     "    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')\n",
     "else:\n",
-    "    print('⚠️ No GPU detected! Go to Runtime → Change runtime type → T4 GPU')"
    ]
   },
   {
@@ -90,12 +90,12 @@
     "The full model is defined below in a single cell for easy Colab use.\n",
     "\n",
     "### Component breakdown:\n",
-    "1. **Parallel Scan (PScan)** — Blelloch parallel prefix scan in pure PyTorch\n",
-    "2. **Selective SSM (S6)** — Mamba's core mechanism, input-dependent state space\n",
-    "3. **2D Cross-Scan** — VMamba-style 4-directional scanning for 2D feature maps\n",
-    "4. **Mobile Conv Blocks** — Depthwise separable + SE + FiLM conditioning\n",
-    "5. **Encoder** — Progressive downsampling with hybrid MobileConv + Mamba stages\n",
-    "6. **Decoder** — Lightweight with FiLM style modulation, PixelShuffle upsampling"
    ]
   },
   {
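Note on the parallel scan: the component list names a Blelloch-style parallel prefix scan as the basis of the SSM, but the body of `PScan` is not visible in this diff. The following is only a minimal pure-Python sketch of the underlying idea, assuming the linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0. The helper names (`combine`, `prefix_scan`, `ssm_states`) are hypothetical and are not taken from the notebook. Adjacent pairs are combined first, the half-length sequence is scanned recursively, and the results are then expanded back, which is the work-efficient pattern that a tensorized implementation would run in parallel at each level.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def prefix_scan(elems):
    """Inclusive scan under `combine` using a pairwise reduce-then-expand recursion."""
    n = len(elems)
    if n <= 1:
        return list(elems)
    # Reduce: combine adjacent pairs (independent of each other, hence parallelizable).
    pairs = [combine(elems[i], elems[i + 1]) for i in range(0, n - 1, 2)]
    scanned = prefix_scan(pairs)  # scanned[k] covers elements 0 .. 2k+1
    # Expand: odd positions are read off directly, even positions need one extra combine.
    out = [elems[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], elems[i]))
    return out


def ssm_states(a, b):
    """Hidden states h_1..h_T for h_t = a_t*h_{t-1} + b_t, assuming h_0 = 0."""
    return [offset for _, offset in prefix_scan(list(zip(a, b)))]


def sequential_states(a, b):
    """Reference implementation: the plain sequential recurrence."""
    h, states = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


a = [0.5, 0.9, 1.0, 0.2, 0.7]
b = [1.0, 2.0, -1.0, 0.5, 3.0]
print(ssm_states(a, b))
print(sequential_states(a, b))
```

Both printed lists should agree up to floating-point rounding. The recursion depth is logarithmic in the sequence length, which is what makes the scan attractive for long flattened feature maps.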
@@ -112,7 +112,7 @@
     "\n",
     "\n",
     "# ============================================================================\n",
-    "# Parallel Scan (Blelloch) — Pure PyTorch, no CUDA kernels\n",
     "# ============================================================================\n",
     "\n",
     "class PScan(torch.autograd.Function):\n",
@@ -379,13 +379,13 @@
     "            nn.Conv2d(in_channels * 4, stage_channels[0], 3, padding=1, bias=False),\n",
     "            nn.BatchNorm2d(stage_channels[0]), nn.SiLU(inplace=True))\n",
     "\n",
-    "        # Stage 1: H/2 → H/4 (MobileConv only)\n",
     "        s1 = [MobileConvBlock(stage_channels[0], stage_channels[1], stride=2)]\n",
     "        for _ in range(stage_blocks[0] - 1):\n",
     "            s1.append(MobileConvBlock(stage_channels[1], stage_channels[1]))\n",
     "        self.stage1 = nn.Sequential(*s1)\n",
     "\n",
-    "        # Stage 2: H/4 → H/8 (hybrid MobileConv + Mamba)\n",
     "        s2 = nn.ModuleList()\n",
     "        s2.append(MobileConvBlock(stage_channels[1], stage_channels[2], stride=2))\n",
     "        n_mamba = max(1, (stage_blocks[1] - 1) // 2)\n",
@@ -398,7 +398,7 @@
     "        self.detail_head_mu = nn.Conv2d(stage_channels[2], latent_detail_dim, 1)\n",
     "        self.detail_head_logvar = nn.Conv2d(stage_channels[2], latent_detail_dim, 1)\n",
     "\n",
-    "        # Stage 3: H/8 → H/16 (Mamba-heavy)\n",
     "        s3 = nn.ModuleList()\n",
     "        s3.append(MobileConvBlock(stage_channels[2], stage_channels[3], stride=2))\n",
     "        n_mamba3 = max(1, int((stage_blocks[2] - 1) * 0.75))\n",
@@ -533,18 +533,18 @@
     "# ============================================================================\n",
     "\n",
     "def pmavae_small(use_parallel_scan=True):\n",
-    "    \"\"\"~6M params — fast training on free Colab T4\"\"\"\n",
     "    return PMAVAE(enc_channels=(48, 96, 144, 192), dec_channels=(192, 144, 96, 72, 48),\n",
     "                  enc_blocks=(2, 2, 3, 3), latent_base_dim=24, latent_detail_dim=6,\n",
     "                  latent_style_dim=96, d_state=16, use_parallel_scan=use_parallel_scan)\n",
     "\n",
     "def pmavae_base(use_parallel_scan=True):\n",
-    "    \"\"\"~15M params — high quality, needs more VRAM\"\"\"\n",
     "    return PMAVAE(enc_channels=(64, 128, 192, 256), dec_channels=(256, 192, 128, 96, 64),\n",
     "                  enc_blocks=(2, 2, 4, 4), latent_base_dim=32, latent_detail_dim=8,\n",
     "                  latent_style_dim=128, d_state=16, use_parallel_scan=use_parallel_scan)\n",
     "\n",
-    "print('✅ PMA-VAE architecture defined!')"
    ]
   },
   {
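Note on latent shapes: the architecture summary places `z_base` at H/16, `z_detail` at H/8 and `z_style` as a global vector, and the ONNX export cell later uses shapes (1, 24, 16, 16), (1, 6, 32, 32) and (1, 96) for a 256×256 input. A small sketch of that arithmetic is shown here. The function name `latent_shapes` and the multiple-of-16 check are assumptions for illustration, and the default dimensions follow `pmavae_small`.

```python
def latent_shapes(height, width, base_dim=24, detail_dim=6, style_dim=96, batch=1):
    """Expected latent tensor shapes for an H x W input (defaults follow pmavae_small)."""
    # Assumption: inputs are padded or cropped to a multiple of the total stride (16).
    if height % 16 or width % 16:
        raise ValueError("height and width must be multiples of 16")
    return {
        "z_base": (batch, base_dim, height // 16, width // 16),
        "z_detail": (batch, detail_dim, height // 8, width // 8),
        "z_style": (batch, style_dim),
    }


def latent_element_count(shapes):
    """Total number of latent values per batch across the three latents."""
    total = 0
    for shape in shapes.values():
        count = 1
        for dim in shape:
            count *= dim
        total += count
    return total


shapes = latent_shapes(256, 256)
for name, shape in shapes.items():
    print(name, shape)
print("pixels / latent values:", (3 * 256 * 256) / latent_element_count(shapes))
```

The ratio printed at the end gives a rough sense of how strongly the three latents together compress an RGB image at this resolution.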
@@ -568,7 +568,7 @@
     "    print(f'  {k}: {v.shape}')\n",
     "\n",
     "params = model.count_parameters()\n",
-    "print(f'\\n📊 Parameters: {params[\"total_M\"]:.2f}M total')\n",
     "print(f'   Encoder: {params[\"enc_M\"]:.2f}M | Decoder: {params[\"dec_M\"]:.2f}M')\n",
     "\n",
     "del model, x, recon\n",
@@ -582,12 +582,12 @@
     "## 3. Loss Functions\n",
     "\n",
     "Our loss combines:\n",
-    "- **L1 reconstruction** — pixel-level fidelity\n",
-    "- **VGG perceptual** — semantic/structural similarity  \n",
-    "- **PatchGAN discriminator** — sharp, realistic textures\n",
-    "- **KL with free bits** — prevents posterior collapse\n",
-    "- **Edge preservation** — high-frequency detail via Sobel filters\n",
-    "- **Adaptive discriminator weight** — taming-transformers trick"
    ]
   },
   {
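Note on the adaptive discriminator weight: the loss list refers to the taming-transformers trick, which scales the adversarial term by the ratio of gradient norms measured at the last decoder layer. The notebook's own implementation is not part of this diff, so the following is a hedged pure-Python sketch that takes already-flattened gradient values as plain lists. The function names and the default `disc_weight=1.0` are assumptions; the `1e-4` epsilon and the clamp to `[0, 1e4]` follow the taming-transformers reference code.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values))


def adaptive_disc_weight(rec_grads, gan_grads, disc_weight=1.0, eps=1e-4, max_weight=1e4):
    """Weight for the generator's adversarial loss.

    rec_grads: gradient of the reconstruction loss w.r.t. the last decoder layer
    gan_grads: gradient of the generator's adversarial loss w.r.t. the same layer
    """
    weight = l2_norm(rec_grads) / (l2_norm(gan_grads) + eps)
    weight = min(max(weight, 0.0), max_weight)  # clamp so a vanishing GAN gradient cannot explode the weight
    return weight * disc_weight


# Reconstruction gradients ten times larger than adversarial ones -> weight close to 10.
print(adaptive_disc_weight([3.0, 4.0], [0.3, 0.4]))
# Vanishing adversarial gradient -> weight saturates at the clamp.
print(adaptive_disc_weight([3.0, 4.0], [0.0, 0.0]))
```

In a PyTorch training step the resulting weight would normally be detached before it multiplies the adversarial loss, so that it acts as a constant during backpropagation.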
@@ -701,7 +701,7 @@
     "            d = hinge_d_loss(self.discriminator(inputs.detach()), self.discriminator(recon.detach()))\n",
     "            return d, {'d_loss': d.item()}\n",
     "\n",
-    "print('✅ Loss functions defined!')"
    ]
   },
   {
@@ -711,8 +711,8 @@
     "## 4. Dataset Setup\n",
     "\n",
     "We use a HuggingFace dataset for training. Options:\n",
-    "- `huggan/wikiart` — artistic images (great for style learning)\n",
-    "- `ILSVRC/imagenet-1k` — diverse natural images\n",
     "- Any folder of images\n",
     "\n",
     "For free Colab, we use a moderate-sized art dataset."
@@ -724,35 +724,61 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from torch.utils.data import DataLoader, Dataset\n",
     "from torchvision import transforms\n",
     "from PIL import Image\n",
     "import os\n",
     "\n",
-    "# ======== Option A: HuggingFace Dataset ========\n",
-    "class HFImageDataset(Dataset):\n",
-    "    def __init__(self, hf_dataset, image_col='image', resolution=256):\n",
-    "        self.ds = hf_dataset\n",
     "        self.col = image_col\n",
     "        self.transform = transforms.Compose([\n",
-    "            transforms.Resize(int(resolution * 1.15), interpolation=transforms.InterpolationMode.LANCZOS, antialias=True),\n",
     "            transforms.RandomCrop(resolution),\n",
     "            transforms.RandomHorizontalFlip(),\n",
     "            transforms.ToTensor(),\n",
     "            transforms.Normalize([0.5]*3, [0.5]*3)])\n",
-    "    def __len__(self): return len(self.ds)\n",
-    "    def __getitem__(self, idx):\n",
-    "        img = self.ds[idx][self.col]\n",
-    "        if not isinstance(img, Image.Image): img = Image.fromarray(img)\n",
-    "        return self.transform(img.convert('RGB'))\n",
     "\n",
-    "# ======== Option B: Local folder ========\n",
     "class FolderDataset(Dataset):\n",
     "    def __init__(self, root, resolution=256):\n",
     "        exts = {'.jpg','.jpeg','.png','.bmp','.webp'}\n",
-    "        self.files = [os.path.join(dp,f) for dp,_,fns in os.walk(root) for f in fns if os.path.splitext(f)[1].lower() in exts]\n",
     "        self.transform = transforms.Compose([\n",
-    "            transforms.Resize(int(resolution * 1.15), interpolation=transforms.InterpolationMode.LANCZOS, antialias=True),\n",
     "            transforms.RandomCrop(resolution),\n",
     "            transforms.RandomHorizontalFlip(),\n",
     "            transforms.ToTensor(),\n",
@@ -761,7 +787,7 @@
     "    def __getitem__(self, idx):\n",
     "        return self.transform(Image.open(self.files[idx]).convert('RGB'))\n",
     "\n",
-    "print('✅ Dataset classes defined!')"
    ]
   },
   {
@@ -770,33 +796,58 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Load dataset\n",
     "from datasets import load_dataset\n",
     "\n",
-    "# === Choose your dataset ===\n",
-    "DATASET_NAME = 'huggan/wikiart'  # Art images - great for style learning\n",
     "IMAGE_COLUMN = 'image'\n",
-    "RESOLUTION = 256  # Start with 256, can increase to 512 later\n",
-    "BATCH_SIZE = 8    # Fits on T4 with small model\n",
-    "NUM_WORKERS = 2\n",
     "\n",
-    "print(f'Loading {DATASET_NAME}...')\n",
-    "raw_dataset = load_dataset(DATASET_NAME, split='train', streaming=False)\n",
-    "# For very large datasets, use streaming=True and take a subset:\n",
-    "# raw_dataset = load_dataset(DATASET_NAME, split='train', streaming=True)\n",
-    "# raw_dataset = list(raw_dataset.take(50000))\n",
     "\n",
-    "dataset = HFImageDataset(raw_dataset, IMAGE_COLUMN, RESOLUTION)\n",
-    "dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True,\n",
-    "                        num_workers=NUM_WORKERS, pin_memory=True, drop_last=True)\n",
     "\n",
-    "print(f'Dataset: {len(dataset)} images')\n",
-    "print(f'Batches per epoch: {len(dataloader)}')\n",
     "\n",
-    "# Quick check\n",
     "sample = next(iter(dataloader))\n",
-    "print(f'Batch shape: {sample.shape}')\n",
-    "print(f'Value range: [{sample.min():.2f}, {sample.max():.2f}]')"
    ]
   },
   {
@@ -806,12 +857,12 @@
     "## 5. Training\n",
     "\n",
     "### Training recipe:\n",
-    "- **Phase 1** (256×256): Learn structure and composition\n",
-    "- **Phase 2** (384×384): Refine texture details\n",
-    "- **Phase 3** (512×512): Fine-tune for high-res quality\n",
     "\n",
     "### Anti-collapse measures:\n",
-    "1. **KL warmup**: β goes from 0 → target over first 5000 steps\n",
     "2. **Free bits**: Each latent dimension must use at least 0.25 nats\n",
     "3. **Discriminator cold start**: Only activates after 10000 steps\n",
     "4. **Adaptive disc weight**: Balances recon vs adversarial gradients\n",
@@ -921,128 +972,98 @@
    "outputs": [],
    "source": [
     "# ============================================================================\n",
-    "# Training Loop with Live Visualization\n",
     "# ============================================================================\n",
     "\n",
-    "def visualize_reconstruction(model, batch, step):\n",
-    "    \"\"\"Show original vs reconstructed images.\"\"\"\n",
-    "    model.eval()\n",
-    "    with torch.no_grad():\n",
-    "        recon, _ = model(batch[:4].to(device))\n",
-    "    model.train()\n",
-    "    \n",
-    "    fig, axes = plt.subplots(2, 4, figsize=(16, 8))\n",
-    "    for i in range(4):\n",
-    "        orig = batch[i].permute(1,2,0).cpu().numpy() * 0.5 + 0.5\n",
-    "        rec = recon[i].permute(1,2,0).cpu().numpy() * 0.5 + 0.5\n",
-    "        axes[0,i].imshow(orig.clip(0,1))\n",
-    "        axes[0,i].set_title('Original')\n",
-    "        axes[0,i].axis('off')\n",
-    "        axes[1,i].imshow(rec.clip(0,1))\n",
-    "        axes[1,i].set_title(f'Recon (step {step})')\n",
-    "        axes[1,i].axis('off')\n",
-    "    plt.tight_layout()\n",
-    "    plt.show()\n",
-    "\n",
-    "def plot_losses(history):\n",
-    "    \"\"\"Plot training loss curves.\"\"\"\n",
-    "    if len(history) < 10: return\n",
-    "    fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
-    "    steps = [h['step'] for h in history]\n",
-    "    axes[0].plot(steps, [h['l1'] for h in history], label='L1')\n",
-    "    axes[0].plot(steps, [h.get('perc',0) for h in history], label='Perceptual')\n",
-    "    axes[0].set_title('Reconstruction Losses'); axes[0].legend(); axes[0].set_xlabel('Step')\n",
-    "    axes[1].plot(steps, [h.get('kl_base',0) for h in history], label='KL base')\n",
-    "    axes[1].plot(steps, [h.get('kl_detail',0) for h in history], label='KL detail')\n",
-    "    axes[1].set_title('KL Losses'); axes[1].legend(); axes[1].set_xlabel('Step')\n",
-    "    axes[2].plot(steps, [h.get('d_loss',0) for h in history], label='Disc')\n",
-    "    axes[2].plot(steps, [h.get('g_loss',0) for h in history], label='Gen')\n",
-    "    axes[2].set_title('GAN Losses'); axes[2].legend(); axes[2].set_xlabel('Step')\n",
-    "    plt.tight_layout(); plt.show()\n",
-    "\n",
-    "# === TRAINING LOOP ===\n",
     "global_step = 0\n",
     "history = []\n",
     "start_time = time.time()\n",
-    "vis_batch = next(iter(dataloader))  # Fixed batch for visualization\n",
     "\n",
-    "print(f'\\n🚀 Starting training! Target: {CONFIG[\"max_steps\"]} steps')\n",
-    "print(f'   KL warmup: 0 → {CONFIG[\"kl_weight\"]} over {CONFIG[\"kl_warmup_steps\"]} steps')\n",
     "print(f'   Discriminator starts at step {CONFIG[\"disc_start\"]}\\n')\n",
     "\n",
     "model.train()\n",
-    "for epoch in range(CONFIG['num_epochs']):\n",
-    "    for batch_idx, batch in enumerate(dataloader):\n",
-    "        batch = batch.to(device)\n",
-    "        \n",
-    "        # KL warmup\n",
-    "        kl_w = CONFIG['kl_weight'] * min(1.0, global_step / max(1, CONFIG['kl_warmup_steps']))\n",
-    "        criterion.kl_weight = kl_w\n",
-    "        \n",
-    "        # === VAE update ===\n",
-    "        opt_vae.zero_grad()\n",
-    "        with autocast('cuda', enabled=device=='cuda'):\n",
-    "            recon, posteriors = model(batch)\n",
-    "            loss_vae, log_vae = criterion(batch, recon, posteriors, 0, global_step,\n",
-    "                                          model.get_last_decoder_layer())\n",
-    "        scaler_vae.scale(loss_vae).backward()\n",
-    "        scaler_vae.unscale_(opt_vae)\n",
-    "        gn = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
-    "        scaler_vae.step(opt_vae)\n",
-    "        scaler_vae.update()\n",
-    "        \n",
-    "        # === Discriminator update ===\n",
-    "        opt_disc.zero_grad()\n",
-    "        with autocast('cuda', enabled=device=='cuda'):\n",
-    "            with torch.no_grad():\n",
-    "                recon_d, _ = model(batch)\n",
-    "            loss_disc, log_disc = criterion(batch, recon_d, posteriors, 1, global_step)\n",
-    "        if global_step >= CONFIG['disc_start']:\n",
-    "            scaler_disc.scale(loss_disc).backward()\n",
-    "            scaler_disc.unscale_(opt_disc)\n",
-    "            torch.nn.utils.clip_grad_norm_(criterion.discriminator.parameters(), 1.0)\n",
-    "            scaler_disc.step(opt_disc)\n",
-    "            scaler_disc.update()\n",
-    "        \n",
-    "        global_step += 1\n",
-    "        \n",
-    "        # Logging\n",
-    "        log = {**log_vae, **log_disc, 'step': global_step, 'grad_norm': gn.item(), 'kl_w': kl_w}\n",
-    "        \n",
-    "        if global_step % CONFIG['log_every'] == 0:\n",
-    "            history.append(log)\n",
-    "            elapsed = (time.time() - start_time) / 60\n",
-    "            print(f\"Step {global_step:6d} | L1:{log['l1']:.4f} | Perc:{log.get('perc',0):.4f} | \"\n",
-    "                  f\"KL:{log.get('kl_base',0):.1f}/{log.get('kl_detail',0):.1f}/{log.get('kl_style',0):.1f} | \"\n",
-    "                  f\"D:{log.get('d_loss',0):.4f} | G:{log.get('g_loss',0):.4f} | \"\n",
-    "                  f\"GN:{log['grad_norm']:.2f} | {elapsed:.1f}min\")\n",
-    "        \n",
-    "        if global_step % CONFIG['vis_every'] == 0:\n",
-    "            clear_output(wait=True)\n",
-    "            visualize_reconstruction(model, vis_batch, global_step)\n",
-    "            plot_losses(history)\n",
-    "        \n",
-    "        if global_step % CONFIG['save_every'] == 0:\n",
-    "            os.makedirs('checkpoints', exist_ok=True)\n",
-    "            torch.save({'model': model.state_dict(),\n",
-    "                       'disc': criterion.discriminator.state_dict(),\n",
-    "                       'opt_vae': opt_vae.state_dict(),\n",
-    "                       'opt_disc': opt_disc.state_dict(),\n",
-    "                       'step': global_step, 'config': CONFIG},\n",
-    "                      f'checkpoints/pma_vae_step{global_step}.pt')\n",
-    "            print(f'💾 Saved checkpoint at step {global_step}')\n",
-    "        \n",
-    "        if global_step >= CONFIG['max_steps']:\n",
-    "            break\n",
-    "    \n",
-    "    if global_step >= CONFIG['max_steps']:\n",
-    "        break\n",
     "\n",
     "# Final save\n",
     "torch.save({'model': model.state_dict(), 'config': CONFIG}, 'checkpoints/pma_vae_final.pt')\n",
     "total_time = (time.time() - start_time) / 60\n",
-    "print(f'\\n✅ Training complete! {global_step} steps in {total_time:.1f} minutes')\n",
-    "print(f'💾 Final model saved to checkpoints/pma_vae_final.pt')"
    ]
   },
   {
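Note on KL warmup and free bits: the training section lists a linear β warmup and a free-bits floor of 0.25 nats per latent dimension. The warmup expression below matches the one in the removed loop (`CONFIG['kl_weight'] * min(1.0, global_step / max(1, CONFIG['kl_warmup_steps']))`). The free-bits part is only a sketch of one common formulation (a per-dimension floor on the KL), since the notebook's criterion is not shown in this diff. The target weight `1e-4` used in the demo is a hypothetical value, not taken from `CONFIG`.

```python
import math


def kl_warmup_weight(step, target, warmup_steps):
    """Linear ramp of the KL weight from 0 to `target` over `warmup_steps` steps."""
    return target * min(1.0, step / max(1, warmup_steps))


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for one latent dimension, in nats."""
    return 0.5 * (mu * mu + math.exp(logvar) - 1.0 - logvar)


def free_bits_kl(kl_per_dim, free_bits=0.25):
    """Sum of per-dimension KL values, each floored at `free_bits` nats.

    Dimensions below the floor contribute a constant, so the optimizer gains
    nothing by pushing them further toward the prior.
    """
    return sum(max(kl, free_bits) for kl in kl_per_dim)


for step in (0, 2500, 5000, 10000):
    print(step, kl_warmup_weight(step, target=1e-4, warmup_steps=5000))

kls = [gaussian_kl(0.0, 0.0), gaussian_kl(1.0, 0.0), gaussian_kl(0.2, -0.1)]
print(kls, free_bits_kl(kls))
```

With these two pieces combined, the early phase of training is dominated by reconstruction, and dimensions that carry little information are not penalized below the floor.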
@@ -1080,7 +1101,7 @@
     "    psnr = -10 * math.log10(mse + 1e-8)\n",
     "    psnrs.append(psnr)\n",
     "\n",
-    "print(f'\\n📊 Evaluation Results:')\n",
     "print(f'   Average PSNR: {sum(psnrs)/len(psnrs):.2f} dB')\n",
     "print(f'   Min PSNR: {min(psnrs):.2f} dB')\n",
     "print(f'   Max PSNR: {max(psnrs):.2f} dB')"
@@ -1144,7 +1165,7 @@
     "        out = model.decoder(pa['base_mu'], pa['detail_mu'], z_style)\n",
     "    img = out[0].cpu().permute(1,2,0).numpy() * 0.5 + 0.5\n",
     "    axes[i].imshow(img.clip(0,1))\n",
-    "    axes[i].set_title(f'α={alpha:.2f}')\n",
     "    axes[i].axis('off')\n",
     "plt.suptitle('Style Interpolation (structure fixed, style varies)', fontsize=14)\n",
     "plt.tight_layout()\n",
@@ -1197,7 +1218,7 @@
     "model.eval()\n",
     "\n",
     "# Dummy inputs matching the latent shapes\n",
-    "dummy_base = torch.randn(1, 24, 16, 16, device=device)   # For 256×256 input\n",
     "dummy_detail = torch.randn(1, 6, 32, 32, device=device)\n",
     "dummy_style = torch.randn(1, 96, device=device)\n",
     "\n",
@@ -1215,7 +1236,7 @@
     ")\n",
     "\n",
     "onnx_size = os.path.getsize('pma_vae_decoder.onnx') / 1024**2\n",
-    "print(f'\\n📱 ONNX decoder exported!')\n",
     "print(f'   Size: {onnx_size:.1f} MB')\n",
     "print(f'   Ready for: Core ML, TFLite, ONNX Runtime Mobile')\n",
     "\n",
@@ -1228,7 +1249,7 @@
    "source": [
     "## 9. Progressive Resolution Training\n",
     "\n",
-    "After initial training at 256×256, progressively increase resolution.\n",
     "The model handles variable resolutions thanks to the convolutional architecture."
    ]
   },
@@ -1253,7 +1274,7 @@
     "# for pg in opt_disc.param_groups: pg['lr'] *= 0.5\n",
     "# \n",
     "# # Continue training (copy the training loop above with dataloader_hr)\n",
-    "# print(f'Phase 2: Training at {NEW_RESOLUTION}×{NEW_RESOLUTION}')\n",
     "# print(f'Batches per epoch: {len(dataloader_hr)}')"
    ]
   },
@@ -1276,8 +1297,8 @@
     "model.eval()\n",
     "\n",
     "# Take a high-res image and downsample it\n",
-    "hr_img = test_batch[0:1]  # 256×256\n",
-    "lr_img = F.interpolate(hr_img, scale_factor=0.5, mode='bilinear', align_corners=False)  # 128×128\n",
     "lr_upscaled = F.interpolate(lr_img, size=(256, 256), mode='bilinear', align_corners=False)\n",
     "\n",
     "with torch.no_grad():\n",
@@ -1313,7 +1334,7 @@
     "| Component | Choice | Why |\n",
     "|---|---|---|\n",
     "| Backbone | MobileConv + Parallel 2D Mamba | Fast, efficient, attention-free |\n",
-    "| Downsampling | PixelUnshuffle → stride-2 conv | Lossless initial features |\n",
     "| Upsampling | PixelShuffle (sub-pixel) | Mobile-friendly, no checkerboard |\n",
     "| Latent | Multi-scale (base/detail/style) | Controllable, prevents collapse |\n",
     "| Style control | FiLM conditioning | Lightweight, multiplicative |\n",
@@ -1326,13 +1347,13 @@
     "\n",
     "| Feature | PMA-VAE | SD-VAE | NVAE |\n",
     "|---|---|---|---|\n",
-    "| Attention-free | ✅ | ❌ | ❌ |\n",
-    "| Mobile-friendly decoder | ✅ | ❌ | ❌ |\n",
-    "| Multi-scale latent | ✅ | ❌ | ✅ |\n",
-    "| Style control built-in | ✅ | ❌ | ❌ |\n",
     "| Decoder params | ~4-8M | ~50M | ~100M+ |\n",
-    "| Parallel training | ✅ | ✅ | ✅ |\n",
-    "| Free Colab trainable | ✅ | ❌ | ❌ |"
    ]
   },
   {
@@ -1341,7 +1362,7 @@
    "source": [
     "---\n",
     "\n",
-    "## 📚 References\n",
     "\n",
     "- **Mamba**: Gu & Dao, 2023. [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)\n",
     "- **VMamba**: Liu et al., 2024. [VMamba: Visual State Space Model](https://arxiv.org/abs/2401.10166)\n",
@@ -1355,4 +1376,4 @@
    ]
   }
  ]
-}

    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "# \ud83c\udfa8 PMA-VAE: Parallel Mobile Artistic Variational Autoencoder\n",
     "\n",
     "**A novel attention-free architecture for image generation, super-resolution, artifact removal, and artistic style transfer.**\n",
     "\n",
     "\n",
     "## Architecture\n",
     "```\n",
+    "Image \u2192 PixelUnshuffle stem \u2192 MobileConv stages \u2192 Parallel 2D Mamba blocks\n",
+    "  \u2192 Multi-scale latent (z_base H/16, z_detail H/8, z_style global)\n",
+    "  \u2192 Light parallel decoder with FiLM style modulation \u2192 Reconstructed image\n",
     "```\n",
     "\n",
     "## Key Design Decisions\n",
+    "- **Parallel scan SSM** (Blelloch algorithm) \u2014 pure PyTorch, no CUDA kernels needed\n",
+    "- **Cross-scan 2D** (VMamba-style) \u2014 4 directional scans for global context without attention\n",
+    "- **PixelShuffle upsampling** \u2014 efficient sub-pixel convolution for mobile\n",
+    "- **Taming-transformers loss recipe** \u2014 adaptive discriminator weight balancing\n",
+    "- **Progressive resolution training** \u2014 start small, scale up\n",
     "\n",
     "---\n",
     "**Trainable on free Colab T4 GPU (15GB VRAM) in ~2-4 hours for meaningful results.**"
     "    print(f'GPU: {torch.cuda.get_device_name(0)}')\n",
     "    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')\n",
     "else:\n",
+    "    print('\u26a0\ufe0f No GPU detected! Go to Runtime \u2192 Change runtime type \u2192 T4 GPU')"
    ]
   },
   {
     "The full model is defined below in a single cell for easy Colab use.\n",
     "\n",
     "### Component breakdown:\n",
+    "1. **Parallel Scan (PScan)** \u2014 Blelloch parallel prefix scan in pure PyTorch\n",
+    "2. **Selective SSM (S6)** \u2014 Mamba's core mechanism, input-dependent state space\n",
+    "3. **2D Cross-Scan** \u2014 VMamba-style 4-directional scanning for 2D feature maps\n",
+    "4. **Mobile Conv Blocks** \u2014 Depthwise separable + SE + FiLM conditioning\n",
+    "5. **Encoder** \u2014 Progressive downsampling with hybrid MobileConv + Mamba stages\n",
+    "6. **Decoder** \u2014 Lightweight with FiLM style modulation, PixelShuffle upsampling"
    ]
   },
   {
     "\n",
     "\n",
     "# ============================================================================\n",
+    "# Parallel Scan (Blelloch) \u2014 Pure PyTorch, no CUDA kernels\n",
     "# ============================================================================\n",
     "\n",
     "class PScan(torch.autograd.Function):\n",
     "            nn.Conv2d(in_channels * 4, stage_channels[0], 3, padding=1, bias=False),\n",
     "            nn.BatchNorm2d(stage_channels[0]), nn.SiLU(inplace=True))\n",
     "\n",
+    "        # Stage 1: H/2 \u2192 H/4 (MobileConv only)\n",
     "        s1 = [MobileConvBlock(stage_channels[0], stage_channels[1], stride=2)]\n",
     "        for _ in range(stage_blocks[0] - 1):\n",
     "            s1.append(MobileConvBlock(stage_channels[1], stage_channels[1]))\n",
     "        self.stage1 = nn.Sequential(*s1)\n",
     "\n",
+    "        # Stage 2: H/4 \u2192 H/8 (hybrid MobileConv + Mamba)\n",
     "        s2 = nn.ModuleList()\n",
     "        s2.append(MobileConvBlock(stage_channels[1], stage_channels[2], stride=2))\n",
     "        n_mamba = max(1, (stage_blocks[1] - 1) // 2)\n",
     "        self.detail_head_mu = nn.Conv2d(stage_channels[2], latent_detail_dim, 1)\n",
     "        self.detail_head_logvar = nn.Conv2d(stage_channels[2], latent_detail_dim, 1)\n",
     "\n",
+    "        # Stage 3: H/8 \u2192 H/16 (Mamba-heavy)\n",
     "        s3 = nn.ModuleList()\n",
     "        s3.append(MobileConvBlock(stage_channels[2], stage_channels[3], stride=2))\n",
     "        n_mamba3 = max(1, int((stage_blocks[2] - 1) * 0.75))\n",
     "# ============================================================================\n",
     "\n",
     "def pmavae_small(use_parallel_scan=True):\n",
+    "    \"\"\"~6M params \u2014 fast training on free Colab T4\"\"\"\n",
     "    return PMAVAE(enc_channels=(48, 96, 144, 192), dec_channels=(192, 144, 96, 72, 48),\n",
     "                  enc_blocks=(2, 2, 3, 3), latent_base_dim=24, latent_detail_dim=6,\n",
     "                  latent_style_dim=96, d_state=16, use_parallel_scan=use_parallel_scan)\n",
     "\n",
     "def pmavae_base(use_parallel_scan=True):\n",
+    "    \"\"\"~15M params \u2014 high quality, needs more VRAM\"\"\"\n",
     "    return PMAVAE(enc_channels=(64, 128, 192, 256), dec_channels=(256, 192, 128, 96, 64),\n",
     "                  enc_blocks=(2, 2, 4, 4), latent_base_dim=32, latent_detail_dim=8,\n",
     "                  latent_style_dim=128, d_state=16, use_parallel_scan=use_parallel_scan)\n",
     "\n",
+    "print('\u2705 PMA-VAE architecture defined!')"
    ]
   },
   {
     "    print(f'  {k}: {v.shape}')\n",
     "\n",
     "params = model.count_parameters()\n",
+    "print(f'\\n\ud83d\udcca Parameters: {params[\"total_M\"]:.2f}M total')\n",
     "print(f'   Encoder: {params[\"enc_M\"]:.2f}M | Decoder: {params[\"dec_M\"]:.2f}M')\n",
     "\n",
     "del model, x, recon\n",
     "## 3. Loss Functions\n",
     "\n",
     "Our loss combines:\n",
+    "- **L1 reconstruction** \u2014 pixel-level fidelity\n",
+    "- **VGG perceptual** \u2014 semantic/structural similarity  \n",
+    "- **PatchGAN discriminator** \u2014 sharp, realistic textures\n",
+    "- **KL with free bits** \u2014 prevents posterior collapse\n",
+    "- **Edge preservation** \u2014 high-frequency detail via Sobel filters\n",
+    "- **Adaptive discriminator weight** \u2014 taming-transformers trick"
    ]
   },
   {
     "            d = hinge_d_loss(self.discriminator(inputs.detach()), self.discriminator(recon.detach()))\n",
     "            return d, {'d_loss': d.item()}\n",
     "\n",
+    "print('\u2705 Loss functions defined!')"
    ]
   },
   {
     "## 4. Dataset Setup\n",
     "\n",
     "We use a HuggingFace dataset for training. Options:\n",
+    "- `huggan/wikiart` \u2014 artistic images (great for style learning)\n",
+    "- `ILSVRC/imagenet-1k` \u2014 diverse natural images\n",
     "- Any folder of images\n",
     "\n",
     "For free Colab, we use a moderate-sized art dataset."
    "metadata": {},
    "outputs": [],
    "source": [
+    "from torch.utils.data import DataLoader, Dataset, IterableDataset\n",
     "from torchvision import transforms\n",
     "from PIL import Image\n",
     "import os\n",
     "\n",
+    "# ======== Streaming HF Dataset (RAM-safe) ========\n",
+    "# This NEVER loads the full dataset into RAM.\n",
+    "# Images are decoded one-at-a-time from Parquet shards.\n",
+    "\n",
+    "class StreamingHFDataset(IterableDataset):\n",
+    "    \"\"\"\n",
+    "    Wraps a HuggingFace streaming dataset for PyTorch.\n",
+    "    RAM usage: ~50-100MB regardless of dataset size.\n",
+    "    \n",
+    "    Key: we use datasets streaming mode which reads Parquet\n",
+    "    files chunk-by-chunk from HF Hub, never materializing\n",
+    "    the full dataset in memory.\n",
+    "    \"\"\"\n",
+    "    def __init__(self, hf_iterable_dataset, image_col='image', resolution=256):\n",
+    "        self.ds = hf_iterable_dataset\n",
     "        self.col = image_col\n",
     "        self.transform = transforms.Compose([\n",
+    "            transforms.Resize(int(resolution * 1.15),\n",
+    "                              interpolation=transforms.InterpolationMode.LANCZOS,\n",
+    "                              antialias=True),\n",
     "            transforms.RandomCrop(resolution),\n",
     "            transforms.RandomHorizontalFlip(),\n",
     "            transforms.ToTensor(),\n",
     "            transforms.Normalize([0.5]*3, [0.5]*3)])\n",
     "\n",
+    "    def __iter__(self):\n",
+    "        for sample in self.ds:\n",
+    "            img = sample[self.col]\n",
+    "            if not isinstance(img, Image.Image):\n",
+    "                img = Image.fromarray(img)\n",
+    "            img = img.convert('RGB')\n",
+    "            # Skip tiny images: Resize would have to upscale them heavily before the crop\n",
+    "            w, h = img.size\n",
+    "            if w < 64 or h < 64:\n",
+    "                continue\n",
+    "            try:\n",
+    "                yield self.transform(img)\n",
+    "            except Exception:\n",
+    "                continue  # skip corrupt images\n",
+    "\n",
+    "# ======== Local folder (non-streaming) ========\n",
     "class FolderDataset(Dataset):\n",
     "    def __init__(self, root, resolution=256):\n",
     "        exts = {'.jpg','.jpeg','.png','.bmp','.webp'}\n",
+    "        self.files = [os.path.join(dp,f) for dp,_,fns in os.walk(root)\n",
+    "                      for f in fns if os.path.splitext(f)[1].lower() in exts]\n",
     "        self.transform = transforms.Compose([\n",
+    "            transforms.Resize(int(resolution * 1.15),\n",
+    "                              interpolation=transforms.InterpolationMode.LANCZOS,\n",
+    "                              antialias=True),\n",
     "            transforms.RandomCrop(resolution),\n",
     "            transforms.RandomHorizontalFlip(),\n",
     "            transforms.ToTensor(),\n",
     "    def __getitem__(self, idx):\n",
     "        return self.transform(Image.open(self.files[idx]).convert('RGB'))\n",
     "\n",
+    "print('\u2705 Dataset classes defined!')"
    ]
   },
   {
    "metadata": {},
    "outputs": [],
    "source": [
     "from datasets import load_dataset\n",
     "\n",
+    "# ============================================================================\n",
+    "# Dataset Configuration\n",
+    "# ============================================================================\n",
+    "DATASET_NAME = 'huggan/wikiart'   # 80K art images (~5GB)\n",
     "IMAGE_COLUMN = 'image'\n",
+    "RESOLUTION = 256\n",
+    "BATCH_SIZE = 8                     # Fits T4 15GB with pmavae_small\n",
     "\n",
+    "# ============================================================================\n",
+    "# CRITICAL: Use streaming=True to avoid crashing the Colab session!\n",
+    "# \n",
+    "# Without streaming: HF downloads and prepares the full ~5GB dataset up front,\n",
+    "# which can exhaust the resources of a free Colab runtime (~12GB RAM).\n",
+    "# \n",
+    "# With streaming: HF reads Parquet shards on-the-fly \u2192 decodes one\n",
+    "# image at a time \u2192 RAM usage stays bounded by the shuffle buffer.\n",
+    "# ============================================================================\n",
+    "print(f'Loading {DATASET_NAME} in streaming mode...')\n",
+    "raw_stream = load_dataset(DATASET_NAME, split='train', streaming=True)\n",
     "\n",
+    "# Shuffle with a buffer (keeps only 1000 samples in RAM at once)\n",
+    "raw_stream = raw_stream.shuffle(seed=42, buffer_size=1000)\n",
     "\n",
+    "dataset = StreamingHFDataset(raw_stream, IMAGE_COLUMN, RESOLUTION)\n",
     "\n",
+    "# ============================================================================\n",
+    "# DataLoader for streaming dataset\n",
+    "# \n",
+    "# IMPORTANT differences from map-style DataLoader:\n",
+    "# - num_workers=0 (streaming datasets handle their own I/O)\n",
+    "# - No shuffle (already shuffled in the stream buffer above)\n",
+    "# - drop_last=True (partial batches can cause issues)\n",
+    "# ============================================================================\n",
+    "dataloader = DataLoader(\n",
+    "    dataset,\n",
+    "    batch_size=BATCH_SIZE,\n",
+    "    num_workers=0,        # streaming handles I/O internally\n",
+    "    pin_memory=True,\n",
+    "    drop_last=True,\n",
+    ")\n",
+    "\n",
+    "# Quick sanity check \u2014 grab one batch\n",
+    "print('Fetching first batch...')\n",
     "sample = next(iter(dataloader))\n",
+    "print(f'\u2705 Batch shape: {sample.shape}')\n",
+    "print(f'   Value range: [{sample.min():.2f}, {sample.max():.2f}]')\n",
+    "print(f'   RAM usage: minimal (streaming mode)')\n",
+    "print()\n",
+    "print('NOTE: With streaming, len(dataloader) is unknown.')\n",
+    "print('Training runs by step count, not epoch count.')"
    ]
   },
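The buffer-based shuffle applied to the stream above can be pictured with a small pure-Python sketch. It illustrates the general technique only and is not the `datasets` library's actual implementation; `buffered_shuffle` is a hypothetical helper name.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 42) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random resident item and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5))
```

A larger buffer gives a more thorough shuffle at the cost of memory, since an item can never be emitted more than about `buffer_size` positions ahead of where it arrived in the stream.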
   {
     "## 5. Training\n",
     "\n",
     "### Training recipe:\n",
+    "- **Phase 1** (256\u00d7256): Learn structure and composition\n",
+    "- **Phase 2** (384\u00d7384): Refine texture details\n",
+    "- **Phase 3** (512\u00d7512): Fine-tune for high-res quality\n",
     "\n",
     "### Anti-collapse measures:\n",
+    "1. **KL warmup**: \u03b2 goes from 0 \u2192 target over first 5000 steps\n",
     "2. **Free bits**: Each latent dimension must use at least 0.25 nats\n",
     "3. **Discriminator cold start**: Only activates after 10000 steps\n",
     "4. **Adaptive disc weight**: Balances recon vs adversarial gradients\n",
    "outputs": [],
    "source": [
     "# ============================================================================\n",
+    "# Training Loop \u2014 Streaming-Compatible\n",
     "# ============================================================================\n",
+    "# Since streaming datasets don't have len(), we train by step count.\n",
+    "# The stream automatically loops when exhausted.\n",
     "\n",
     "global_step = 0\n",
     "history = []\n",
     "start_time = time.time()\n",
     "\n",
+    "# Get a fixed batch for visualization (detach from stream)\n",
+    "vis_batch = next(iter(dataloader)).clone()\n",
+    "\n",
+    "print(f'\\n\ud83d\ude80 Starting training! Target: {CONFIG[\"max_steps\"]} steps')\n",
+    "print(f'   KL warmup: 0 \u2192 {CONFIG[\"kl_weight\"]} over {CONFIG[\"kl_warmup_steps\"]} steps')\n",
     "print(f'   Discriminator starts at step {CONFIG[\"disc_start\"]}\\n')\n",
     "\n",
     "model.train()\n",
+    "\n",
+    "# Infinite iterator over the streaming dataloader\n",
+    "data_iter = iter(dataloader)\n",
+    "\n",
+    "while global_step < CONFIG['max_steps']:\n",
+    "    # Get next batch (re-create iterator if stream exhausted)\n",
+    "    try:\n",
+    "        batch = next(data_iter)\n",
+    "    except StopIteration:\n",
+    "        # Stream exhausted = 1 epoch done. Re-create.\n",
+    "        data_iter = iter(dataloader)\n",
+    "        batch = next(data_iter)\n",
+    "\n",
+    "    batch = batch.to(device)\n",
+    "\n",
+    "    # KL warmup\n",
+    "    kl_w = CONFIG['kl_weight'] * min(1.0, global_step / max(1, CONFIG['kl_warmup_steps']))\n",
+    "    criterion.kl_weight = kl_w\n",
+    "\n",
+    "    # === VAE update ===\n",
+    "    opt_vae.zero_grad()\n",
+    "    with autocast('cuda', enabled=device=='cuda'):\n",
+    "        recon, posteriors = model(batch)\n",
+    "        loss_vae, log_vae = criterion(batch, recon, posteriors, 0, global_step,\n",
+    "                                      model.get_last_decoder_layer())\n",
+    "    scaler_vae.scale(loss_vae).backward()\n",
+    "    scaler_vae.unscale_(opt_vae)\n",
+    "    gn = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)\n",
+    "    scaler_vae.step(opt_vae)\n",
+    "    scaler_vae.update()\n",
+    "\n",
+    "    # === Discriminator update ===\n",
+    "    opt_disc.zero_grad()\n",
+    "    with autocast('cuda', enabled=device=='cuda'):\n",
+    "        with torch.no_grad():\n",
+    "            recon_d, _ = model(batch)\n",
+    "        loss_disc, log_disc = criterion(batch, recon_d, posteriors, 1, global_step)\n",
+    "    if global_step >= CONFIG['disc_start']:\n",
+    "        scaler_disc.scale(loss_disc).backward()\n",
+    "        scaler_disc.unscale_(opt_disc)\n",
+    "        torch.nn.utils.clip_grad_norm_(criterion.discriminator.parameters(), 1.0)\n",
+    "        scaler_disc.step(opt_disc)\n",
+    "        scaler_disc.update()\n",
+    "\n",
+    "    global_step += 1\n",
+    "    log = {**log_vae, **log_disc, 'step': global_step, 'grad_norm': gn.item(), 'kl_w': kl_w}\n",
+    "\n",
+    "    if global_step % CONFIG['log_every'] == 0:\n",
+    "        history.append(log)\n",
+    "        elapsed = (time.time() - start_time) / 60\n",
+    "        print(f\"Step {global_step:6d} | L1:{log['l1']:.4f} | Perc:{log.get('perc',0):.4f} | \"\n",
+    "              f\"KL:{log.get('kl_base',0):.1f}/{log.get('kl_detail',0):.1f}/{log.get('kl_style',0):.1f} | \"\n",
+    "              f\"D:{log.get('d_loss',0):.4f} | G:{log.get('g_loss',0):.4f} | \"\n",
+    "              f\"GN:{log['grad_norm']:.2f} | {elapsed:.1f}min\")\n",
+    "\n",
+    "    if global_step % CONFIG['vis_every'] == 0:\n",
+    "        clear_output(wait=True)\n",
+    "        visualize_reconstruction(model, vis_batch, global_step)\n",
+    "        plot_losses(history)\n",
+    "\n",
+    "    if global_step % CONFIG['save_every'] == 0:\n",
+    "        os.makedirs('checkpoints', exist_ok=True)\n",
+    "        torch.save({'model': model.state_dict(),\n",
+    "                   'disc': criterion.discriminator.state_dict(),\n",
+    "                   'opt_vae': opt_vae.state_dict(),\n",
+    "                   'opt_disc': opt_disc.state_dict(),\n",
+    "                   'step': global_step, 'config': CONFIG},\n",
+    "                  f'checkpoints/pma_vae_step{global_step}.pt')\n",
+    "        print(f'\ud83d\udcbe Saved checkpoint at step {global_step}')\n",
     "\n",
     "# Final save\n",
     "torch.save({'model': model.state_dict(), 'config': CONFIG}, 'checkpoints/pma_vae_final.pt')\n",
     "total_time = (time.time() - start_time) / 60\n",
+    "print(f'\\n\u2705 Training complete! {global_step} steps in {total_time:.1f} minutes')\n",
+    "print(f'\ud83d\udcbe Final model saved to checkpoints/pma_vae_final.pt')"
    ]
   },
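The try/except restart pattern in the loop above can also be factored into a small generator. This is a minimal sketch with a hypothetical helper name, `endless_batches`; it works for any re-iterable loader and uses plain lists here in place of a `DataLoader`.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def endless_batches(loader: Iterable[T]) -> Iterator[T]:
    """Yield batches forever, starting a fresh pass whenever the loader runs out."""
    while True:
        produced = False
        for batch in loader:  # each pass calls iter(loader) again
            produced = True
            yield batch
        if not produced:
            # Guard against spinning forever on an empty or one-shot loader.
            raise RuntimeError("loader produced no batches")


batches = endless_batches([[1, 2], [3, 4], [5, 6]])
first_seven = [next(batches) for _ in range(7)]
```

Unlike `itertools.cycle`, this does not keep a copy of every batch in memory; it simply iterates the loader again on each pass.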
   {
     "    psnr = -10 * math.log10(mse + 1e-8)\n",
     "    psnrs.append(psnr)\n",
     "\n",
+    "print(f'\\n\ud83d\udcca Evaluation Results:')\n",
     "print(f'   Average PSNR: {sum(psnrs)/len(psnrs):.2f} dB')\n",
     "print(f'   Min PSNR: {min(psnrs):.2f} dB')\n",
     "print(f'   Max PSNR: {max(psnrs):.2f} dB')"
     "        out = model.decoder(pa['base_mu'], pa['detail_mu'], z_style)\n",
     "    img = out[0].cpu().permute(1,2,0).numpy() * 0.5 + 0.5\n",
     "    axes[i].imshow(img.clip(0,1))\n",
+    "    axes[i].set_title(f'\u03b1={alpha:.2f}')\n",
     "    axes[i].axis('off')\n",
     "plt.suptitle('Style Interpolation (structure fixed, style varies)', fontsize=14)\n",
     "plt.tight_layout()\n",
     "model.eval()\n",
     "\n",
     "# Dummy inputs matching the latent shapes\n",
+    "dummy_base = torch.randn(1, 24, 16, 16, device=device)   # For 256\u00d7256 input\n",
     "dummy_detail = torch.randn(1, 6, 32, 32, device=device)\n",
     "dummy_style = torch.randn(1, 96, device=device)\n",
     "\n",
     ")\n",
     "\n",
     "onnx_size = os.path.getsize('pma_vae_decoder.onnx') / 1024**2\n",
+    "print(f'\\n\ud83d\udcf1 ONNX decoder exported!')\n",
     "print(f'   Size: {onnx_size:.1f} MB')\n",
     "print(f'   Ready for: Core ML, TFLite, ONNX Runtime Mobile')\n",
     "\n",
    "source": [
     "## 9. Progressive Resolution Training\n",
     "\n",
+    "After initial training at 256\u00d7256, progressively increase resolution.\n",
     "The model handles variable resolutions thanks to the convolutional architecture."
    ]
   },
     "# for pg in opt_disc.param_groups: pg['lr'] *= 0.5\n",
     "# \n",
     "# # Continue training (copy the training loop above with dataloader_hr)\n",
+    "# print(f'Phase 2: Training at {NEW_RESOLUTION}\u00d7{NEW_RESOLUTION}')\n",
     "# print(f'Batches per epoch: {len(dataloader_hr)}')"
    ]
   },
     "model.eval()\n",
     "\n",
     "# Take a high-res image and downsample it\n",
+    "hr_img = test_batch[0:1]  # 256\u00d7256\n",
+    "lr_img = F.interpolate(hr_img, scale_factor=0.5, mode='bilinear', align_corners=False)  # 128\u00d7128\n",
     "lr_upscaled = F.interpolate(lr_img, size=(256, 256), mode='bilinear', align_corners=False)\n",
     "\n",
     "with torch.no_grad():\n",
     "| Component | Choice | Why |\n",
     "|---|---|---|\n",
     "| Backbone | MobileConv + Parallel 2D Mamba | Fast, efficient, attention-free |\n",
+    "| Downsampling | PixelUnshuffle \u2192 stride-2 conv | Lossless initial features |\n",
     "| Upsampling | PixelShuffle (sub-pixel) | Mobile-friendly, no checkerboard |\n",
     "| Latent | Multi-scale (base/detail/style) | Controllable, prevents collapse |\n",
     "| Style control | FiLM conditioning | Lightweight, multiplicative |\n",
     "\n",
     "| Feature | PMA-VAE | SD-VAE | NVAE |\n",
     "|---|---|---|---|\n",
+    "| Attention-free | \u2705 | \u274c | \u274c |\n",
+    "| Mobile-friendly decoder | \u2705 | \u274c | \u274c |\n",
+    "| Multi-scale latent | \u2705 | \u274c | \u2705 |\n",
+    "| Style control built-in | \u2705 | \u274c | \u274c |\n",
     "| Decoder params | ~4-8M | ~50M | ~100M+ |\n",
+    "| Parallel training | \u2705 | \u2705 | \u2705 |\n",
+    "| Free Colab trainable | \u2705 | \u274c | \u274c |"
    ]
   },
   {
    "source": [
     "---\n",
     "\n",
+    "## \ud83d\udcda References\n",
     "\n",
     "- **Mamba**: Gu & Dao, 2023. [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752)\n",
     "- **VMamba**: Liu et al., 2024. [VMamba: Visual State Space Model](https://arxiv.org/abs/2401.10166)\n",
    ]
   }
  ]
+}