Project 11 — Variational Autoencoder (VAE)

Level: Advanced | Dataset: MNIST (torchvision) | Framework: PyTorch

Objective

Build a VAE to learn a continuous, structured latent space and generate new digits. Cover: reparameterization trick, ELBO loss (reconstruction + KL divergence), latent space interpolation, conditional generation.

Project Structure

11_vae_mnist/
├── notebooks/
│   ├── 01_vae_theory.ipynb
│   ├── 02_train.ipynb
│   └── 03_latent_explore.ipynb
├── data/
├── models/model.pkl
├── charts/
├── path_utils.py
├── dashboard_core.py
└── app.py

Notebook 01 — Theory (`01_vae_theory.ipynb`)

STOP 1 — AE vs VAE Core Difference

Write theory cells explaining:

AE: x → z (point) → x̂ — deterministic latent space
VAE: x → (μ, σ) → z ~ N(μ, σ²) → x̂ — stochastic latent space
Run a simple AE on 2D toy data, show the disconnected latent space
Agent stops here. Explain:
- Why AE's latent space has "holes": decoder was never trained on points between training samples
- What "holes" cause: generating from the middle of latent space gives nonsense
- How VAE fixes this: forces latent space to be a continuous smooth Gaussian
- The key intuition: VAE learns WHERE to put things + HOW WIDE to make the region around them
Wait for user confirmation before continuing

STOP 2 — Probabilistic Encoder

In a VAE, the encoder outputs TWO vectors: mu and log_var (each shape [B, latent_dim])
From these, we sample: z = mu + epsilon * std where epsilon ~ N(0,1) and std = exp(0.5 * log_var)
Agent stops here. Explain:
- Why we output log_var not var: log_var can be any real number, var must be positive
- What the encoder is learning: a distribution over z, not a single point
- The sampling process: every forward pass samples a different z (stochastic)
- Why this stochasticity enables generation: we can sample from N(0,1) without needing an input
Wait for confirmation
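The sampling step described above can be sketched with stand-in tensors. This is only a minimal illustration: `mu` and `log_var` are set by hand here instead of coming from a real encoder, and the names `z_a` and `z_b` are made up for the example.

```python
import torch

torch.manual_seed(0)

B, latent_dim = 4, 20
# Stand-ins for encoder outputs (assumption: a real encoder would compute these from x).
mu = torch.zeros(B, latent_dim)
log_var = torch.full((B, latent_dim), -2.0)  # any real number is a valid log-variance

std = torch.exp(0.5 * log_var)  # exp() guarantees a positive standard deviation

# Two "forward passes" over the same mu and log_var draw different noise,
# so they produce different latent codes.
z_a = mu + torch.randn_like(std) * std
z_b = mu + torch.randn_like(std) * std
```

Both `z_a` and `z_b` have shape `[B, latent_dim]` and differ only through the sampled noise.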

STOP 3 — Reparameterization Trick

Write math cells:

Naive: z ~ N(μ, σ²) — cannot backpropagate through sampling (stochastic node)
Reparameterized: z = μ + σ * ε, where ε ~ N(0,1) — gradients flow through μ and σ
Implement both and show that the naive version breaks .backward(): the sampled z carries no gradient history, so nothing reaches μ or σ
Agent stops here. Explain:
- Why we can't backpropagate through a sampling operation (not a deterministic function)
- The trick: move the randomness to ε (a separate input), make z a DETERMINISTIC function of (μ, σ, ε)
- Why gradients now flow through μ and σ: they're just parameters in z = μ + σ * ε
- This is one of the most important tricks in modern deep learning
Wait for confirmation
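One possible way to show the difference is sketched below. It assumes `torch.distributions.Normal(...).sample()` as the "naive" sampler; the variable names are illustrative and not taken from the project notebooks.

```python
import torch

torch.manual_seed(0)
mu = torch.zeros(3, requires_grad=True)
log_var = torch.zeros(3, requires_grad=True)
std = torch.exp(0.5 * log_var)

# Naive: draw z directly from N(mu, std^2). sample() is not differentiable,
# so the result is detached from mu and std.
z_naive = torch.distributions.Normal(mu, std).sample()
try:
    z_naive.sum().backward()
    naive_failed = False
except RuntimeError:
    naive_failed = True  # z_naive has no grad_fn, so backward() raises

# Reparameterized: the randomness lives in eps, and z is a deterministic
# function of (mu, std, eps), so gradients reach mu and log_var.
eps = torch.randn_like(std)
z = mu + std * eps
z.sum().backward()
```

After the last line, `mu.grad` is all ones and `log_var.grad` equals `0.5 * std * eps`. `torch.distributions.Normal` also provides `rsample()`, which applies the same reparameterization internally.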

Notebook 02 — Training (`02_train.ipynb`)

STOP 4 — VAE Architecture

Encoder:
  Flatten (28*28=784) → Linear(784, 400) → ReLU
  → Linear(400, latent_dim*2) split into → mu [B, latent_dim], log_var [B, latent_dim]

Reparameterize: z = mu + exp(0.5*log_var) * epsilon

Decoder:
  Linear(latent_dim, 400) → ReLU
  → Linear(400, 784) → Sigmoid [output in (0,1), matching pixel values]

Use latent_dim=20
Agent stops here. Explain:
- Why Sigmoid at decoder output: MNIST pixels in [0,1]
- Why latent_dim=20: enough to encode digit identity + style
- How to split encoder output into mu and log_var: mu, log_var = out.chunk(2, dim=1), or equivalently out[:, :latent_dim] and out[:, latent_dim:]
- What the 20-dimensional z represents: each dimension captures some aspect of digit variation
Wait for confirmation
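The architecture above could be written roughly as follows. This is a sketch under assumptions: the class name `VAE`, the attribute names `enc` and `dec`, and the method layout are illustrative choices, not the project's confirmed code.

```python
import torch
from torch import nn


class VAE(nn.Module):
    """Fully connected VAE for 28x28 images (layout follows the spec above)."""

    def __init__(self, latent_dim=20):
        super().__init__()
        self.latent_dim = latent_dim
        self.enc = nn.Sequential(
            nn.Flatten(),                    # [B, 1, 28, 28] -> [B, 784]
            nn.Linear(784, 400), nn.ReLU(),
            nn.Linear(400, latent_dim * 2),  # mu and log_var, concatenated
        )
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 400), nn.ReLU(),
            nn.Linear(400, 784), nn.Sigmoid(),  # pixel values in (0, 1)
        )

    def encode(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        return mu, log_var

    def reparameterize(self, mu, log_var):
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def decode(self, z):
        return self.dec(z)

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterize(mu, log_var)
        return self.decode(z), mu, log_var


model = VAE(latent_dim=20)
x = torch.rand(8, 1, 28, 28)  # random stand-in for a batch of MNIST images
x_hat, mu, log_var = model(x)
```

Here `x_hat` has shape `[8, 784]`, and `mu` and `log_var` each have shape `[8, 20]`.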

STOP 5 — ELBO Loss Function

import torch
import torch.nn.functional as F

def elbo_loss(x, x_hat, mu, log_var, beta=1.0):
    """Negative ELBO per sample; x and x_hat must share a shape, values in [0, 1]."""
    # Reconstruction: binary cross entropy, summed over pixels and batch
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL divergence: push posterior N(mu, sigma^2) toward prior N(0, 1)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return (recon + beta * kl) / x.size(0)  # normalize by batch size

Agent stops here. Explain:
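- What ELBO means: Evidence Lower BOund, a lower bound on log p(x), so maximizing it pushes the log-likelihood up. elbo_loss returns the negative ELBO, so minimizing the loss maximizes the ELBO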
- The two terms:
  1. Reconstruction loss: AE objective — how well we reconstruct input
  2. KL divergence: regularizer — pushes z distribution toward standard Gaussian
- Why KL toward N(0,1): so we can sample from N(0,1) at generation time
- The KL formula: -0.5 * sum(1 + log_var - mu² - exp(log_var)) — derive this
- What β does (β-VAE): β>1 encourages more disentangled latent space
Wait for confirmation
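A sketch of the requested derivation for a single latent dimension, with q = N(μ, σ²) and p = N(0, 1). It uses only the two Gaussian moments E_q[(z − μ)²] = σ² and E_q[z²] = μ² + σ².

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(q \,\|\, p\big)
  &= \mathbb{E}_{q}\big[\log q(z) - \log p(z)\big] \\
  &= \mathbb{E}_{q}\Big[-\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{(z-\mu)^2}{2\sigma^2}
      + \tfrac{1}{2}\log(2\pi) + \tfrac{z^2}{2}\Big] \\
  &= -\tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2} + \tfrac{1}{2}\big(\mu^2 + \sigma^2\big) \\
  &= -\tfrac{1}{2}\big(1 + \log\sigma^2 - \mu^2 - \sigma^2\big)
\end{aligned}
```

Substituting log_var = log σ² and summing over latent dimensions (and over the batch) gives the expression used in elbo_loss.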

STOP 6 — KL Annealing

Start with β=0, linearly increase to β=1 over first 20 epochs
Plot reconstruction loss and KL loss separately per epoch
Agent stops here. Explain:
- What KL annealing solves: "posterior collapse" — without annealing, KL often collapses to 0 early
- Posterior collapse: model ignores z and decoder learns to reconstruct without latent code
- Why starting with β=0 allows the model to learn reconstruction first
- How to detect posterior collapse: KL term goes to 0, and the encoder outputs mu ≈ 0 and log_var ≈ 0 regardless of the input
Wait for confirmation
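The linear schedule can be expressed as a small helper. This is a sketch that assumes 0-indexed epochs and that β should reach its final value at epoch index 20; the name `kl_beta` and its parameters are illustrative.

```python
def kl_beta(epoch, warmup_epochs=20, beta_max=1.0):
    """Linear KL warm-up: 0 at epoch 0, beta_max from `warmup_epochs` onward."""
    return beta_max * min(1.0, epoch / warmup_epochs)


# Beta value for each of the 50 training epochs.
schedule = [kl_beta(epoch) for epoch in range(50)]
```

The value returned for the current epoch would be passed as `beta` to the loss function.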

STOP 7 — Training Loop

50 epochs, Adam lr=1e-3, batch_size=128
Track total ELBO, reconstruction term, KL term separately
Plot all three curves
Agent stops here. Explain:
- Why tracking reconstruction and KL separately is important (not just total loss)
- What healthy training looks like: reconstruction decreases while KL stays clearly above zero and levels off
- What unhealthy training looks like: KL → 0 (collapse) or reconstruction doesn't decrease
Wait for confirmation
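A minimal sketch of a loop that tracks the three quantities separately. To stay self-contained it makes several assumptions: `TinyVAE` is a hypothetical, much smaller stand-in for the project's model, random tensors replace MNIST so nothing is downloaded, and only 3 epochs are run.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class TinyVAE(nn.Module):
    """Hypothetical small stand-in for the project's VAE, kept tiny for speed."""

    def __init__(self, in_dim=784, hidden=32, latent_dim=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim * 2))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim), nn.Sigmoid())

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var


def train_epoch(model, loader, optimizer, beta=1.0):
    """Train for one epoch and return per-sample averages of each loss term."""
    model.train()
    sums = {"total": 0.0, "recon": 0.0, "kl": 0.0}
    n_seen = 0
    for x, _ in loader:
        x = x.view(x.size(0), -1)  # flatten images to [B, 784]
        x_hat, mu, log_var = model(x)
        recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        loss = (recon + beta * kl) / x.size(0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Accumulate the two terms separately so each can be plotted on its own.
        sums["recon"] += recon.item()
        sums["kl"] += kl.item()
        sums["total"] += recon.item() + beta * kl.item()
        n_seen += x.size(0)
    return {name: value / n_seen for name, value in sums.items()}


torch.manual_seed(0)
# Random "images" in [0, 1] stand in for MNIST (assumption for this sketch).
fake_images = torch.rand(256, 1, 28, 28)
fake_labels = torch.zeros(256, dtype=torch.long)
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=128)

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
history = [train_epoch(model, loader, optimizer, beta=1.0) for _ in range(3)]
```

Each entry of `history` is a dict with `total`, `recon`, and `kl`, which maps directly onto the three curves to plot.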

Notebook 03 — Latent Space Exploration (`03_latent_explore.ipynb`)

STOP 8 — 2D Latent Space Visualization

If latent_dim=2 (train a separate 2D version): plot all test digits in 2D z-space colored by digit label
If latent_dim=20: use t-SNE to project to 2D
Agent stops here. Explain:
- What we expect: each digit forms a cluster, similar digits (4 vs 9, 3 vs 8) cluster closer
- What "disentangled" means: different dimensions control different factors (style, rotation, thickness)
- How the VAE latent space is structured compared to regular AE (smooth, no holes)
Wait for confirmation

STOP 9 — Latent Space Interpolation

Encode two different digit images to get z1, z2
Generate 10 interpolated z values (endpoints included): z = (1-t)*z1 + t*z2 for 10 evenly spaced t in [0,1], e.g. torch.linspace(0, 1, 10)
Decode each z and display as a row of images
Agent stops here. Explain:
- Why interpolation works in VAE but not in AE: VAE latent space is continuous (no holes)
- What a smooth interpolation shows: gradual morphing from digit A to digit B
- What a broken AE interpolation shows: sudden jumps and nonsense images in the middle
- This is the key visual proof that VAE learns a better structured latent space
Wait for confirmation
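The interpolation formula can be sketched on its own, independent of any trained model. The helper name `interpolate_latents` is hypothetical, and the random `z1` and `z2` stand in for encoded digits.

```python
import torch


def interpolate_latents(z1, z2, steps=10):
    """Return `steps` latent vectors on the straight line from z1 to z2."""
    t = torch.linspace(0, 1, steps).unsqueeze(1)  # [steps, 1], broadcasts over latent dims
    return (1 - t) * z1 + t * z2                  # [steps, latent_dim]


torch.manual_seed(0)
z1 = torch.randn(20)  # stand-in for the latent code of digit A
z2 = torch.randn(20)  # stand-in for the latent code of digit B
zs = interpolate_latents(z1, z2, steps=10)
```

The first row of `zs` equals `z1` and the last equals `z2`; passing `zs` through the decoder would yield the row of morphing images.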

STOP 10 — Generation from Prior

Sample 64 z vectors from N(0,1): z = torch.randn(64, latent_dim)
Decode all 64 samples
Display as 8×8 grid of generated digits
Agent stops here. Explain:
- Why sampling from N(0,1) works: the KL term pulled each posterior q(z|x) toward N(0,1), so the decoder was trained on z values from that region
- What good generation looks like: recognizable digits, diverse styles
- What bad generation looks like (poor training or posterior collapse): blurry identical images
- The fundamental difference from AE: AE cannot generate because we don't know which z values are valid
Wait for confirmation
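A rough sketch of sampling from the prior and tiling the results into an 8×8 grid. The `decoder` below is an untrained stand-in with the decoder shape from STOP 4 (an assumption for illustration), so its output is noise rather than digits.

```python
import torch
from torch import nn

torch.manual_seed(0)
latent_dim = 20

# Untrained stand-in decoder (assumption: a trained one would be loaded instead).
decoder = nn.Sequential(
    nn.Linear(latent_dim, 400), nn.ReLU(),
    nn.Linear(400, 784), nn.Sigmoid(),
)
decoder.eval()

with torch.no_grad():
    z = torch.randn(64, latent_dim)  # 64 samples from the N(0, I) prior
    imgs = decoder(z)                # [64, 784]

# Tile into 8 rows x 8 columns: [row, col, h, w] -> [row, h, col, w] -> [224, 224]
grid = imgs.view(8, 8, 28, 28).permute(0, 2, 1, 3).reshape(8 * 28, 8 * 28)
```

`grid` is a single `[224, 224]` tensor that can be shown with an image plotting call such as matplotlib's `imshow`.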

STOP 11 — Digit-Conditioned Generation (Simple)

For each of 10 digit classes: find all test samples, average their z vectors → class prototype z
Generate 10 images from the 10 prototype z vectors
Agent stops here. Explain:
- What "class prototype in latent space" means: center of the class cluster
- How to do true conditional generation (Conditional VAE — CVAE): feed label as input to encoder and decoder
- Why this simple approach works at all: VAE clusters same-class digits together in z-space
Wait for confirmation
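Computing the class prototypes could look like the sketch below. It assumes the latent codes are already collected in a tensor `z` of shape `[N, latent_dim]` with matching integer `labels`; random values stand in for real encodings, and `class_prototypes` is a hypothetical helper name.

```python
import torch


def class_prototypes(z, labels, num_classes=10):
    """Average the latent vectors of each class; returns [num_classes, latent_dim]."""
    protos = torch.zeros(num_classes, z.size(1))
    for c in range(num_classes):
        # Note: a class with no samples would produce NaNs here.
        protos[c] = z[labels == c].mean(dim=0)
    return protos


torch.manual_seed(0)
z = torch.randn(100, 20)         # stand-in for encoded test digits
labels = torch.arange(100) % 10  # stand-in labels, 10 samples per class
prototypes = class_prototypes(z, labels)
```

Decoding the 10 rows of `prototypes` would give one generated image per digit class.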

STOP 12 — Reconstruction Quality

Pick 20 test images
Show side by side: original | reconstruction
Compute SSIM (Structural Similarity) between originals and reconstructions
Agent stops here. Explain:
- Why reconstructions look slightly blurry: the decoder is trained to output the per-pixel average over all images consistent with a sampled z, and averaging smooths sharp edges
- The VAE-GAN tradeoff: VAE → blurry but stable, GAN → sharp but training unstable
- What SSIM measures vs MSE: SSIM accounts for structure, not just pixel values
Wait for confirmation

STOP 13 — Save & Generation Function

Save model.state_dict()
Write generate(n=16) → n images sampled from prior
Write encode_and_reconstruct(pil_image) → z_vector, reconstructed_image
Write interpolate(img1, img2, steps=10) → list of 10 images
Agent stops here. Explain:
- Why generation requires no input (unlike all previous projects)
- The three use modes of a trained VAE: reconstruct, encode, generate
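- Where torch.no_grad() belongs: none of these operations strictly requires it, but reconstruct, encode, and generate should all run under it at inference time to skip building the autograd graph; only training needs gradients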
Wait for confirmation
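A sketch of the save, reload, and generate round trip. Several things here are assumptions: `make_decoder` is a hypothetical helper that rebuilds the decoder shape from STOP 4, a temporary directory replaces the project's models/ folder, and `generate` takes the model as an explicit argument.

```python
import os
import tempfile

import torch
from torch import nn


def make_decoder(latent_dim=20):
    """Hypothetical helper: rebuilds the decoder architecture from STOP 4."""
    return nn.Sequential(
        nn.Linear(latent_dim, 400), nn.ReLU(),
        nn.Linear(400, 784), nn.Sigmoid(),
    )


def generate(model, n=16, latent_dim=20):
    """Decode n samples from the N(0, I) prior into [n, 1, 28, 28] images."""
    model.eval()
    with torch.no_grad():  # inference only, no autograd graph needed
        z = torch.randn(n, latent_dim)
        return model(z).view(n, 1, 28, 28)


torch.manual_seed(0)
decoder = make_decoder()  # stand-in for a trained decoder

# Save only the weights, then load them into a freshly built module.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
torch.save(decoder.state_dict(), path)

restored = make_decoder()
restored.load_state_dict(torch.load(path, map_location="cpu"))

images = generate(restored, n=16)
```

Because only the `state_dict` is saved, the loading side has to construct the same architecture before calling `load_state_dict`.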

`dashboard_core.py`

Functions:

load_model() → model
generate_digits(n=16) → grid image
interpolate(z1, z2, steps=10) → list of PIL images
get_latent_viz() → 2D coords + labels for all test digits
get_training_curves() → recon_loss, kl_loss, total_loss arrays

`app.py` — Streamlit (~80 lines)

Sections:

"Generate" button → display 8×8 grid of generated digits
Upload digit image → show reconstruction + z vector values
Tab 1: ELBO curve (split into recon + KL)
Tab 2: t-SNE latent space scatter plot colored by digit
Tab 3: Interpolation visualization between two selected digits

Key Concepts Covered

AE vs VAE: deterministic vs stochastic latent space
Reparameterization trick (what makes the sampling step differentiable)
ELBO loss = reconstruction + β * KL divergence
KL divergence math: -0.5 * sum(1 + log_var - mu² - exp(log_var))
Posterior collapse and KL annealing
Latent space interpolation (proof of continuity)
Generation from N(0,1) prior
β-VAE for disentanglement