# GEN AI — Programming Assignment

## Generative Models

---

## Submission Policy

Both questions must be submitted together in a **SINGLE zip file** named:

```
{NAME}_{STUDENT_ID}.zip
```

The zip file must contain all code folders for both questions and one combined PDF report. **Do NOT include** datasets, model checkpoints, or large binary files.

---

# Question 1 • 25 Marks

## Denoising Diffusion Probabilistic Models (DDPM)

> Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.

### Environment Setup

Create a conda environment named `ddpm` and install PyTorch:

```bash
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```

### Code Structure

```
ddpm_assignment/
├── 2d_plot_diffusion_todo/              (Task 1)
│   ├── ddpm_tutorial.ipynb              <-- Main notebook
│   ├── dataset.py                       <-- Swiss-roll, moon, gaussians
│   ├── network.py                       <-- (TODO) Noise prediction network
│   └── ddpm.py                          <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/                   (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py     <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py        <-- Basic UNet components
│   │   ├── controlnet.py                <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py       <-- Diffusion pipeline with ControlNet
│   ├── train.py                         <-- Training code
│   ├── train.sh                         <-- Hyperparameter script
│   └── inference.ipynb                  <-- Inference notebook
└── requirements.txt
```

### Background

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.

A typical DDPM pipeline consists of three components:

- **Forward Process**: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
- **Reverse Process**: A learned neural network iteratively denoises x_T back to x₀, step by step
- **Training Objective**: The network is trained using a simplified noise-matching loss — predicting the noise ε added at each step

---

## Task 1: Simple DDPM Pipeline with Swiss-Roll

In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images. After completing your implementation, train the model and evaluate it by running `ddpm_tutorial.ipynb` in the `2d_plot_diffusion_todo` directory.

### TODO

#### 1-1: Build a Noise Prediction Network

Implement the noise prediction network in `network.py`. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of `TimeLinear` layers with feature dimensions:

```
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
```

- Every `TimeLinear` layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation — it directly outputs the predicted noise

> **⬡ Hint**
> `TimeLinear` is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
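As a shape reference, here is a minimal sketch of what 1-1 asks for. It assumes a hypothetical `TimeLinear` that adds a projected sinusoidal timestep embedding to its output; the starter code's actual `TimeLinear` signature may differ, so follow `network.py`:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    # Standard sinusoidal embedding of integer timesteps -> (B, dim).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class TimeLinear(nn.Module):
    # Hypothetical version: a linear layer whose output is shifted by a
    # projection of the timestep embedding, before any activation.
    def __init__(self, dim_in, dim_out, dim_time):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.time_proj = nn.Linear(dim_time, dim_out)

    def forward(self, x, t_emb):
        return self.fc(x) + self.time_proj(t_emb)

class NoisePredictionNet(nn.Module):
    def __init__(self, dim_in, dim_hids, dim_out, dim_time=64):
        super().__init__()
        self.dim_time = dim_time
        dims = [dim_in, *dim_hids, dim_out]
        self.layers = nn.ModuleList(
            TimeLinear(d_i, d_o, dim_time) for d_i, d_o in zip(dims[:-1], dims[1:])
        )

    def forward(self, x, t):
        t_emb = timestep_embedding(t, self.dim_time)
        for i, layer in enumerate(self.layers):
            x = layer(x, t_emb)
            if i < len(self.layers) - 1:  # ReLU on every layer except the last
                x = torch.relu(x)
        return x  # raw predicted noise, no output activation
```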
#### 1-2: Construct the Forward and Reverse Process

In `ddpm.py`, implement the three core functions of the DDPM pipeline:

- **`q_sample(x_0, t, noise)`**: The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula:

  ```
  x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,  where ε ~ N(0, I)
  ```

- **`p_sample(x_t, t)`**: One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t−1}
- **`p_sample_loop(shape)`**: Full reverse process. Starting from x_T ~ N(0, I), iterate `p_sample()` from t=T down to t=1 and return the final sample x₀

> **⬡ Important**
> Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.

#### 1-3: Implement the Training Objective

In `ddpm.py`, implement `compute_loss()`. This function should:

1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
2. Sample noise ε ~ N(0, I) of the same shape as the input x₀
3. Compute the noised sample x_t using `q_sample()`
4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
5. Return the simplified noise-matching loss: **L = ||ε − ε̂||²**
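Putting 1-2 and 1-3 together, the training-side computation might look like the sketch below. It assumes `alphas_cumprod` is the pre-computed ᾱ buffer from the starter schedule, indexed so that `alphas_cumprod[t]` returns ᾱ_t; the names are illustrative, not the starter code's API:

```python
import torch
import torch.nn.functional as F

def q_sample(x0, t, noise, alphas_cumprod):
    # Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar_t = alphas_cumprod[t].unsqueeze(-1)  # (B, 1), broadcasts over features
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

def compute_loss(model, x0, T, alphas_cumprod):
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)  # t ~ U{1, ..., T}
    noise = torch.randn_like(x0)                                   # eps ~ N(0, I)
    x_t = q_sample(x0, t, noise, alphas_cumprod)                   # noised sample
    eps_hat = model(x_t, t)                                        # predicted noise
    return F.mse_loss(eps_hat, noise)                              # ||eps - eps_hat||^2
```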
#### 1-4: Training and Evaluation

Once your implementation is complete, open and run `ddpm_tutorial.ipynb` via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.

**Include in your report:**

- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution

---

## Task 2: ControlNet on Fill50K Dataset

In this task, you will implement ControlNet — a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching a trainable copy of its encoder blocks, connected through zero-convolution layers.

### Prerequisites: Hugging Face Setup

Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:

- Sign in to Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:

```bash
huggingface-cli login
```

Install the ControlNet environment:

```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```

Verify your setup by generating a test image with Stable Diffusion:

```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
```

### TODO

#### Task 0: Generate Baseline Images

Using the 5 text prompts in `./task_1_controlnet/data/test_prompts.json`, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.

#### 2-1: Implement Zero-Convolution

In `diffusion/controlnet.py` (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.

> **⬡ Hint**
> Use `nn.Conv2d(channels, channels, kernel_size=1)` and explicitly set `weight.data` and `bias.data` to zero after initialization.
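A minimal sketch of that hint, using `nn.init.zeros_` (equivalent to zeroing `weight.data` and `bias.data` directly; the `channels` argument is illustrative):

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose parameters start at exactly zero, so the
    # ControlNet branch contributes nothing at initialization.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```

Because the output starts at zero, `h_out = ZeroConv(h)` in 2-3 below is exactly zero at the first training step, which is why adding the residuals in 2-4 leaves the pretrained UNet's behavior initially unchanged.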
#### 2-2: Initialize ControlNet from Pretrained UNet

In `diffusion/controlnet.py` (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.

#### 2-3: Apply Zero-Convolution to Residual Features

In `diffusion/controlnet.py` (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:

```
h_out = ZeroConv(h)
```

#### 2-4: Integrate ControlNet Outputs into UNet

In `diffusion/unets/unet_2d_condition.py` (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:

```
decoder_input = decoder_input + controlnet_residual
```

> **⬡ Important**
> Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.

#### 2-5: Train and Evaluate

Train ControlNet on the Fill50K dataset (automatically downloaded by the `load_dataset()` function in `train.py`) by running:

```bash
sh train.sh
```

Then, run `inference.ipynb` to generate images conditioned on 5 different edge maps from `./data/test_conditions`, using the text prompts in `data/test_prompts.json`.

**Include in your report:**

- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?

---

# Question 2 • 25 Marks

## Generative Adversarial Networks (GAN)

> Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.

### Environment Setup

Create a conda environment named `gan_assignment` and install the required packages:

```bash
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```

The `requirements.txt` includes: `numpy`, `matplotlib`, `scipy`, `tqdm`, and `jupyter`.

### Code Structure

```
gan_assignment/
├── task_1_vanilla_gan/          (Task 1)
│   ├── gan_tutorial.ipynb       <-- Main notebook
│   ├── dataset.py               <-- 2D toy dataset definitions
│   ├── network.py               <-- (TODO) Generator & Discriminator
│   └── gan.py                   <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/                (Task 2)
│   ├── dcgan_tutorial.ipynb     <-- Main notebook
│   ├── network.py               <-- (TODO) DCGAN architecture
│   └── dcgan.py                 <-- (TODO) DCGAN training loop
└── requirements.txt
```

---

## Task 1: Vanilla GAN on 2D Swiss-Roll Data

Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.

### TODO

#### 1-1: Build the Generator Network

Implement the `Generator` class in `network.py`. The Generator maps a noise vector z to a 2D output point:

- **Input**: noise vector z of shape `(batch_size, latent_dim)`, with `latent_dim = 16` by default
- **Architecture**: fully-connected layers with dimensions `[latent_dim, dim_hids[0], …, dim_hids[-1], 2]`
- **Activation**: ReLU after every hidden layer
- **Output**: 2D point of shape `(batch_size, 2)`, with a Tanh activation on the final layer

> **⬡ Hint**
> Use `nn.Sequential` or `nn.ModuleList` to stack your layers.

#### 1-2: Build the Discriminator Network

Implement the `Discriminator` class in `network.py`. The Discriminator takes a 2D point and outputs a real/fake probability:

- **Input**: a 2D point of shape `(batch_size, 2)`
- **Architecture**: fully-connected layers with dimensions `[2, dim_hids[0], …, dim_hids[-1], 1]`
- **Activation**: LeakyReLU (negative slope = 0.2) after every hidden layer
- **Output**: a scalar of shape `(batch_size, 1)` with a Sigmoid activation to produce a probability in [0, 1]

#### 1-3: Implement the GAN Training Step

In `gan.py`, implement the `train_step()` function, which performs one full update of both G and D (see the sketch after this section):

**1. Discriminator update:**

- Sample a real batch x from the dataset
- Sample z ~ N(0, I) and generate fake samples: `x_fake = G(z)`
- Compute the discriminator BCE loss:

  ```
  L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
  ```

- Zero grad on D optimizer, backpropagate, and update D only

**2. Generator update:**

- Sample a new batch of z ~ N(0, I)
- Compute the non-saturating generator loss:

  ```
  L_G = −E[log D(G(z))]
  ```

- Zero grad on G optimizer, backpropagate, and update G only

> **⬡ Important**
> Always call `.detach()` on `x_fake` before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
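Here is a sketch of the two updates, written directly from the loss formulas above. The argument list and the `1e-8` stabilizer inside the logs are illustrative choices, not the starter code's API:

```python
import torch

def train_step(G, D, g_opt, d_opt, x_real, latent_dim, device):
    # One adversarial update: D first, then G.
    batch_size = x_real.shape[0]

    # --- Discriminator update ---
    z = torch.randn(batch_size, latent_dim, device=device)
    x_fake = G(z)
    d_real = D(x_real)
    d_fake = D(x_fake.detach())  # detach: no gradients flow into G here
    loss_d = -(torch.log(d_real + 1e-8).mean()
               + torch.log(1 - d_fake + 1e-8).mean())
    d_opt.zero_grad()
    loss_d.backward()
    d_opt.step()

    # --- Generator update (non-saturating loss) ---
    z = torch.randn(batch_size, latent_dim, device=device)
    loss_g = -torch.log(D(G(z)) + 1e-8).mean()
    g_opt.zero_grad()
    loss_g.backward()
    g_opt.step()

    return loss_d.item(), loss_g.item()
```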
#### 1-4: Implement the Sampling Function

In `gan.py`, implement `sample(G, n_samples, latent_dim, device)`:

- Sample `n_samples` noise vectors z from N(0, I) with shape `(n_samples, latent_dim)`
- Pass them through G to get generated 2D points
- Return the result as a NumPy array of shape `(n_samples, 2)`
- Use `torch.no_grad()` to disable gradient tracking during inference

#### 1-5: Training and Evaluation

Run `gan_tutorial.ipynb`. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.

**Include in your report:**

- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?

---

## Task 2: Deep Convolutional GAN (DCGAN) on MNIST

Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.

### TODO

#### 2-1: Implement the DCGAN Generator

Implement `DCGenerator` in `task_2_dcgan/network.py` using transposed convolutions to upsample from noise to a full image:

- **Input**: noise vector z of shape `(batch_size, latent_dim, 1, 1)`, where `latent_dim = 100`
- Use `ConvTranspose2d` layers to upsample progressively to `(1, 28, 28)`
- **Channel sequence**: `latent_dim → 256 → 128 → 64 → 1`
- Apply `BatchNorm2d + ReLU` after every `ConvTranspose2d` except the last
- Apply Tanh to the final output

> **⬡ Tip**
> `ConvTranspose2d(kernel_size=4, stride=2, padding=1)` doubles spatial resolution. Use `kernel_size=4, stride=1, padding=0` for the first layer to go from 1×1 to 4×4.
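One layer layout that satisfies these constraints is sketched below. The 4×4 → 7×7 step uses `kernel_size=3` so that the subsequent doubling layers land exactly on 28×28; this sizing is an assumption, and the starter code may arrange its layers differently:

```python
import torch
import torch.nn as nn

class DCGenerator(nn.Module):
    # Sketch: latent (B, 100, 1, 1) -> grayscale image (B, 1, 28, 28).
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1),          # 4x4 -> 7x7
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),           # 7x7 -> 14x14
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),             # 14x14 -> 28x28
            nn.Tanh(),  # final activation, no BatchNorm/ReLU on the last layer
        )

    def forward(self, z):
        return self.net(z)
```

A quick shape check: `DCGenerator()(torch.randn(8, 100, 1, 1)).shape` should print `torch.Size([8, 1, 28, 28])`.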
#### 2-2: Implement the DCGAN Discriminator

Implement `DCDiscriminator` in `task_2_dcgan/network.py` using strided convolutions to downsample the input image:

- **Input**: grayscale image of shape `(batch_size, 1, 28, 28)`
- Use `Conv2d` layers to downsample to a single scalar output
- **Channel sequence**: `1 → 64 → 128 → 256 → 1`
- Apply LeakyReLU (negative slope 0.2) after every `Conv2d` except the last, with `BatchNorm2d` before the activation in every layer except the first and last
- Apply Sigmoid to the final output

> **⬡ Important**
> Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.

#### 2-3: Implement the DCGAN Training Loop

In `task_2_dcgan/dcgan.py`, implement `train_one_epoch()`, which iterates over the full MNIST training set for one epoch. For each mini-batch:

**1. Discriminator update:**

- BCE loss on real images (label = 1) → `L_D_real`
- BCE loss on fake images G(z) (label = 0) → `L_D_fake`
- `L_D = L_D_real + L_D_fake` → `zero_grad`, `backward`, `step` the D optimizer

**2. Generator update:**

- Generate new fake images and compute: `L_G = BCE(D(G(z)), 1)`
- `zero_grad`, `backward`, `step` the G optimizer

#### 2-4: Weight Initialization

Implement `weights_init()` in `task_2_dcgan/network.py` and apply it via `model.apply(weights_init)`:

- `Conv2d` and `ConvTranspose2d`: initialize weights ~ N(0, 0.02)
- `BatchNorm2d`: initialize weights ~ N(1.0, 0.02), bias = 0
- All other layer types: leave unchanged

> **⬡ Hint**
> Use `isinstance(m, nn.Conv2d)` to check layer types. Use `torch.nn.init.normal_()` for weight initialization.

#### 2-5: Training and Evaluation

Run `dcgan_tutorial.ipynb`. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.

**Include in your report:**

- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability

---

# Combined Submission Instructions

> **Both questions — one zip file — one PDF report**

## What to Submit

You will submit everything — both Question 1 (DDPM) and Question 2 (GAN) — in a single zip file. There is no separate submission per question.

### Zip File Structure

Your zip file must follow this exact folder layout:

```
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                    <-- Your implementation
│   │   └── ddpm.py                       <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py             <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py  <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                    <-- Your implementation
│   │   └── gan.py                        <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                    <-- Your implementation
│       └── dcgan.py                      <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf               <-- Combined report
```

**Do NOT include in your zip:**

- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (`.pth`, `.ckpt` files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)

### Combined PDF Report

Write one single PDF report named `{NAME}_{STUDENT_ID}.pdf` that covers both questions. The report **must not exceed 5 pages** (excluding references). It should contain the following sections in order:

**Section 1 — DDPM (Question 1):**

- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis

**Section 2 — GAN (Question 2):**

- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis

### Naming Convention

| Item       | Format                                                  |
|------------|---------------------------------------------------------|
| Zip file   | `{NAME}_{STUDENT_ID}.zip` — e.g. `JOHN_DOE_2024001.zip` |
| PDF report | `{NAME}_{STUDENT_ID}.pdf` — e.g. `JOHN_DOE_2024001.pdf` |

---

## Academic Integrity

You may consult the following reference papers while working on this assignment:

- Ho et al. (2020). *Denoising Diffusion Probabilistic Models.*
- Zhang et al. (2023). *Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).*
- Goodfellow et al. (2014). *Generative Adversarial Networks.*
- Radford et al. (2015). *Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).*

> It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.