
GEN AI – Programming Assignment

Generative Models


Submission Policy

Both questions must be submitted together in a SINGLE zip file named:

{NAME}_{STUDENT_ID}.zip

The zip file must contain all code folders for both questions and one combined PDF report.

Do NOT include datasets, model checkpoints, or large binary files.


Question 1 • 25 Marks

Denoising Diffusion Probabilistic Models (DDPM)

Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.

Environment Setup

Create a conda environment named ddpm and install PyTorch:

conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt

Code Structure

ddpm_assignment/
├── 2d_plot_diffusion_todo/        (Task 1)
│   ├── ddpm_tutorial.ipynb        <-- Main notebook
│   ├── dataset.py                 <-- Swiss-roll, moon, gaussians
│   ├── network.py                 <-- (TODO) Noise prediction network
│   └── ddpm.py                    <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/             (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py   <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py      <-- Basic UNet components
│   │   ├── controlnet.py              <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py     <-- Diffusion pipeline with ControlNet
│   ├── train.py                       <-- Training code
│   ├── train.sh                       <-- Hyperparameter script
│   └── inference.ipynb                <-- Inference notebook
└── requirements.txt

Background

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.

A typical DDPM pipeline consists of three components:

  • Forward Process: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
  • Reverse Process: A learned neural network iteratively denoises x_T back to x₀, step by step
  • Training Objective: The network is trained with a simplified noise-matching loss, predicting the noise ε added at each step

Task 1: Simple DDPM Pipeline with Swiss-Roll

In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.

After completing your implementation, train the model and evaluate it by running ddpm_tutorial.ipynb in the 2d_plot_diffusion_todo directory.

TODO

1-1: Build a Noise Prediction Network

Implement the noise prediction network in network.py. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of TimeLinear layers with feature dimensions:

[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
  • Every TimeLinear layer except the final output layer must be followed by a ReLU activation
  • The final layer has no activation; it directly outputs the predicted noise

⬑ Hint TimeLinear is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
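For concreteness, here is a minimal sketch of how such a timestep-conditioned linear layer could look. The class name, constructor arguments, and the assumption that the layer receives a precomputed sinusoidal embedding t_emb are illustrative; the actual TimeLinear interface in the starter code may differ.

import torch
import torch.nn as nn

class TimeLinearSketch(nn.Module):
    # Linear layer whose output is shifted by a projected timestep embedding,
    # before the activation is applied by the surrounding network.
    def __init__(self, dim_in, dim_out, dim_time_emb):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.time_proj = nn.Linear(dim_time_emb, dim_out)

    def forward(self, x, t_emb):
        # Add the projected timestep embedding to the hidden features.
        return self.fc(x) + self.time_proj(t_emb)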

1-2: Construct the Forward and Reverse Process

In ddpm.py, implement the three core functions of the DDPM pipeline:

  • q_sample(x_0, t, noise): The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula:

    x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,  where ε ~ N(0, I)

  • p_sample(x_t, t): One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t-1}

  • p_sample_loop(shape): Full reverse process. Starting from x_T ~ N(0, I), iterate p_sample() from t=T down to t=1 and return the final sample x₀

⬑ Important Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
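As a shape reference, here is a compact sketch of the forward process and the reverse-sampling loop. The schedule tensor name (alphas_cumprod), its indexing convention, and the 2D reshape are assumptions; adapt them to the starter code's pre-computed schedule.

import torch

def q_sample_sketch(x0, t, noise, alphas_cumprod):
    # Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    # alphas_cumprod is assumed to be the precomputed cumulative product of alpha_t.
    abar_t = alphas_cumprod[t].view(-1, 1)                  # (batch, 1) for 2D data
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

def p_sample_loop_sketch(p_sample, shape, T, device):
    # Full reverse process: start from pure noise and denoise step by step.
    x_t = torch.randn(shape, device=device)                 # x_T ~ N(0, I)
    for t in reversed(range(1, T + 1)):                      # t = T, ..., 1
        x_t = p_sample(x_t, t)
    return x_t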

1-3: Implement the Training Objective

In ddpm.py, implement compute_loss(). This function should:

  1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
  2. Sample noise ε ~ N(0, I) of the same shape as the input x₀
  3. Compute the noised sample x_t using q_sample()
  4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
  5. Return the simplified noise-matching loss: L = ||ε − ε̂||² (see the sketch after this list)
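A hedged sketch of these steps is shown below; the function signature, the model call convention, and the assumption of a 0-indexed schedule tensor of length T are illustrative and should be matched to the starter code.

import torch
import torch.nn.functional as F

def compute_loss_sketch(model, x0, T, alphas_cumprod):
    # Simplified noise-matching loss; names and indexing are assumptions.
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)   # t ~ U{1, ..., T}
    noise = torch.randn_like(x0)                                     # eps ~ N(0, I)
    abar_t = alphas_cumprod[t - 1].view(-1, 1)                       # assumes 0-indexed schedule
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise         # q_sample
    eps_pred = model(x_t, t)                                         # predicted noise
    return F.mse_loss(eps_pred, noise)                               # ||eps - eps_hat||^2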

1-4: Training and Evaluation

Once your implementation is complete, open and run ddpm_tutorial.ipynb via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.

Include in your report:

  • The training loss curve
  • The Chamfer Distance (CD) value reported after running the notebook
  • A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution

Task 2: ControlNet on Fill50K Dataset

In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching a trainable copy of the encoder blocks connected through zero-convolution layers.

Prerequisites: Hugging Face Setup

Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:

$ huggingface-cli login

Install the ControlNet environment:

conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt

Verify your setup by generating a test image with Stable Diffusion:

import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")

TODO

Task 0: Generate Baseline Images

Using the 5 text prompts in ./task_1_controlnet/data/test_prompts.json, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.

2-1: Implement Zero-Convolution

In diffusion/controlnet.py (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.

⬑ Hint Use nn.Conv2d(channels, channels, kernel_size=1) and explicitly set weight.data and bias.data to zero after initialization.
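One possible realization of the hint is sketched below; the helper name is illustrative, not part of the starter code.

import torch.nn as nn

def make_zero_conv(channels):
    # 1x1 convolution whose weights and biases start at zero, so the
    # ControlNet branch contributes nothing at the beginning of training.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv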

2-2: Initialize ControlNet from Pretrained UNet

In diffusion/controlnet.py (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
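A common way to perform this kind of copy is to load the pretrained encoder's state dict into the matching ControlNet submodules. The attribute names below (conv_in, down_blocks) are assumptions about the module layout, not the exact starter-code API; treat this as a schematic only.

def init_controlnet_from_unet_sketch(controlnet, unet):
    # Schematic: copy pretrained UNet encoder weights into the ControlNet encoder.
    # Attribute names are assumptions about the module layout.
    controlnet.conv_in.load_state_dict(unet.conv_in.state_dict())
    controlnet.down_blocks.load_state_dict(unet.down_blocks.state_dict())
    # If the architectures match exactly, copy.deepcopy(unet.down_blocks) also works.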

2-3: Apply Zero-Convolution to Residual Features

In diffusion/controlnet.py (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:

h_out = ZeroConv(h)

2-4: Integrate ControlNet Outputs into UNet

In diffusion/unets/unet_2d_condition.py (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:

decoder_input = decoder_input + controlnet_residual

⬑ Important Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.

2-5: Train and Evaluate

Train ControlNet on the Fill50K dataset (automatically downloaded by the load_dataset() function in train.py) by running:

$ sh train.sh

Then, run inference.ipynb to generate images conditioned on 5 different edge maps from ./data/test_conditions, using the text prompts in data/test_prompts.json.

Include in your report:

  • The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
  • The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
  • A brief analysis of each condition: does the generated image accurately follow the edge map?

Question 2 • 25 Marks

Generative Adversarial Networks (GAN)

Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.

Environment Setup

Create a conda environment named gan_assignment and install the required packages:

conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt

The requirements.txt includes: numpy, matplotlib, scipy, tqdm, and jupyter.

Code Structure

gan_assignment/
├── task_1_vanilla_gan/        (Task 1)
│   ├── gan_tutorial.ipynb     <-- Main notebook
│   ├── dataset.py             <-- 2D toy dataset definitions
│   ├── network.py             <-- (TODO) Generator & Discriminator
│   └── gan.py                 <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/              (Task 2)
│   ├── dcgan_tutorial.ipynb   <-- Main notebook
│   ├── network.py             <-- (TODO) DCGAN architecture
│   └── dcgan.py               <-- (TODO) DCGAN training loop
└── requirements.txt

Task 1: Vanilla GAN on 2D Swiss-Roll Data

Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.

TODO

1-1: Build the Generator Network

Implement the Generator class in network.py. The Generator maps a noise vector z to a 2D output point:

  • Input: noise vector z of shape (batch_size, latent_dim), with latent_dim = 16 by default
  • Architecture: fully-connected layers with dimensions [latent_dim, dim_hids[0], …, dim_hids[-1], 2]
  • Activation: ReLU after every hidden layer (except the final output layer)
  • Output: 2D point of shape (batch_size, 2) with a Tanh activation on the last layer

⬑ Hint Use nn.Sequential or nn.ModuleList to stack your layers.
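A minimal sketch of such a generator is shown below; the class name and the default dim_hids are assumptions, so match the constructor to the starter code.

import torch.nn as nn

class GeneratorSketch(nn.Module):
    # MLP from a latent vector to a 2D point: ReLU hidden layers, Tanh output.
    def __init__(self, latent_dim=16, dim_hids=(128, 128)):
        super().__init__()
        dims = [latent_dim, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(dims[-1], 2), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)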

1-2: Build the Discriminator Network

Implement the Discriminator class in network.py. The Discriminator takes a 2D point and outputs a real/fake probability:

  • Input: a 2D point of shape (batch_size, 2)
  • Architecture: fully-connected layers with dimensions [2, dim_hids[0], …, dim_hids[-1], 1]
  • Activation: LeakyReLU (negative slope = 0.2) after every hidden layer
  • Output: a scalar of shape (batch_size, 1) with a Sigmoid activation to produce a probability in [0, 1]
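A matching discriminator sketch, under the same caveat that names and defaults are illustrative:

import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    # MLP from a 2D point to a real/fake probability in [0, 1].
    def __init__(self, dim_hids=(128, 128)):
        super().__init__()
        dims = [2, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.2)]
        layers += [nn.Linear(dims[-1], 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)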

1-3: Implement the GAN Training Step

In gan.py, implement the train_step() function which performs one full update of both G and D:

1. Discriminator update:

  • Sample a real batch x from the dataset

  • Sample z ~ N(0, I) and generate fake samples: x_fake = G(z)

  • Compute the discriminator BCE loss:

    L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
    
  • Zero grad on D optimizer, backpropagate, and update D only

2. Generator update:

  • Sample a new batch of z ~ N(0, I)

  • Compute the non-saturating generator loss:

    L_G = −E[log D(G(z))]
    
  • Zero grad on G optimizer, backpropagate, and update G only

⬑ Important Always call .detach() on x_fake before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
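Putting the two updates together, a schematic train_step might look like the following; the function signature and argument names are assumptions, not the starter-code API.

import torch
import torch.nn.functional as F

def train_step_sketch(G, D, x_real, opt_G, opt_D, latent_dim, device):
    # One adversarial update: discriminator first, then generator.
    batch_size = x_real.size(0)
    ones = torch.ones(batch_size, 1, device=device)
    zeros = torch.zeros(batch_size, 1, device=device)

    # Discriminator update (detach fakes so no gradients flow into G)
    z = torch.randn(batch_size, latent_dim, device=device)
    x_fake = G(z)
    loss_D = F.binary_cross_entropy(D(x_real), ones) + \
             F.binary_cross_entropy(D(x_fake.detach()), zeros)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update (non-saturating loss)
    z = torch.randn(batch_size, latent_dim, device=device)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()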

1-4: Implement the Sampling Function

In gan.py, implement sample(G, n_samples, latent_dim, device):

  • Sample n_samples noise vectors z from N(0, I) with shape (n_samples, latent_dim)
  • Pass through G to get generated 2D points
  • Return as a NumPy array of shape (n_samples, 2)
  • Use torch.no_grad() to disable gradient tracking during inference
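A minimal sketch of this sampling helper, assuming the signature given above:

import torch

def sample_sketch(G, n_samples, latent_dim, device):
    # Generate points without tracking gradients and return them as NumPy.
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim, device=device)
        points = G(z)
    return points.cpu().numpy()     # shape (n_samples, 2)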

1-5: Training and Evaluation

Run gan_tutorial.ipynb. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.

Include in your report:

  • G and D training loss curves (on the same plot or side-by-side)
  • The Chamfer Distance (CD) value
  • A scatter plot of generated 2D points vs. real Swiss-Roll data
  • Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?

Task 2: Deep Convolutional GAN (DCGAN) on MNIST

Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.

TODO

2-1: Implement the DCGAN Generator

Implement DCGenerator in task_2_dcgan/network.py using transposed convolutions to upsample from noise to a full image:

  • Input: noise vector z of shape (batch_size, latent_dim, 1, 1), where latent_dim = 100
  • Use ConvTranspose2d layers to upsample progressively to (1, 28, 28)
  • Channel sequence: latent_dim → 256 → 128 → 64 → 1
  • Apply BatchNorm2d + ReLU after every ConvTranspose2d except the last
  • Apply Tanh to the final output

⬑ Tip ConvTranspose2d(kernel_size=4, stride=2, padding=1) doubles spatial resolution. Use kernel_size=4, stride=1, padding=0 for the first layer to go from 1×1 to 4×4.
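One layer configuration that reaches (1, 28, 28) with the stated channel sequence is sketched below. The kernel/stride choices (in particular the 3×3 transposed convolution used to reach 7×7) are assumptions; the starter code may use different shapes or resize the images.

import torch.nn as nn

class DCGeneratorSketch(nn.Module):
    # Upsample a (latent_dim, 1, 1) noise vector to a (1, 28, 28) image.
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0),  # 1x1  -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 3, 2, 1),         # 4x4  -> 7x7
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),          # 7x7  -> 14x14
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1),            # 14x14 -> 28x28
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)   # z: (batch, latent_dim, 1, 1)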

2-2: Implement the DCGAN Discriminator

Implement DCDiscriminator in task_2_dcgan/network.py using strided convolutions to downsample the input image:

  • Input: grayscale image of shape (batch_size, 1, 28, 28)
  • Use Conv2d layers to downsample to a single scalar output
  • Channel sequence: 1 → 64 → 128 → 256 → 1
  • Apply BatchNorm2d + LeakyReLU (slope 0.2) after every Conv2d except the first and last
  • Apply Sigmoid to the final output

⬑ Important Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
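A corresponding discriminator sketch; kernel sizes and strides are assumptions chosen to reduce 28×28 to a single scalar.

import torch.nn as nn

class DCDiscriminatorSketch(nn.Module):
    # Downsample a (1, 28, 28) image to a real/fake probability.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1),       # 28 -> 14 (no BatchNorm on raw pixels)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1),     # 14 -> 7
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1),    # 7 -> 3
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 3, 1, 0),      # 3 -> 1 (no BatchNorm on the last layer)
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)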

2-3: Implement the DCGAN Training Loop

In task_2_dcgan/dcgan.py, implement train_one_epoch() which iterates over the full MNIST training set for one epoch. For each mini-batch:

1. Discriminator update:

  • BCE loss on real images (label = 1) → L_D_real
  • BCE loss on fake images G(z) (label = 0) → L_D_fake
  • L_D = L_D_real + L_D_fake → zero_grad, backward, step D optimizer

2. Generator update:

  • Generate new fake images and compute: L_G = BCE(D(G(z)), 1)
  • zero_grad, backward, step G optimizer
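A schematic version of the per-batch updates is given below; the function signature and argument names are illustrative, not the starter-code API.

import torch
import torch.nn.functional as F

def train_one_epoch_sketch(G, D, loader, opt_G, opt_D, latent_dim, device):
    # One pass over the MNIST training set; loader yields (image, label) pairs.
    for x_real, _ in loader:
        x_real = x_real.to(device)
        b = x_real.size(0)
        real_labels = torch.ones(b, 1, device=device)
        fake_labels = torch.zeros(b, 1, device=device)

        # Discriminator update: BCE on real (label 1) and fake (label 0) images
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        x_fake = G(z)
        loss_D = F.binary_cross_entropy(D(x_real), real_labels) + \
                 F.binary_cross_entropy(D(x_fake.detach()), fake_labels)
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Generator update: push D(G(z)) toward the "real" label
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_G = F.binary_cross_entropy(D(G(z)), real_labels)
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()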

2-4: Weight Initialization

Implement weights_init() in task_2_dcgan/network.py and apply it via model.apply(weights_init):

  • Conv2d and ConvTranspose2d: initialize weights ~ N(0, 0.02)
  • BatchNorm2d: initialize weights ~ N(1.0, 0.02), bias = 0
  • All other layer types: leave unchanged

⬑ Hint Use isinstance(m, nn.Conv2d) to check layer types. Use torch.nn.init.normal_() for weight initialization.
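A compact sketch following the hint; the function name is illustrative.

import torch.nn as nn

def weights_init_sketch(m):
    # DCGAN-style initialization, applied via model.apply(weights_init_sketch).
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, 1.0, 0.02)
        nn.init.zeros_(m.bias)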

2-5: Training and Evaluation

Run dcgan_tutorial.ipynb. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.

Include in your report:

  • G and D training loss curves over all iterations
  • A 4×8 grid of generated MNIST digits from your final trained model
  • The FID score reported by the notebook
  • Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability

Combined Submission Instructions

Both questions, one zip file, one PDF report

What to Submit

You will submit everything for both Question 1 (DDPM) and Question 2 (GAN) in a single zip file. There is no separate submission per question.

Zip File Structure

Your zip file must follow this exact folder layout:

{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                <-- Your implementation
│   │   └── ddpm.py                   <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py         <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py   <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                <-- Your implementation
│   │   └── gan.py                    <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                <-- Your implementation
│       └── dcgan.py                  <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf           <-- Combined report

Combined PDF Report

Write one single PDF report named {NAME}_{STUDENT_ID}.pdf that covers both questions. The report must not exceed 5 pages (excluding references). It should contain the following sections in order:

Section 1 – DDPM (Question 1):

  • Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
  • Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis

Section 2 – GAN (Question 2):

  • Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
  • Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis

Naming Convention

  Item          Format
  Zip file      {NAME}_{STUDENT_ID}.zip (e.g. JOHN_DOE_2024001.zip)
  PDF report    {NAME}_{STUDENT_ID}.pdf (e.g. JOHN_DOE_2024001.pdf)

Do NOT include in your zip:

  • Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
  • Model checkpoints (.pth, .ckpt files)
  • Generated image folders
  • Pretrained model weights (e.g., the Stable Diffusion checkpoint)

Academic Integrity

You may consult the following reference papers while working on this assignment:

  • Ho et al. (2020). Denoising Diffusion Probabilistic Models.
  • Zhang et al. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).
  • Goodfellow et al. (2014). Generative Adversarial Networks.
  • Radford et al. (2015). Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).

It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.