GEN AI – Programming Assignment
Generative Models
Submission Policy
Both questions must be submitted together in a SINGLE zip file named:
{NAME}_{STUDENT_ID}.zip
The zip file must contain all code folders for both questions and one combined PDF report.
Do NOT include datasets, model checkpoints, or large binary files.
Question 1 • 25 Marks
Denoising Diffusion Probabilistic Models (DDPM)
Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.
Environment Setup
Create a conda environment named ddpm and install PyTorch:
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
Code Structure
ddpm_assignment/
├── 2d_plot_diffusion_todo/          (Task 1)
│   ├── ddpm_tutorial.ipynb          <-- Main notebook
│   ├── dataset.py                   <-- Swiss-roll, moon, gaussians
│   ├── network.py                   <-- (TODO) Noise prediction network
│   └── ddpm.py                      <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/               (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py    <-- Basic UNet components
│   │   ├── controlnet.py            <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py   <-- Diffusion pipeline with ControlNet
│   ├── train.py                     <-- Training code
│   ├── train.sh                     <-- Hyperparameter script
│   └── inference.ipynb              <-- Inference notebook
└── requirements.txt
Background
Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.
A typical DDPM pipeline consists of three components:
- Forward Process: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
- Reverse Process: A learned neural network iteratively denoises x_T back to x₀, step by step
- Training Objective: The network is trained with a simplified noise-matching loss: predicting the noise ε added at each step
Task 1: Simple DDPM Pipeline with Swiss-Roll
In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.
After completing your implementation, train the model and evaluate it by running ddpm_tutorial.ipynb in the 2d_plot_diffusion_todo directory.
TODO
1-1: Build a Noise Prediction Network
Implement the noise prediction network in network.py. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of TimeLinear layers with feature dimensions:
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
- Every TimeLinear layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation: it directly outputs the predicted noise
Hint: TimeLinear is a linear layer conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
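For concreteness, here is a minimal sketch of what such a layer could look like. The class name matches the starter code, but the constructor signature and the embedding interface are assumptions; follow the actual definition in network.py.

```python
import torch
import torch.nn as nn

class TimeLinear(nn.Module):
    # Illustrative sketch only: a linear layer whose output is shifted by a
    # learned projection of the timestep embedding. The real signature in
    # the starter code may differ.
    def __init__(self, dim_in: int, dim_out: int, dim_time_emb: int):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.time_proj = nn.Linear(dim_time_emb, dim_out)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Add the projected timestep embedding to the hidden features;
        # the ReLU is applied afterwards, outside this layer.
        return self.fc(x) + self.time_proj(t_emb)
```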
1-2: Construct the Forward and Reverse Process
In ddpm.py, implement the three core functions of the DDPM pipeline:
- q_sample(x_0, t, noise): The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, where ε ~ N(0, I)
- p_sample(x_t, t): One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t−1}
- p_sample_loop(shape): Full reverse process. Starting from x_T ~ N(0, I), iterate p_sample() from t = T down to t = 1 and return the final sample x₀
Important: Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
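As a reference point, here is a sketch of the forward-process formula in code. It takes the pre-computed ᾱ schedule as an explicit argument; in the starter code the schedule lives on the DDPM object, so adapt the access accordingly.

```python
import torch

def q_sample(x_0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor,
             alphas_cumprod: torch.Tensor) -> torch.Tensor:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise
    # alphas_cumprod has shape [T]; indexing by t gives abar_t per batch element.
    abar_t = alphas_cumprod[t].view(-1, 1)  # broadcast over the feature dimension
    return torch.sqrt(abar_t) * x_0 + torch.sqrt(1.0 - abar_t) * noise
```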
1-3: Implement the Training Objective
In ddpm.py, implement compute_loss(). This function should:
- Sample a random timestep t uniformly from {1, …, T} for each element in the batch
- Sample noise ε ~ N(0, I) of the same shape as the input x₀
- Compute the noised sample x_t using q_sample()
- Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
- Return the simplified noise-matching loss: L = ||ε − ε̂||²
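A sketch of these steps, reusing the q_sample signature from the sketch above and assuming 0-indexed timesteps (match the starter code's convention, which may use 1…T):

```python
import torch
import torch.nn.functional as F

def compute_loss(network, x_0: torch.Tensor,
                 alphas_cumprod: torch.Tensor, num_timesteps: int) -> torch.Tensor:
    # One random timestep per batch element.
    t = torch.randint(0, num_timesteps, (x_0.shape[0],), device=x_0.device)
    noise = torch.randn_like(x_0)                  # eps ~ N(0, I)
    x_t = q_sample(x_0, t, noise, alphas_cumprod)  # forward process
    pred_noise = network(x_t, t)                   # eps_hat
    return F.mse_loss(pred_noise, noise)           # mean ||eps - eps_hat||^2
```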
1-4: Training and Evaluation
Once your implementation is complete, open and run ddpm_tutorial.ipynb via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.
Include in your report:
- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution
Task 2: ControlNet on Fill50K Dataset
In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching trainable copies of its encoder blocks, connected through zero-convolution layers.
Prerequisites: Hugging Face Setup
Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:
- Sign into Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:
$ huggingface-cli login
Install the ControlNet environment:
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
Verify your setup by generating a test image with Stable Diffusion:
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
TODO
Task 0: Generate Baseline Images
Using the 5 text prompts in ./task_1_controlnet/data/test_prompts.json, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.
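A short sketch of how the baseline generation might look, assuming test_prompts.json is a plain JSON list of prompt strings (check the actual file format and adjust the parsing if it differs):

```python
import json
import torch
from diffusers import StableDiffusionPipeline

# Assumption: the file is a JSON array of 5 prompt strings.
with open("./task_1_controlnet/data/test_prompts.json") as f:
    prompts = json.load(f)

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")

for i, prompt in enumerate(prompts):
    pipe(prompt).images[0].save(f"baseline_{i}.png")  # one image per prompt
```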
2-1: Implement Zero-Convolution
In diffusion/controlnet.py (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.
Hint: Use nn.Conv2d(channels, channels, kernel_size=1) and explicitly set weight.data and bias.data to zero after initialization.
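A minimal sketch of the idea (the starter code may wrap this differently, e.g. as an nn.Module):

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weights and bias start at exactly zero, so the
    # ControlNet branch contributes nothing at the first training step.
    # nn.init.zeros_ has the same effect as assigning zeros to .data.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```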
2-2: Initialize ControlNet from Pretrained UNet
In diffusion/controlnet.py (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
2-3: Apply Zero-Convolution to Residual Features
In diffusion/controlnet.py (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:
h_out = ZeroConv(h)
2-4: Integrate ControlNet Outputs into UNet
In diffusion/unets/unet_2d_condition.py (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:
decoder_input = decoder_input + controlnet_residual
Important: Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.
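A quick sanity check of why this works: at initialization the zero-convolved residual is all zeros, so the element-wise addition is a no-op and the pretrained UNet behavior is preserved. This reuses the zero_conv sketch from 2-1.

```python
import torch

h = torch.randn(1, 64, 32, 32)              # a ControlNet block output (toy shape)
decoder_input = torch.randn(1, 64, 32, 32)  # the matching UNet decoder input
zc = zero_conv(64)                          # freshly initialized zero-convolution

# At step 0 the residual is exactly zero, so the sum equals the original input.
assert torch.allclose(decoder_input + zc(h), decoder_input)
```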
2-5: Train and Evaluate
Train ControlNet on the Fill50K dataset (automatically downloaded by the load_dataset() function in train.py) by running:
$ sh train.sh
Then, run inference.ipynb to generate images conditioned on 5 different edge maps from ./data/test_conditions, using the text prompts in data/test_prompts.json.
Include in your report:
- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?
Question 2 • 25 Marks
Generative Adversarial Networks (GAN)
Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.
Environment Setup
Create a conda environment named gan_assignment and install the required packages:
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
The requirements.txt includes: numpy, matplotlib, scipy, tqdm, and jupyter.
Code Structure
gan_assignment/
├── task_1_vanilla_gan/        (Task 1)
│   ├── gan_tutorial.ipynb     <-- Main notebook
│   ├── dataset.py             <-- 2D toy dataset definitions
│   ├── network.py             <-- (TODO) Generator & Discriminator
│   └── gan.py                 <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/              (Task 2)
│   ├── dcgan_tutorial.ipynb   <-- Main notebook
│   ├── network.py             <-- (TODO) DCGAN architecture
│   └── dcgan.py               <-- (TODO) DCGAN training loop
└── requirements.txt
Task 1: Vanilla GAN on 2D Swiss-Roll Data
Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.
TODO
1-1: Build the Generator Network
Implement the Generator class in network.py. The Generator maps a noise vector z to a 2D output point:
- Input: noise vector z of shape (batch_size, latent_dim), with latent_dim = 16 by default
- Architecture: fully-connected layers with dimensions [latent_dim, dim_hids[0], …, dim_hids[-1], 2]
- Activation: ReLU after every hidden layer (except the final output layer)
- Output: 2D point of shape (batch_size, 2) with a Tanh activation on the last layer

Hint: Use nn.Sequential or nn.ModuleList to stack your layers. (A combined sketch for 1-1 and 1-2 follows 1-2 below.)
1-2: Build the Discriminator Network
Implement the Discriminator class in network.py. The Discriminator takes a 2D point and outputs a real/fake probability:
- Input: a 2D point of shape (batch_size, 2)
- Architecture: fully-connected layers with dimensions [2, dim_hids[0], …, dim_hids[-1], 1]
- Activation: LeakyReLU (negative slope = 0.2) after every hidden layer
- Output: a scalar of shape (batch_size, 1) with a Sigmoid activation to produce a probability in [0, 1]
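A combined sketch covering both 1-1 and 1-2. The dim_hids values here are placeholders, not the starter code's defaults:

```python
import torch.nn as nn

def mlp(dims, hidden_act, out_act):
    # Linear layers with hidden_act after every hidden layer and out_act last.
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(hidden_act())
    layers.append(out_act())
    return nn.Sequential(*layers)

# dim_hids = [128, 128] is an assumed placeholder; use the starter code's values.
G = mlp([16, 128, 128, 2], nn.ReLU, nn.Tanh)                      # z -> 2D point
D = mlp([2, 128, 128, 1], lambda: nn.LeakyReLU(0.2), nn.Sigmoid)  # point -> probability
```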
1-3: Implement the GAN Training Step
In gan.py, implement the train_step() function which performs one full update of both G and D:
1. Discriminator update:
   - Sample a real batch x from the dataset
   - Sample z ~ N(0, I) and generate fake samples: x_fake = G(z)
   - Compute the discriminator BCE loss: L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
   - Zero grad on the D optimizer, backpropagate, and update D only
2. Generator update:
   - Sample a new batch of z ~ N(0, I)
   - Compute the non-saturating generator loss: L_G = −E[log D(G(z))]
   - Zero grad on the G optimizer, backpropagate, and update G only

Important: Always call .detach() on x_fake before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
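Putting the two updates together, a sketch of train_step (the optimizer arguments and the exact signature are assumptions; match the starter code):

```python
import torch
import torch.nn.functional as F

def train_step(G, D, g_opt, d_opt, x_real, latent_dim, device):
    b = x_real.shape[0]
    ones = torch.ones(b, 1, device=device)    # "real" labels
    zeros = torch.zeros(b, 1, device=device)  # "fake" labels

    # --- Discriminator update: real -> 1, fake -> 0 ---
    x_fake = G(torch.randn(b, latent_dim, device=device))
    d_loss = (F.binary_cross_entropy(D(x_real), ones) +
              F.binary_cross_entropy(D(x_fake.detach()), zeros))  # note .detach()
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator update: non-saturating loss, -E[log D(G(z))] ---
    z = torch.randn(b, latent_dim, device=device)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```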
1-4: Implement the Sampling Function
In gan.py, implement sample(G, n_samples, latent_dim, device):
- Sample n_samples noise vectors z from N(0, I) with shape (n_samples, latent_dim)
- Pass them through G to get generated 2D points
- Return the result as a NumPy array of shape (n_samples, 2)
- Use torch.no_grad() to disable gradient tracking during inference
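A sketch under the signature given above:

```python
import torch

def sample(G, n_samples, latent_dim, device):
    G.eval()
    with torch.no_grad():                                      # no gradient tracking
        z = torch.randn(n_samples, latent_dim, device=device)  # z ~ N(0, I)
        points = G(z)                                          # (n_samples, 2)
    return points.cpu().numpy()
```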
1-5: Training and Evaluation
Run gan_tutorial.ipynb. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.
Include in your report:
- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?
Task 2: Deep Convolutional GAN (DCGAN) on MNIST
Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.
TODO
2-1: Implement the DCGAN Generator
Implement DCGenerator in task_2_dcgan/network.py using transposed convolutions to upsample from noise to a full image:
- Input: noise vector z of shape (batch_size, latent_dim, 1, 1), where latent_dim = 100
- Use ConvTranspose2d layers to upsample progressively to (1, 28, 28)
- Channel sequence: latent_dim → 256 → 128 → 64 → 1
- Apply BatchNorm2d + ReLU after every ConvTranspose2d except the last
- Apply Tanh to the final output

Tip: ConvTranspose2d(kernel_size=4, stride=2, padding=1) doubles the spatial resolution. Use kernel_size=4, stride=1, padding=0 for the first layer to go from 1×1 to 4×4.
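A sketch following the channel sequence above. The intermediate kernel/stride choices are one way to land exactly on 28×28 (an assumption; resizing MNIST to 32×32 and using pure doubling is an equally common setup, so follow the notebook's convention):

```python
import torch
import torch.nn as nn

class DCGenerator(nn.Module):
    def __init__(self, latent_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0),  # 1x1  -> 4x4
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 3, 2, 1),         # 4x4  -> 7x7
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1),          # 7x7  -> 14x14
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 1, 4, 2, 1),            # 14x14 -> 28x28
            nn.Tanh(),                                     # outputs in [-1, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # z: (batch, latent_dim, 1, 1)
```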
2-2: Implement the DCGAN Discriminator
Implement DCDiscriminator in task_2_dcgan/network.py using strided convolutions to downsample the input image:
- Input: grayscale image of shape (batch_size, 1, 28, 28)
- Use Conv2d layers to downsample to a single scalar output
- Channel sequence: 1 → 64 → 128 → 256 → 1
- Apply BatchNorm2d + LeakyReLU (slope 0.2) after every Conv2d except the first and last
- Apply Sigmoid to the final output

Important: Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
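A mirror-image sketch of the discriminator (again, the kernel/stride choices are assumptions chosen to downsample 28×28 to a single scalar):

```python
import torch
import torch.nn as nn

class DCDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1),     # 28x28 -> 14x14 (no BatchNorm on first layer)
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1),   # 14x14 -> 7x7
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1),  # 7x7 -> 3x3
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 3, 1, 0),    # 3x3 -> 1x1 (no BatchNorm on last layer)
            nn.Sigmoid(),                  # probability in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).view(-1, 1)     # (batch, 1)
```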
2-3: Implement the DCGAN Training Loop
In task_2_dcgan/dcgan.py, implement train_one_epoch() which iterates over the full MNIST training set for one epoch. For each mini-batch:
1. Discriminator update:
   - BCE loss on real images (label = 1) → L_D_real
   - BCE loss on fake images G(z) (label = 0) → L_D_fake
   - L_D = L_D_real + L_D_fake, then zero_grad, backward, step on the D optimizer
2. Generator update:
   - Generate new fake images and compute L_G = BCE(D(G(z)), 1)
   - zero_grad, backward, step on the G optimizer
2-4: Weight Initialization
Implement weights_init() in task_2_dcgan/network.py and apply it via model.apply(weights_init):
- Conv2d and ConvTranspose2d: initialize weights ~ N(0, 0.02)
- BatchNorm2d: initialize weights ~ N(1.0, 0.02), bias = 0
- All other layer types: leave unchanged

Hint: Use isinstance(m, nn.Conv2d) to check layer types, and torch.nn.init.normal_() for weight initialization.
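A sketch following these rules directly:

```python
import torch.nn as nn

def weights_init(m):
    # Applied recursively to every submodule via model.apply(weights_init).
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)
    # All other layer types keep their default initialization.
```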
2-5: Training and Evaluation
Run dcgan_tutorial.ipynb. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.
Include in your report:
- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability
Combined Submission Instructions
Both questions, one zip file, one PDF report
What to Submit
You will submit everything for both Question 1 (DDPM) and Question 2 (GAN) in a single zip file. There is no separate submission per question.
Zip File Structure
Your zip file must follow this exact folder layout:
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                   <-- Your implementation
│   │   └── ddpm.py                      <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py            <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                   <-- Your implementation
│   │   └── gan.py                       <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                   <-- Your implementation
│       └── dcgan.py                     <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf              <-- Combined report
Combined PDF Report
Write one single PDF report named {NAME}_{STUDENT_ID}.pdf that covers both questions. The report must not exceed 5 pages (excluding references). It should contain the following sections in order:
Section 1 β DDPM (Question 1):
- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis
Section 2 β GAN (Question 2):
- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis
Do NOT include in your zip:
- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (.pth, .ckpt files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)

Naming Convention

| Item | Format |
|---|---|
| Zip file | {NAME}_{STUDENT_ID}.zip, e.g. JOHN_DOE_2024001.zip |
| PDF report | {NAME}_{STUDENT_ID}.pdf, e.g. JOHN_DOE_2024001.pdf |
Academic Integrity
You may consult the following reference papers while working on this assignment:
- Ho et al. (2020). Denoising Diffusion Probabilistic Models.
- Zhang et al. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).
- Goodfellow et al. (2014). Generative Adversarial Networks.
- Radford et al. (2015). Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).
It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.