# GEN AI – Programming Assignment
## Generative Models
---
## Submission Policy
Both questions must be submitted together in a **SINGLE zip file** named:
```
{NAME}_{STUDENT_ID}.zip
```
The zip file must contain all code folders for both questions and one combined PDF report.
**Do NOT include** datasets, model checkpoints, or large binary files.
---
# Question 1 • 25 Marks
## Denoising Diffusion Probabilistic Models (DDPM)
> Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.
### Environment Setup
Create a conda environment named `ddpm` and install PyTorch:
```bash
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
### Code Structure
```
ddpm_assignment/
├── 2d_plot_diffusion_todo/              (Task 1)
│   ├── ddpm_tutorial.ipynb              <-- Main notebook
│   ├── dataset.py                       <-- Swiss-roll, moon, gaussians
│   ├── network.py                       <-- (TODO) Noise prediction network
│   └── ddpm.py                          <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/                   (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py     <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py        <-- Basic UNet components
│   │   ├── controlnet.py                <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py       <-- Diffusion pipeline with ControlNet
│   ├── train.py                         <-- Training code
│   ├── train.sh                         <-- Hyperparameter script
│   └── inference.ipynb                  <-- Inference notebook
└── requirements.txt
```
### Background
Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.
A typical DDPM pipeline consists of three components:
- **Forward Process**: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
- **Reverse Process**: A learned neural network iteratively denoises x_T back to x₀, step by step
- **Training Objective**: The network is trained using a simplified noise-matching loss that predicts the noise ε added at each step
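For intuition, here is a minimal sketch of a pre-computed linear noise schedule (as in Ho et al., 2020). The starter code ships its own schedule, so the variable names below are illustrative only, not the assignment API.
```python
import torch

# Illustrative linear beta schedule (Ho et al., 2020). The starter code
# provides its own pre-computed schedule; these names are examples only.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # beta_t
alphas = 1.0 - betas                        # alpha_t
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product: alpha-bar_t
```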
---
## Task 1: Simple DDPM Pipeline with Swiss-Roll
In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.
After completing your implementation, train the model and evaluate it by running `ddpm_tutorial.ipynb` in the `2d_plot_diffusion_todo` directory.
### TODO
#### 1-1: Build a Noise Prediction Network
Implement the noise prediction network in `network.py`. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of `TimeLinear` layers with feature dimensions:
```
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
```
- Every `TimeLinear` layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation; it directly outputs the predicted noise
> **Hint**
> `TimeLinear` is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
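A minimal sketch of what such a conditioned layer could look like is given below; the actual `TimeLinear` signature and embedding handling in the starter code may differ.
```python
import torch
import torch.nn as nn

class TimeLinear(nn.Module):
    """Linear layer conditioned on a timestep embedding (illustrative sketch)."""
    def __init__(self, dim_in: int, dim_out: int, dim_time: int):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.time_proj = nn.Linear(dim_time, dim_out)  # project the sinusoidal embedding

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # Add the projected timestep embedding to the hidden features;
        # the ReLU activation is applied outside, per the spec above.
        return self.fc(x) + self.time_proj(t_emb)
```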
#### 1-2: Construct the Forward and Reverse Process
In `ddpm.py`, implement the three core functions of the DDPM pipeline:
- **`q_sample(x_0, t, noise)`**: The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula:
```
x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε,   where ε ~ N(0, I)
```
- **`p_sample(x_t, t)`**: One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t-1}
- **`p_sample_loop(shape)`**: Full reverse process. Starting from x_T ~ N(0, I), iterate `p_sample()` from t=T down to t=1 and return the final sample x₀
> **Important**
> Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
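As a reference point, a hedged sketch of the forward process is shown below, written as a standalone function; in the starter code `q_sample` is a method and the schedule tensor will have its own name.
```python
import torch

def q_sample(x_0, t, alpha_bars, noise=None):
    """Forward process sketch: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x_0)
    # alpha_bars holds the pre-computed cumulative products; index the
    # per-sample timesteps and broadcast over the 2D feature dimension.
    a_bar_t = alpha_bars[t].view(-1, 1)
    return a_bar_t.sqrt() * x_0 + (1.0 - a_bar_t).sqrt() * noise
```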
#### 1-3: Implement the Training Objective
In `ddpm.py`, implement `compute_loss()`. This function should:
1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
2. Sample noise ε ~ N(0, I) of the same shape as the input x₀
3. Compute the noised sample x_t using `q_sample()`
4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
5. Return the simplified noise-matching loss: **L = ||ε − ε̂||²**
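A minimal sketch of these five steps, assuming the `q_sample` helper sketched above and a 0-indexed schedule of length T (the starter code's signatures may differ):
```python
import torch
import torch.nn.functional as F

def compute_loss(model, x_0, alpha_bars, T):
    """Simplified noise-matching loss (illustrative sketch)."""
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)  # step 1 (0-indexed here)
    noise = torch.randn_like(x_0)                                # step 2
    x_t = q_sample(x_0, t, alpha_bars, noise)                    # step 3
    noise_pred = model(x_t, t)                                   # step 4
    return F.mse_loss(noise_pred, noise)                         # step 5: ||eps - eps_hat||^2
```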
#### 1-4: Training and Evaluation
Once your implementation is complete, open and run `ddpm_tutorial.ipynb` via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.
**Include in your report:**
- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution
---
## Task 2: ControlNet on Fill50K Dataset
In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching trainable copies of its encoder blocks, connected through zero-convolution layers.
### Prerequisites: Hugging Face Setup
Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:
- Sign into Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:
```bash
$ huggingface-cli login
```
Install the ControlNet environment:
```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
Verify your setup by generating a test image with Stable Diffusion:
```python
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
```
### TODO
#### Task 0: Generate Baseline Images
Using the 5 text prompts in `./task_1_controlnet/data/test_prompts.json`, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.
#### 2-1: Implement Zero-Convolution
In `diffusion/controlnet.py` (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.
> **Hint**
> Use `nn.Conv2d(channels, channels, kernel_size=1)` and explicitly set `weight.data` and `bias.data` to zero after initialization.
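A minimal sketch of the hint above (the exact channel arguments depend on where the layer is inserted in the starter code):
```python
import torch.nn as nn

def make_zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)   # equivalent to conv.weight.data.zero_()
    nn.init.zeros_(conv.bias)
    return conv
```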
#### 2-2: Initialize ControlNet from Pretrained UNet
In `diffusion/controlnet.py` (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
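One common way to do this is to copy state dicts module by module, as sketched below. The attribute names (`conv_in`, `time_embedding`, `down_blocks`, `mid_block`) follow the diffusers-style UNet and are assumptions here; adapt them to the actual classes in the starter code.
```python
def init_controlnet_from_unet(controlnet, unet):
    """Copy pretrained UNet encoder weights into ControlNet (illustrative sketch)."""
    controlnet.conv_in.load_state_dict(unet.conv_in.state_dict())
    controlnet.time_embedding.load_state_dict(unet.time_embedding.state_dict())
    controlnet.down_blocks.load_state_dict(unet.down_blocks.state_dict())
    controlnet.mid_block.load_state_dict(unet.mid_block.state_dict())
```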
#### 2-3: Apply Zero-Convolution to Residual Features
In `diffusion/controlnet.py` (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:
```
h_out = ZeroConv(h)
```
#### 2-4: Integrate ControlNet Outputs into UNet
In `diffusion/unets/unet_2d_condition.py` (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:
```
decoder_input = decoder_input + controlnet_residual
```
> **Important**
> Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.
#### 2-5: Train and Evaluate
Train ControlNet on the Fill50K dataset (automatically downloaded by the `load_dataset()` function in `train.py`) by running:
```bash
$ sh train.sh
```
Then, run `inference.ipynb` to generate images conditioned on 5 different edge maps from `./data/test_conditions`, using the text prompts in `data/test_prompts.json`.
**Include in your report:**
- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?
---
# Question 2 • 25 Marks
## Generative Adversarial Networks (GAN)
> Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.
### Environment Setup
Create a conda environment named `gan_assignment` and install the required packages:
```bash
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
The `requirements.txt` includes: `numpy`, `matplotlib`, `scipy`, `tqdm`, and `jupyter`.
### Code Structure
```
gan_assignment/
├── task_1_vanilla_gan/           (Task 1)
│   ├── gan_tutorial.ipynb        <-- Main notebook
│   ├── dataset.py                <-- 2D toy dataset definitions
│   ├── network.py                <-- (TODO) Generator & Discriminator
│   └── gan.py                    <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/                 (Task 2)
│   ├── dcgan_tutorial.ipynb      <-- Main notebook
│   ├── network.py                <-- (TODO) DCGAN architecture
│   └── dcgan.py                  <-- (TODO) DCGAN training loop
└── requirements.txt
```
---
## Task 1: Vanilla GAN on 2D Swiss-Roll Data
Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.
### TODO
#### 1-1: Build the Generator Network
Implement the `Generator` class in `network.py`. The Generator maps a noise vector z to a 2D output point:
- **Input**: noise vector z of shape `(batch_size, latent_dim)`, with `latent_dim = 16` by default
- **Architecture**: fully-connected layers with dimensions `[latent_dim, dim_hids[0], …, dim_hids[-1], 2]`
- **Activation**: ReLU after every hidden layer (except the final output layer)
- **Output**: 2D point of shape `(batch_size, 2)` with a Tanh activation on the last layer
> **Hint**
> Use `nn.Sequential` or `nn.ModuleList` to stack your layers.
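A minimal sketch of such a generator (the hidden dimensions are illustrative defaults, not the values in the starter code):
```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Fully-connected generator: z -> 2D point (illustrative sketch)."""
    def __init__(self, latent_dim=16, dim_hids=(128, 128)):
        super().__init__()
        dims = [latent_dim, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]   # hidden layers: ReLU
        layers += [nn.Linear(dims[-1], 2), nn.Tanh()]       # output layer: Tanh
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)
```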
#### 1-2: Build the Discriminator Network
Implement the `Discriminator` class in `network.py`. The Discriminator takes a 2D point and outputs a real/fake probability:
- **Input**: a 2D point of shape `(batch_size, 2)`
- **Architecture**: fully-connected layers with dimensions `[2, dim_hids[0], …, dim_hids[-1], 1]`
- **Activation**: LeakyReLU (negative slope = 0.2) after every hidden layer
- **Output**: a scalar of shape `(batch_size, 1)` with a Sigmoid activation to produce a probability in [0, 1]
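A matching sketch of the discriminator, mirroring the generator sketch above:
```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Fully-connected discriminator: 2D point -> probability (illustrative sketch)."""
    def __init__(self, dim_hids=(128, 128)):
        super().__init__()
        dims = [2, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.2)]  # hidden layers: LeakyReLU
        layers += [nn.Linear(dims[-1], 1), nn.Sigmoid()]           # output: probability in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```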
#### 1-3: Implement the GAN Training Step
In `gan.py`, implement the `train_step()` function which performs one full update of both G and D:
**1. Discriminator update:**
- Sample a real batch x from the dataset
- Sample z ~ N(0, I) and generate fake samples: `x_fake = G(z)`
- Compute the discriminator BCE loss:
```
L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
```
- Zero grad on D optimizer, backpropagate, and update D only
**2. Generator update:**
- Sample a new batch of z ~ N(0, I)
- Compute the non-saturating generator loss:
```
L_G = −E[log D(G(z))]
```
- Zero grad on G optimizer, backpropagate, and update G only
> **Important**
> Always call `.detach()` on `x_fake` before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
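A hedged sketch of one possible `train_step()` following the two updates above (the argument names are illustrative; match them to the starter code):
```python
import torch
import torch.nn.functional as F

def train_step(G, D, x_real, opt_G, opt_D, latent_dim, device):
    """One discriminator update followed by one generator update (sketch)."""
    batch = x_real.shape[0]
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # 1. Discriminator update: -E[log D(x_real)] - E[log(1 - D(x_fake))]
    z = torch.randn(batch, latent_dim, device=device)
    x_fake = G(z)
    loss_D = F.binary_cross_entropy(D(x_real), ones) \
           + F.binary_cross_entropy(D(x_fake.detach()), zeros)  # detach: no grad into G
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # 2. Generator update (non-saturating): -E[log D(G(z))]
    z = torch.randn(batch, latent_dim, device=device)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```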
#### 1-4: Implement the Sampling Function
In `gan.py`, implement `sample(G, n_samples, latent_dim, device)`:
- Sample `n_samples` noise vectors z from N(0, I) with shape `(n_samples, latent_dim)`
- Pass through G to get generated 2D points
- Return as a NumPy array of shape `(n_samples, 2)`
- Use `torch.no_grad()` to disable gradient tracking during inference
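A minimal sketch satisfying these requirements:
```python
import torch

@torch.no_grad()
def sample(G, n_samples, latent_dim, device):
    """Generate n_samples 2D points and return them as a NumPy array (sketch)."""
    z = torch.randn(n_samples, latent_dim, device=device)
    return G(z).cpu().numpy()   # shape: (n_samples, 2)
```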
#### 1-5: Training and Evaluation
Run `gan_tutorial.ipynb`. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.
**Include in your report:**
- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?
---
## Task 2: Deep Convolutional GAN (DCGAN) on MNIST
Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.
### TODO
#### 2-1: Implement the DCGAN Generator
Implement `DCGenerator` in `task_2_dcgan/network.py` using transposed convolutions to upsample from noise to a full image:
- **Input**: noise vector z of shape `(batch_size, latent_dim, 1, 1)`, where `latent_dim = 100`
- Use `ConvTranspose2d` layers to upsample progressively to `(1, 28, 28)`
- **Channel sequence**: `latent_dim → 256 → 128 → 64 → 1`
- Apply `BatchNorm2d + ReLU` after every `ConvTranspose2d` except the last
- Apply Tanh to the final output
> **Tip**
> `ConvTranspose2d(kernel_size=4, stride=2, padding=1)` doubles spatial resolution. Use `kernel_size=4, stride=1, padding=0` for the first layer to go from 1×1 to 4×4.
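Below is one possible layer configuration that reaches 28×28 exactly with the stated channel sequence; it uses a 7×7 first layer instead of the 4×4 one in the tip, so treat it as a sketch and follow whatever the starter code expects (some setups instead resize MNIST to 32×32 and keep the 4×4 first layer).
```python
import torch
import torch.nn as nn

class DCGenerator(nn.Module):
    """Transposed-conv generator, z (B, 100, 1, 1) -> image (B, 1, 28, 28). Sketch."""
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 7, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 1x1   -> 7x7
            nn.ConvTranspose2d(256, 128, 4, 2, 1),        nn.BatchNorm2d(128), nn.ReLU(),  # 7x7   -> 14x14
            nn.ConvTranspose2d(128, 64, 4, 2, 1),         nn.BatchNorm2d(64),  nn.ReLU(),  # 14x14 -> 28x28
            nn.ConvTranspose2d(64, 1, 3, 1, 1),           nn.Tanh(),                       # 28x28 -> 28x28
        )

    def forward(self, z):
        return self.net(z)
```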
#### 2-2: Implement the DCGAN Discriminator
Implement `DCDiscriminator` in `task_2_dcgan/network.py` using strided convolutions to downsample the input image:
- **Input**: grayscale image of shape `(batch_size, 1, 28, 28)`
- Use `Conv2d` layers to downsample to a single scalar output
- **Channel sequence**: `1 → 64 → 128 → 256 → 1`
- Apply `BatchNorm2d + LeakyReLU` (slope 0.2) after every `Conv2d` except the first and last
- Apply Sigmoid to the final output
> **Important**
> Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
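A matching discriminator sketch with strided convolutions; the kernel sizes are chosen so a 28×28 input collapses to a single scalar, and the starter code may use different ones.
```python
import torch.nn as nn

class DCDiscriminator(nn.Module):
    """Strided-conv discriminator, image (B, 1, 28, 28) -> probability. Sketch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1),    nn.LeakyReLU(0.2),                        # 28 -> 14 (no BatchNorm)
            nn.Conv2d(64, 128, 4, 2, 1),  nn.BatchNorm2d(128), nn.LeakyReLU(0.2),   # 14 -> 7
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),   # 7 -> 3
            nn.Conv2d(256, 1, 3, 1, 0),   nn.Sigmoid(),                             # 3 -> 1 (no BatchNorm)
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)
```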
#### 2-3: Implement the DCGAN Training Loop
In `task_2_dcgan/dcgan.py`, implement `train_one_epoch()` which iterates over the full MNIST training set for one epoch. For each mini-batch:
**1. Discriminator update:**
- BCE loss on real images (label = 1) → `L_D_real`
- BCE loss on fake images G(z) (label = 0) → `L_D_fake`
- `L_D = L_D_real + L_D_fake` → `zero_grad`, `backward`, `step` D optimizer
**2. Generator update:**
- Generate new fake images and compute: `L_G = BCE(D(G(z)), 1)`
- `zero_grad`, `backward`, `step` G optimizer
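A sketch of the per-batch logic inside one epoch, using label-based BCE as described above (the loader, optimizers, and argument names are placeholders):
```python
import torch
import torch.nn as nn

def train_one_epoch(G, D, loader, opt_G, opt_D, latent_dim, device):
    """One pass over the MNIST loader (illustrative sketch)."""
    bce = nn.BCELoss()
    for x_real, _ in loader:
        x_real = x_real.to(device)
        b = x_real.shape[0]
        real_lbl = torch.ones(b, 1, device=device)
        fake_lbl = torch.zeros(b, 1, device=device)

        # 1. Discriminator update
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        x_fake = G(z)
        loss_D = bce(D(x_real), real_lbl) + bce(D(x_fake.detach()), fake_lbl)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # 2. Generator update
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_G = bce(D(G(z)), real_lbl)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```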
#### 2-4: Weight Initialization
Implement `weights_init()` in `task_2_dcgan/network.py` and apply it via `model.apply(weights_init)`:
- `Conv2d` and `ConvTranspose2d`: initialize weights ~ N(0, 0.02)
- `BatchNorm2d`: initialize weights ~ N(1.0, 0.02), bias = 0
- All other layer types: leave unchanged
> **Hint**
> Use `isinstance(m, nn.Conv2d)` to check layer types. Use `torch.nn.init.normal_()` for weight initialization.
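A minimal sketch of the initializer described above:
```python
import torch.nn as nn

def weights_init(m):
    """DCGAN-style weight initialization, applied via model.apply(weights_init)."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)
    # All other layer types are left unchanged.
```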
#### 2-5: Training and Evaluation
Run `dcgan_tutorial.ipynb`. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.
**Include in your report:**
- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability
---
# Combined Submission Instructions
> **Both questions – one zip file – one PDF report**
## What to Submit
You will submit everything – both Question 1 (DDPM) and Question 2 (GAN) – in a single zip file. There is no separate submission per question.
### Zip File Structure
Your zip file must follow this exact folder layout:
```
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                        <-- Your implementation
│   │   └── ddpm.py                           <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py                 <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py      <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                        <-- Your implementation
│   │   └── gan.py                            <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                        <-- Your implementation
│       └── dcgan.py                          <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf                   <-- Combined report
```
### Combined PDF Report
Write one single PDF report named `{NAME}_{STUDENT_ID}.pdf` that covers both questions. The report **must not exceed 5 pages** (excluding references). It should contain the following sections in order:
**Section 1 – DDPM (Question 1):**
- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis
**Section 2 – GAN (Question 2):**
- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis
### Naming Convention
| Item       | Format                                                   |
|------------|----------------------------------------------------------|
| Zip file   | `{NAME}_{STUDENT_ID}.zip` – e.g. `JOHN_DOE_2024001.zip`  |
| PDF report | `{NAME}_{STUDENT_ID}.pdf` – e.g. `JOHN_DOE_2024001.pdf`  |

**Do NOT include in your zip:**
- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (`.pth`, `.ckpt` files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)
---
## Academic Integrity
You may consult the following reference papers while working on this assignment:
- Ho et al. (2020). *Denoising Diffusion Probabilistic Models.*
- Zhang et al. (2023). *Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).*
- Goodfellow et al. (2014). *Generative Adversarial Networks.*
- Radford et al. (2015). *Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).*
> It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.