# GEN AI – Programming Assignment
## Generative Models
---
## Submission Policy
Both questions must be submitted together in a **SINGLE zip file** named:
```
{NAME}_{STUDENT_ID}.zip
```
The zip file must contain all code folders for both questions and one combined PDF report.
**Do NOT include** datasets, model checkpoints, or large binary files.
---
# Question 1 • 25 Marks
## Denoising Diffusion Probabilistic Models (DDPM)
> Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.
### Environment Setup
Create a conda environment named `ddpm` and install PyTorch:
```bash
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
### Code Structure
```
ddpm_assignment/
├── 2d_plot_diffusion_todo/              (Task 1)
│   ├── ddpm_tutorial.ipynb              <-- Main notebook
│   ├── dataset.py                       <-- Swiss-roll, moon, gaussians
│   ├── network.py                       <-- (TODO) Noise prediction network
│   └── ddpm.py                          <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/                   (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py     <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py        <-- Basic UNet components
│   │   ├── controlnet.py                <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py       <-- Diffusion pipeline with ControlNet
│   ├── train.py                         <-- Training code
│   ├── train.sh                         <-- Hyperparameter script
│   └── inference.ipynb                  <-- Inference notebook
└── requirements.txt
```
### Background
Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.
A typical DDPM pipeline consists of three components:
- **Forward Process**: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x_0 → x_1 → … → x_T
- **Reverse Process**: A learned neural network iteratively denoises x_T back to x_0, step by step
- **Training Objective**: The network is trained using a simplified noise-matching loss: predicting the noise ε added at each step
---
## Task 1: Simple DDPM Pipeline with Swiss-Roll
In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.
After completing your implementation, train the model and evaluate it by running `ddpm_tutorial.ipynb` in the `2d_plot_diffusion_todo` directory.
### TODO
#### 1-1: Build a Noise Prediction Network
Implement the noise prediction network in `network.py`. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of `TimeLinear` layers with feature dimensions:
```
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
```
- Every `TimeLinear` layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation; it directly outputs the predicted noise
> **Hint**
> `TimeLinear` is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
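For orientation, a minimal sketch of such a network is shown below. The `TimeLinear` stand-in and the constructor signature here are illustrative assumptions; follow the definitions in the starter `network.py`.

```python
import torch
import torch.nn as nn

class TimeLinear(nn.Module):
    # Illustrative stand-in for the starter code's TimeLinear: a linear
    # layer whose output is shifted by a projection of the timestep
    # embedding before the activation is applied.
    def __init__(self, dim_in, dim_out, dim_time_emb):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.time_proj = nn.Linear(dim_time_emb, dim_out)

    def forward(self, x, t_emb):
        return self.fc(x) + self.time_proj(t_emb)

class NoiseNet(nn.Module):
    # Stacks TimeLinear layers [dim_in, *dim_hids, dim_out] with ReLU
    # between layers and no activation on the output layer.
    def __init__(self, dim_in, dim_out, dim_hids, dim_time_emb=64):
        super().__init__()
        dims = [dim_in, *dim_hids, dim_out]
        self.layers = nn.ModuleList(
            [TimeLinear(i, o, dim_time_emb) for i, o in zip(dims[:-1], dims[1:])]
        )

    def forward(self, x, t_emb):
        for k, layer in enumerate(self.layers):
            x = layer(x, t_emb)
            if k < len(self.layers) - 1:
                x = torch.relu(x)
        return x
```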
#### 1-2: Construct the Forward and Reverse Process
In `ddpm.py`, implement the three core functions of the DDPM pipeline:
- **`q_sample(x_0, t, noise)`**: The forward process. Given a clean sample x_0 and timestep t, return the noised sample x_t using the closed-form formula:
```
x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε,   where ε ~ N(0, I)
```
- **`p_sample(x_t, t)`**: One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t-1}
- **`p_sample_loop(shape)`**: Full reverse process. Starting from x_T ~ N(0, I), iterate `p_sample()` from t=T down to t=1 and return the final sample x_0
> **Important**
> Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
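As a reference for the math, here is a hedged sketch of `q_sample` and `p_sample`, assuming the schedule tensors (`betas`, `alphas`, `alphas_cumprod`) are the precomputed 1D buffers from the starter code and that the network is called as `model(x_t, t)`; adapt the names and signatures to the actual pipeline.

```python
import torch

def q_sample(x0, t, noise, alphas_cumprod):
    # Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    # t is a (batch,) LongTensor; gather per-sample abar_t and broadcast.
    abar_t = alphas_cumprod[t].unsqueeze(-1)              # (batch, 1)
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise

@torch.no_grad()
def p_sample(model, x_t, t, betas, alphas, alphas_cumprod):
    # One reverse step: posterior mean from the predicted noise, plus
    # sigma_t * z with sigma_t^2 = beta_t (no noise added at the last step).
    eps_hat = model(x_t, t)
    alpha_t = alphas[t].unsqueeze(-1)
    abar_t = alphas_cumprod[t].unsqueeze(-1)
    mean = (x_t - (1.0 - alpha_t) / (1.0 - abar_t).sqrt() * eps_hat) / alpha_t.sqrt()
    if int(t[0]) == 0:   # assumes the whole batch shares one timestep in the loop
        return mean
    return mean + betas[t].unsqueeze(-1).sqrt() * torch.randn_like(x_t)
```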
#### 1-3: Implement the Training Objective
In `ddpm.py`, implement `compute_loss()`. This function should:
1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
2. Sample noise ε ~ N(0, I) of the same shape as the input x_0
3. Compute the noised sample x_t using `q_sample()`
4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
5. Return the simplified noise-matching loss: **L = ||ε − ε̂||²**
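A sketch of the loss, reusing the `q_sample` sketched above (schedules here are 0-indexed, so `t ∈ {0, …, T−1}` corresponds to the handout's {1, …, T}):

```python
import torch
import torch.nn.functional as F

def compute_loss(model, x0, T, alphas_cumprod):
    # Simplified DDPM objective: L = E ||eps - eps_hat||^2.
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # uniform timesteps
    noise = torch.randn_like(x0)                               # eps ~ N(0, I)
    x_t = q_sample(x0, t, noise, alphas_cumprod)               # forward process
    eps_hat = model(x_t, t)                                    # predicted noise
    return F.mse_loss(eps_hat, noise)
```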
#### 1-4: Training and Evaluation
Once your implementation is complete, open and run `ddpm_tutorial.ipynb` via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.
**Include in your report:**
- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution
---
## Task 2: ControlNet on Fill50K Dataset
In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching trainable copies of its encoder blocks, connected through zero-convolution layers.
### Prerequisites: Hugging Face Setup
Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:
- Sign into Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:
```bash
$ huggingface-cli login
```
Install the ControlNet environment (it uses PyTorch 2.1 with CUDA 12.1, unlike the `ddpm` environment from Task 1, so creating a separate conda environment is advisable):
```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
Verify your setup by generating a test image with Stable Diffusion:
```python
import torch
from diffusers import StableDiffusionPipeline
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
```
### TODO
#### Task 0: Generate Baseline Images
Using the 5 text prompts in `./task_1_controlnet/data/test_prompts.json`, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.
#### 2-1: Implement Zero-Convolution
In `diffusion/controlnet.py` (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.
> **Hint**
> Use `nn.Conv2d(channels, channels, kernel_size=1)` and explicitly set `weight.data` and `bias.data` to zero after initialization.
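A minimal sketch of the idea (the starter code may wrap this in its own helper):

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weights and bias start at zero, so the
    # ControlNet branch initially contributes nothing to the UNet output.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```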
#### 2-2: Initialize ControlNet from Pretrained UNet
In `diffusion/controlnet.py` (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
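If the ControlNet module reuses the UNet's attribute names, the copy can go through the state dict; this is only an illustration, and the exact module layout in the starter code may differ:

```python
import torch.nn as nn

def init_from_unet(controlnet: nn.Module, unet: nn.Module):
    # Copy every parameter whose name and shape match the pretrained UNet;
    # strict=False skips ControlNet-only modules (zero-convolutions,
    # condition embedding), which keep their own initialization.
    result = controlnet.load_state_dict(unet.state_dict(), strict=False)
    return result.missing_keys, result.unexpected_keys
```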
#### 2-3: Apply Zero-Convolution to Residual Features
In `diffusion/controlnet.py` (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:
```
h_out = ZeroConv(h)
```
#### 2-4: Integrate ControlNet Outputs into UNet
In `diffusion/unets/unet_2d_condition.py` (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:
```
decoder_input = decoder_input + controlnet_residual
```
> **Important**
> Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.
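Schematically, the merge inside the UNet forward pass could look like the following sketch (variable names are illustrative, not the starter code's):

```python
def merge_controlnet_residuals(down_block_res_samples, controlnet_down_residuals,
                               mid_sample, controlnet_mid_residual):
    # Element-wise addition of ControlNet residuals to the matching UNet
    # skip connections, and the same for the mid-block output.
    merged_skips = [
        skip + res
        for skip, res in zip(down_block_res_samples, controlnet_down_residuals)
    ]
    merged_mid = mid_sample + controlnet_mid_residual
    return merged_skips, merged_mid
```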
#### 2-5: Train and Evaluate
Train ControlNet on the Fill50K dataset (automatically downloaded by the `load_dataset()` function in `train.py`) by running:
```bash
$ sh train.sh
```
Then, run `inference.ipynb` to generate images conditioned on 5 different edge maps from `./data/test_conditions`, using the text prompts in `data/test_prompts.json`.
**Include in your report:**
- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?
---
# Question 2 • 25 Marks
## Generative Adversarial Networks (GAN)
> Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.
### Environment Setup
Create a conda environment named `gan_assignment` and install the required packages:
```bash
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
The `requirements.txt` includes: `numpy`, `matplotlib`, `scipy`, `tqdm`, and `jupyter`.
### Code Structure
```
gan_assignment/
├── task_1_vanilla_gan/          (Task 1)
│   ├── gan_tutorial.ipynb       <-- Main notebook
│   ├── dataset.py               <-- 2D toy dataset definitions
│   ├── network.py               <-- (TODO) Generator & Discriminator
│   └── gan.py                   <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/                (Task 2)
│   ├── dcgan_tutorial.ipynb     <-- Main notebook
│   ├── network.py               <-- (TODO) DCGAN architecture
│   └── dcgan.py                 <-- (TODO) DCGAN training loop
└── requirements.txt
```
---
## Task 1: Vanilla GAN on 2D Swiss-Roll Data
Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.
### TODO
#### 1-1: Build the Generator Network
Implement the `Generator` class in `network.py`. The Generator maps a noise vector z to a 2D output point:
- **Input**: noise vector z of shape `(batch_size, latent_dim)`, with `latent_dim = 16` by default
- **Architecture**: fully-connected layers with dimensions `[latent_dim, dim_hids[0], …, dim_hids[-1], 2]`
- **Activation**: ReLU after every hidden layer (except the final output layer)
- **Output**: 2D point of shape `(batch_size, 2)` with a Tanh activation on the last layer
> **Hint**
> Use `nn.Sequential` or `nn.ModuleList` to stack your layers.
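One possible layout, assuming illustrative hidden sizes (the starter code fixes the actual constructor signature):

```python
import torch.nn as nn

class Generator(nn.Module):
    # Maps z (batch, latent_dim) to 2D points in [-1, 1]^2 via Tanh.
    def __init__(self, latent_dim=16, dim_hids=(128, 128)):
        super().__init__()
        dims = [latent_dim, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(dims[-1], 2), nn.Tanh()]  # no ReLU on the output
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)
```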
#### 1-2: Build the Discriminator Network
Implement the `Discriminator` class in `network.py`. The Discriminator takes a 2D point and outputs a real/fake probability:
- **Input**: a 2D point of shape `(batch_size, 2)`
- **Architecture**: fully-connected layers with dimensions `[2, dim_hids[0], …, dim_hids[-1], 1]`
- **Activation**: LeakyReLU (negative slope = 0.2) after every hidden layer
- **Output**: a scalar of shape `(batch_size, 1)` with a Sigmoid activation to produce a probability in [0, 1]
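And a matching Discriminator sketch under the same assumptions:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    # Maps a 2D point to a real/fake probability in [0, 1].
    def __init__(self, dim_hids=(128, 128)):
        super().__init__()
        dims = [2, *dim_hids]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(0.2)]
        layers += [nn.Linear(dims[-1], 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```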
#### 1-3: Implement the GAN Training Step
In `gan.py`, implement the `train_step()` function which performs one full update of both G and D:
**1. Discriminator update:**
- Sample a real batch x from the dataset
- Sample z ~ N(0, I) and generate fake samples: `x_fake = G(z)`
- Compute the discriminator BCE loss:
```
L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
```
- Zero grad on D optimizer, backpropagate, and update D only
**2. Generator update:**
- Sample a new batch of z ~ N(0, I)
- Compute the non-saturating generator loss:
```
L_G = −E[log D(G(z))]
```
- Zero grad on G optimizer, backpropagate, and update G only
> **Important**
> Always call `.detach()` on `x_fake` before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
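Putting the two updates together, a hedged sketch of `train_step()` (the argument list is illustrative; match the starter code's):

```python
import torch
import torch.nn.functional as F

def train_step(G, D, x_real, opt_G, opt_D, latent_dim, device):
    batch = x_real.shape[0]
    ones = torch.ones(batch, 1, device=device)
    zeros = torch.zeros(batch, 1, device=device)

    # Discriminator update: detach the fakes so no gradient reaches G.
    z = torch.randn(batch, latent_dim, device=device)
    x_fake = G(z)
    loss_D = (F.binary_cross_entropy(D(x_real), ones)
              + F.binary_cross_entropy(D(x_fake.detach()), zeros))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update: non-saturating loss -E[log D(G(z))], expressed
    # as BCE against the "real" label.
    z = torch.randn(batch, latent_dim, device=device)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```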
#### 1-4: Implement the Sampling Function
In `gan.py`, implement `sample(G, n_samples, latent_dim, device)`:
- Sample `n_samples` noise vectors z from N(0, I) with shape `(n_samples, latent_dim)`
- Pass through G to get generated 2D points
- Return as a NumPy array of shape `(n_samples, 2)`
- Use `torch.no_grad()` to disable gradient tracking during inference
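A minimal sketch:

```python
import torch

def sample(G, n_samples, latent_dim, device):
    # Draw z ~ N(0, I), map through G, return a (n_samples, 2) NumPy array.
    G.eval()
    with torch.no_grad():   # no gradient tracking at inference
        z = torch.randn(n_samples, latent_dim, device=device)
        points = G(z)
    return points.cpu().numpy()
```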
#### 1-5: Training and Evaluation
Run `gan_tutorial.ipynb`. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.
**Include in your report:**
- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?
---
## Task 2: Deep Convolutional GAN (DCGAN) on MNIST
Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.
### TODO
#### 2-1: Implement the DCGAN Generator
Implement `DCGenerator` in `task_2_dcgan/network.py` using transposed convolutions to upsample from noise to a full image:
- **Input**: noise vector z of shape `(batch_size, latent_dim, 1, 1)`, where `latent_dim = 100`
- Use `ConvTranspose2d` layers to upsample progressively to `(1, 28, 28)`
- **Channel sequence**: `latent_dim → 256 → 128 → 64 → 1`
- Apply `BatchNorm2d + ReLU` after every `ConvTranspose2d` except the last
- Apply Tanh to the final output
> **Tip**
> `ConvTranspose2d(kernel_size=4, stride=2, padding=1)` doubles spatial resolution. Use `kernel_size=4, stride=1, padding=0` for the first layer to go from 1×1 to 4×4.
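One possible layout is sketched below. Because 28 is not a power of two, the 4×4 → 7×7 step uses `kernel_size=3` rather than 4; everything else follows the tip above. Treat this as an illustration, not the required architecture.

```python
import torch.nn as nn

class DCGenerator(nn.Module):
    # Spatial path for 28x28 MNIST: 1x1 -> 4x4 -> 7x7 -> 14x14 -> 28x28.
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 3, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),  # no BatchNorm/ReLU here
        )

    def forward(self, z):
        return self.net(z)   # (batch, 1, 28, 28)
```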
#### 2-2: Implement the DCGAN Discriminator
Implement `DCDiscriminator` in `task_2_dcgan/network.py` using strided convolutions to downsample the input image:
- **Input**: grayscale image of shape `(batch_size, 1, 28, 28)`
- Use `Conv2d` layers to downsample to a single scalar output
- **Channel sequence**: `1 → 64 → 128 → 256 → 1`
- Apply `BatchNorm2d + LeakyReLU` (slope 0.2) after every `Conv2d` except the first and last
- Apply Sigmoid to the final output
> **Important**
> Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
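A matching sketch (spatial path 28 → 14 → 7 → 3 → 1; again illustrative):

```python
import torch.nn as nn

class DCDiscriminator(nn.Module):
    # No BatchNorm on the first layer (raw pixels) or the last (scalar output).
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 1, 3, 1, 0), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)   # (batch, 1) probability
```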
#### 2-3: Implement the DCGAN Training Loop
In `task_2_dcgan/dcgan.py`, implement `train_one_epoch()` which iterates over the full MNIST training set for one epoch. For each mini-batch:
**1. Discriminator update:**
- BCE loss on real images (label = 1) → `L_D_real`
- BCE loss on fake images G(z) (label = 0) → `L_D_fake`
- `L_D = L_D_real + L_D_fake` → `zero_grad`, `backward`, `step` the D optimizer
**2. Generator update:**
- Generate new fake images and compute: `L_G = BCE(D(G(z)), 1)`
- `zero_grad`, `backward`, `step` G optimizer
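A condensed sketch of the loop, assuming the dataloader yields `(image, label)` pairs with images normalized to [−1, 1] to match the generator's Tanh output:

```python
import torch
import torch.nn.functional as F

def train_one_epoch(G, D, loader, opt_G, opt_D, latent_dim, device):
    for x_real, _ in loader:                      # MNIST labels are unused
        x_real = x_real.to(device)
        b = x_real.size(0)
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Discriminator: real batch vs. detached fakes.
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_D = (F.binary_cross_entropy(D(x_real), ones)
                  + F.binary_cross_entropy(D(G(z).detach()), zeros))
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()

        # Generator: new fakes, scored against the "real" label.
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_G = F.binary_cross_entropy(D(G(z)), ones)
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
```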
#### 2-4: Weight Initialization
Implement `weights_init()` in `task_2_dcgan/network.py` and apply it via `model.apply(weights_init)`:
- `Conv2d` and `ConvTranspose2d`: initialize weights ~ N(0, 0.02)
- `BatchNorm2d`: initialize weights ~ N(1.0, 0.02), bias = 0
- All other layer types: leave unchanged
> **Hint**
> Use `isinstance(m, nn.Conv2d)` to check layer types. Use `torch.nn.init.normal_()` for weight initialization.
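A sketch that follows the specification directly:

```python
import torch.nn as nn

def weights_init(m):
    # DCGAN initialization: conv weights ~ N(0, 0.02);
    # BatchNorm weights ~ N(1.0, 0.02) with zero bias; others untouched.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)
```

Apply it once after constructing each network, e.g. `G.apply(weights_init)` and `D.apply(weights_init)`.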
#### 2-5: Training and Evaluation
Run `dcgan_tutorial.ipynb`. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.
**Include in your report:**
- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability
---
# Combined Submission Instructions
> **Both questions, one zip file, one PDF report**
## What to Submit
You will submit everything for both Question 1 (DDPM) and Question 2 (GAN) in a single zip file. There is no separate submission per question.
### Zip File Structure
Your zip file must follow this exact folder layout:
```
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                   <-- Your implementation
│   │   └── ddpm.py                      <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py            <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                   <-- Your implementation
│   │   └── gan.py                       <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                   <-- Your implementation
│       └── dcgan.py                     <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf              <-- Combined report
```
### Combined PDF Report
Write one single PDF report named `{NAME}_{STUDENT_ID}.pdf` that covers both questions. The report **must not exceed 5 pages** (excluding references). It should contain the following sections in order:
**Section 1 β DDPM (Question 1):**
- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis
**Section 2 β GAN (Question 2):**
- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis
### What NOT to Include
**Do NOT include in your zip:**
- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (`.pth`, `.ckpt` files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)
### Naming Convention
| Item       | Format                                                  |
|------------|---------------------------------------------------------|
| Zip file   | `{NAME}_{STUDENT_ID}.zip`, e.g. `JOHN_DOE_2024001.zip`  |
| PDF report | `{NAME}_{STUDENT_ID}.pdf`, e.g. `JOHN_DOE_2024001.pdf`  |
---
## Academic Integrity
You may consult the following reference papers while working on this assignment:
- Ho et al. (2020). *Denoising Diffusion Probabilistic Models.*
- Zhang et al. (2023). *Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).*
- Goodfellow et al. (2014). *Generative Adversarial Networks.*
- Radford et al. (2015). *Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).*
> It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment. |