# GEN AI Programming Assignment

## Generative Models

---

## Submission Policy

Both questions must be submitted together in a **SINGLE zip file** named:

```
{NAME}_{STUDENT_ID}.zip
```

The zip file must contain all code folders for both questions and one combined PDF report.

**Do NOT include** datasets, model checkpoints, or large binary files.

---
# Question 1 • 25 Marks

## Denoising Diffusion Probabilistic Models (DDPM)

> Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.

### Environment Setup

Create a conda environment named `ddpm` and install PyTorch:

```bash
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
### Code Structure

```
ddpm_assignment/
├── 2d_plot_diffusion_todo/ (Task 1)
│   ├── ddpm_tutorial.ipynb           <-- Main notebook
│   ├── dataset.py                    <-- Swiss-roll, moon, gaussians
│   ├── network.py                    <-- (TODO) Noise prediction network
│   └── ddpm.py                       <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/ (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py  <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py     <-- Basic UNet components
│   │   ├── controlnet.py             <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py    <-- Diffusion pipeline with ControlNet
│   ├── train.py                      <-- Training code
│   ├── train.sh                      <-- Hyperparameter script
│   └── inference.ipynb               <-- Inference notebook
└── requirements.txt
```
### Background

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to the data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.

A typical DDPM pipeline consists of three components:

- **Forward Process**: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
- **Reverse Process**: A learned neural network iteratively denoises x_T back to x₀, step by step
- **Training Objective**: The network is trained with a simplified noise-matching loss: predicting the noise ε added at each step

---

## Task 1: Simple DDPM Pipeline with Swiss-Roll

In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.
After completing your implementation, train the model and evaluate it by running `ddpm_tutorial.ipynb` in the `2d_plot_diffusion_todo` directory.

### TODO

#### 1-1: Build a Noise Prediction Network

Implement the noise prediction network in `network.py`. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of `TimeLinear` layers with feature dimensions:

```
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
```

- Every `TimeLinear` layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation; it directly outputs the predicted noise

> **Hint**
> `TimeLinear` is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
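
To make the layer structure concrete, here is a minimal sketch of one possible implementation. The `TimeLinear` signature, the `t_dim` embedding size, and the `sinusoidal_embedding` helper are illustrative assumptions; match them to the starter code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim):
    # Transformer-style sinusoidal embedding of integer timesteps (dim even).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeLinear(nn.Module):
    # Linear layer whose output is shifted by a projected timestep embedding,
    # so the embedding is added to the features before the activation.
    def __init__(self, dim_in, dim_out, t_dim=64):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.t_proj = nn.Linear(t_dim, dim_out)
        self.t_dim = t_dim

    def forward(self, x, t):
        return self.fc(x) + self.t_proj(sinusoidal_embedding(t, self.t_dim))

class NoisePredNet(nn.Module):
    def __init__(self, dim_in, dim_out, dim_hids):
        super().__init__()
        dims = [dim_in, *dim_hids, dim_out]
        self.layers = nn.ModuleList(
            TimeLinear(i, o) for i, o in zip(dims[:-1], dims[1:])
        )

    def forward(self, x, t):
        for i, layer in enumerate(self.layers):
            x = layer(x, t)
            if i < len(self.layers) - 1:  # ReLU on all but the final layer
                x = torch.relu(x)
        return x
```

For the 2D toy data this would be instantiated as, e.g., `NoisePredNet(2, 2, [128, 128, 128])` and called as `eps_pred = net(x_t, t)`.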
#### 1-2: Construct the Forward and Reverse Process

In `ddpm.py`, implement the three core functions of the DDPM pipeline (a sketch of the first two follows the note at the end of this section):

- **`q_sample(x_0, t, noise)`**: The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula:

  ```
  x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, where ε ~ N(0, I)
  ```

- **`p_sample(x_t, t)`**: One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t-1}

- **`p_sample_loop(shape)`**: Full reverse process. Starting from x_T ~ N(0, I), iterate `p_sample()` from t=T down to t=1 and return the final sample x₀

> **Important**
> Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
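
As a reference point, here is a minimal sketch of `q_sample` and `p_sample`. The starter code stores the schedule on the DDPM object, so the real signatures differ; this sketch passes the precomputed tensors (`alphas`, `alphas_cumprod`, `betas`) explicitly, and these names are assumptions.

```python
import torch

def q_sample(x0, t, noise, alphas_cumprod):
    # t: (B,) integer timesteps; gather ᾱ_t and broadcast over the feature dim
    a_bar = alphas_cumprod[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def p_sample(network, xt, t, alphas, alphas_cumprod, betas):
    # One reverse step x_t -> x_{t-1}; here t is a scalar int
    tb = torch.full((xt.shape[0],), t, device=xt.device, dtype=torch.long)
    eps = network(xt, tb)
    alpha, a_bar, beta = alphas[t], alphas_cumprod[t], betas[t]
    mean = (xt - beta / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
    if t > 1:
        return mean + beta.sqrt() * torch.randn_like(xt)  # σ_t² = β_t choice
    return mean  # no noise is added at the final step
```

`p_sample_loop` then draws `x_T ~ N(0, I)` with `torch.randn(shape)` and applies `p_sample` for t = T, …, 1.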
#### 1-3: Implement the Training Objective

In `ddpm.py`, implement `compute_loss()`. This function should (see the sketch after these steps):

1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
2. Sample noise ε ~ N(0, I) of the same shape as the input x₀
3. Compute the noised sample x_t using `q_sample()`
4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
5. Return the simplified noise-matching loss: **L = ||ε − ε̂||²**
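
A direct translation of the five steps, reusing `q_sample` from the sketch above; again, the schedule arguments are assumptions standing in for the starter code's buffers.

```python
import torch
import torch.nn.functional as F

def compute_loss(network, x0, alphas_cumprod, T):
    # 1. Uniform timesteps per batch element (check whether the starter
    #    code's schedule is 0- or 1-indexed and adjust the range)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    # 2. Noise of the same shape as x0
    noise = torch.randn_like(x0)
    # 3. Forward-diffuse to x_t
    xt = q_sample(x0, t, noise, alphas_cumprod)
    # 4.+5. Predict the noise and return the MSE, i.e. ||ε − ε̂||²
    return F.mse_loss(network(xt, t), noise)
```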
#### 1-4: Training and Evaluation

Once your implementation is complete, open and run `ddpm_tutorial.ipynb` via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.

**Include in your report:**

- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution

---
## Task 2: ControlNet on Fill50K Dataset

In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching a trainable copy of the encoder blocks, connected to the frozen model through zero-convolution layers.

### Prerequisites: Hugging Face Setup

Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:

- Sign into Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:

```bash
$ huggingface-cli login
```

Install the ControlNet environment:

```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
Verify your setup by generating a test image with Stable Diffusion:

```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
```
### TODO

#### Task 0: Generate Baseline Images

Using the 5 text prompts in `./task_1_controlnet/data/test_prompts.json`, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.

#### 2-1: Implement Zero-Convolution

In `diffusion/controlnet.py` (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.

> **Hint**
> Use `nn.Conv2d(channels, channels, kernel_size=1)` and explicitly set `weight.data` and `bias.data` to zero after initialization.
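
A minimal sketch following the hint; the helper name `zero_conv` is illustrative, not from the starter code.

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose output is exactly zero at the start of training
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```

Because the branch output starts at exactly zero, the first training steps leave the pretrained Stable Diffusion behavior untouched, and the conditioning signal is blended in gradually as the weights move away from zero.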
#### 2-2: Initialize ControlNet from Pretrained UNet

In `diffusion/controlnet.py` (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer-learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
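
One common way to perform the copy is to deep-copy the encoder-side submodules. The module names below (`conv_in`, `time_embedding`, `down_blocks`, `mid_block`) follow the usual diffusers UNet layout and are assumptions; verify them against the starter code.

```python
import copy

def init_controlnet_from_unet(controlnet, unet):
    # Clone the encoder path; ControlNet-only parts (conditioning embedder,
    # zero-convolutions) keep their own initialization.
    controlnet.conv_in = copy.deepcopy(unet.conv_in)
    controlnet.time_embedding = copy.deepcopy(unet.time_embedding)
    controlnet.down_blocks = copy.deepcopy(unet.down_blocks)
    controlnet.mid_block = copy.deepcopy(unet.mid_block)
    return controlnet
```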
#### 2-3: Apply Zero-Convolution to Residual Features

In `diffusion/controlnet.py` (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:

```
h_out = ZeroConv(h)
```

#### 2-4: Integrate ControlNet Outputs into UNet

In `diffusion/unets/unet_2d_condition.py` (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:

```
decoder_input = decoder_input + controlnet_residual
```

> **Important**
> Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.
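
Put together, the data flow for 2-3 and 2-4 looks roughly like the sketch below. The names `zero_convs`, `encoder_residuals`, and `down_block_res_samples` are illustrative; the starter code's variables will differ.

```python
# Inside the ControlNet forward (TODO 3): zero-conv each encoder residual.
controlnet_residuals = [
    zc(h) for zc, h in zip(self.zero_convs, encoder_residuals)
]

# Inside the UNet forward (TODO 4): element-wise addition onto the skip
# connections before the decoder, with no extra normalization.
if controlnet_residuals is not None:
    down_block_res_samples = [
        skip + res
        for skip, res in zip(down_block_res_samples, controlnet_residuals)
    ]
```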
#### 2-5: Train and Evaluate

Train ControlNet on the Fill50K dataset (automatically downloaded by the `load_dataset()` function in `train.py`) by running:

```bash
$ sh train.sh
```

Then, run `inference.ipynb` to generate images conditioned on 5 different edge maps from `./data/test_conditions`, using the text prompts in `data/test_prompts.json`.

**Include in your report:**

- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?

---
# Question 2 • 25 Marks

## Generative Adversarial Networks (GAN)

> Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.

### Environment Setup

Create a conda environment named `gan_assignment` and install the required packages:

```bash
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```

The `requirements.txt` includes: `numpy`, `matplotlib`, `scipy`, `tqdm`, and `jupyter`.

### Code Structure

```
gan_assignment/
├── task_1_vanilla_gan/ (Task 1)
│   ├── gan_tutorial.ipynb   <-- Main notebook
│   ├── dataset.py           <-- 2D toy dataset definitions
│   ├── network.py           <-- (TODO) Generator & Discriminator
│   └── gan.py               <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/ (Task 2)
│   ├── dcgan_tutorial.ipynb <-- Main notebook
│   ├── network.py           <-- (TODO) DCGAN architecture
│   └── dcgan.py             <-- (TODO) DCGAN training loop
└── requirements.txt
```
---

## Task 1: Vanilla GAN on 2D Swiss-Roll Data

Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.

### TODO

#### 1-1: Build the Generator Network

Implement the `Generator` class in `network.py`. The Generator maps a noise vector z to a 2D output point:

- **Input**: noise vector z of shape `(batch_size, latent_dim)`, with `latent_dim = 16` by default
- **Architecture**: fully-connected layers with dimensions `[latent_dim, dim_hids[0], …, dim_hids[-1], 2]`
- **Activation**: ReLU after every hidden layer (except the final output layer)
- **Output**: 2D point of shape `(batch_size, 2)` with a Tanh activation on the last layer

> **Hint**
> Use `nn.Sequential` or `nn.ModuleList` to stack your layers.

#### 1-2: Build the Discriminator Network

Implement the `Discriminator` class in `network.py`. The Discriminator takes a 2D point and outputs a real/fake probability (a combined sketch of both networks follows this list):

- **Input**: a 2D point of shape `(batch_size, 2)`
- **Architecture**: fully-connected layers with dimensions `[2, dim_hids[0], …, dim_hids[-1], 1]`
- **Activation**: LeakyReLU (negative slope = 0.2) after every hidden layer
- **Output**: a scalar of shape `(batch_size, 1)` with a Sigmoid activation to produce a probability in [0, 1]
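
A minimal sketch of both networks under the stated specs; the `dim_hids` defaults are illustrative.

```python
import torch.nn as nn

def mlp(dims, hidden_act, out_act):
    # Fully-connected stack: hidden_act after every hidden layer,
    # out_act after the final layer only.
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        layers.append(out_act if i == len(dims) - 2 else hidden_act)
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self, latent_dim=16, dim_hids=(128, 128)):
        super().__init__()
        self.net = mlp([latent_dim, *dim_hids, 2], nn.ReLU(), nn.Tanh())

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, dim_hids=(128, 128)):
        super().__init__()
        self.net = mlp([2, *dim_hids, 1], nn.LeakyReLU(0.2), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)
```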
#### 1-3: Implement the GAN Training Step

In `gan.py`, implement the `train_step()` function, which performs one full update of both G and D (a sketch follows the note at the end of this section):

**1. Discriminator update:**
- Sample a real batch x from the dataset
- Sample z ~ N(0, I) and generate fake samples: `x_fake = G(z)`
- Compute the discriminator BCE loss:

```
L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
```

- Zero grad on the D optimizer, backpropagate, and update D only

**2. Generator update:**
- Sample a new batch of z ~ N(0, I)
- Compute the non-saturating generator loss:

```
L_G = −E[log D(G(z))]
```

- Zero grad on the G optimizer, backpropagate, and update G only

> **Important**
> Always call `.detach()` on `x_fake` before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
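
A sketch of `train_step()` using `nn.BCELoss` (note that `BCE(D(x), 1) = −log D(x)`, so the two losses below match the formulas above). The argument list and optimizer names are assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(G, D, x_real, opt_g, opt_d, latent_dim, device):
    b = x_real.shape[0]
    ones = torch.ones(b, 1, device=device)
    zeros = torch.zeros(b, 1, device=device)

    # --- Discriminator update ---
    z = torch.randn(b, latent_dim, device=device)
    x_fake = G(z)
    # detach() blocks gradients from flowing into G during D's update
    loss_d = bce(D(x_real), ones) + bce(D(x_fake.detach()), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator update (non-saturating loss) ---
    z = torch.randn(b, latent_dim, device=device)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    return loss_d.item(), loss_g.item()
```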
#### 1-4: Implement the Sampling Function

In `gan.py`, implement `sample(G, n_samples, latent_dim, device)` (a sketch follows this list):

- Sample `n_samples` noise vectors z from N(0, I) with shape `(n_samples, latent_dim)`
- Pass them through G to get generated 2D points
- Return the result as a NumPy array of shape `(n_samples, 2)`
- Use `torch.no_grad()` to disable gradient tracking during inference
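
A direct sketch of the four steps:

```python
import torch

@torch.no_grad()  # no gradient tracking at inference time
def sample(G, n_samples, latent_dim, device):
    z = torch.randn(n_samples, latent_dim, device=device)
    points = G(z)              # (n_samples, 2)
    return points.cpu().numpy()
```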
#### 1-5: Training and Evaluation

Run `gan_tutorial.ipynb`. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.

**Include in your report:**

- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?

---
## Task 2: Deep Convolutional GAN (DCGAN) on MNIST

Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.

### TODO

#### 2-1: Implement the DCGAN Generator

Implement `DCGenerator` in `task_2_dcgan/network.py` using transposed convolutions to upsample from noise to a full image (a sketch follows the tip below):

- **Input**: noise vector z of shape `(batch_size, latent_dim, 1, 1)`, where `latent_dim = 100`
- Use `ConvTranspose2d` layers to upsample progressively to `(1, 28, 28)`
- **Channel sequence**: `latent_dim → 256 → 128 → 64 → 1`
- Apply `BatchNorm2d + ReLU` after every `ConvTranspose2d` except the last
- Apply Tanh to the final output

> **Tip**
> `ConvTranspose2d(kernel_size=4, stride=2, padding=1)` doubles spatial resolution. Note that starting from 4×4 and doubling three times gives 32×32, not 28×28; with the channel sequence above, one way to reach 28×28 is a first layer with `kernel_size=7, stride=1, padding=0` (1×1 → 7×7), two doubling layers (7 → 14 → 28), and a final size-preserving layer (`kernel_size=3, stride=1, padding=1`).
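
A sketch of one layout that satisfies these specs; the kernel sizes are one workable choice, not mandated by the starter code.

```python
import torch.nn as nn

class DCGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # (latent_dim, 1, 1) -> (256, 7, 7)
            nn.ConvTranspose2d(latent_dim, 256, kernel_size=7, stride=1, padding=0),
            nn.BatchNorm2d(256), nn.ReLU(),
            # (256, 7, 7) -> (128, 14, 14)
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # (128, 14, 14) -> (64, 28, 28)
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            # (64, 28, 28) -> (1, 28, 28); Tanh output in [-1, 1]
            nn.ConvTranspose2d(64, 1, kernel_size=3, stride=1, padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)
```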
#### 2-2: Implement the DCGAN Discriminator

Implement `DCDiscriminator` in `task_2_dcgan/network.py` using strided convolutions to downsample the input image (a sketch follows the note below):

- **Input**: grayscale image of shape `(batch_size, 1, 28, 28)`
- Use `Conv2d` layers to downsample to a single scalar output
- **Channel sequence**: `1 → 64 → 128 → 256 → 1`
- Apply `LeakyReLU` (slope 0.2) after every `Conv2d` except the last, with `BatchNorm2d` before each LeakyReLU except after the first `Conv2d`
- Apply Sigmoid to the final output

> **Important**
> Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
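
A matching sketch for the discriminator; again, kernel sizes are one workable choice.

```python
import torch.nn as nn

class DCDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # (1, 28, 28) -> (64, 14, 14); no BatchNorm on raw pixels
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # (64, 14, 14) -> (128, 7, 7)
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            # (128, 7, 7) -> (256, 3, 3)
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            # (256, 3, 3) -> (1, 1, 1); probability output, no BatchNorm
            nn.Conv2d(256, 1, kernel_size=3, stride=1, padding=0),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)
```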
#### 2-3: Implement the DCGAN Training Loop

In `task_2_dcgan/dcgan.py`, implement `train_one_epoch()`, which iterates over the full MNIST training set for one epoch. For each mini-batch (see the sketch after this list):

**1. Discriminator update:**
- BCE loss on real images (label = 1) → `L_D_real`
- BCE loss on fake images G(z) (label = 0) → `L_D_fake`
- `L_D = L_D_real + L_D_fake` → `zero_grad`, `backward`, `step` the D optimizer

**2. Generator update:**
- Generate new fake images and compute `L_G = BCE(D(G(z)), 1)`
- `zero_grad`, `backward`, `step` the G optimizer
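
A per-epoch sketch built on the same BCE pattern as Task 1; `loader`, `opt_g`, and `opt_d` are assumed to be provided by the surrounding training script.

```python
import torch
import torch.nn as nn

def train_one_epoch(G, D, loader, opt_g, opt_d, latent_dim, device):
    bce = nn.BCELoss()
    for x_real, _ in loader:  # class labels are unused
        x_real = x_real.to(device)
        b = x_real.shape[0]
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Discriminator: real -> 1, fake -> 0
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_d = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # Generator: push D to classify fresh fakes as real
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_g = bce(D(G(z)), ones)
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```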
#### 2-4: Weight Initialization

Implement `weights_init()` in `task_2_dcgan/network.py` and apply it via `model.apply(weights_init)` (a sketch follows the hint below):

- `Conv2d` and `ConvTranspose2d`: initialize weights from N(0, 0.02²), i.e. mean 0, std 0.02
- `BatchNorm2d`: initialize weights from N(1.0, 0.02²), bias = 0
- All other layer types: leave unchanged

> **Hint**
> Use `isinstance(m, nn.Conv2d)` to check layer types, and `torch.nn.init.normal_()` for the weight initialization.
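
A sketch following the hint:

```python
import torch.nn as nn

def weights_init(m):
    # Called once per submodule via model.apply(weights_init)
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)
    # all other layer types are left unchanged
```

Apply it to both networks before training, e.g. `G.apply(weights_init)` and `D.apply(weights_init)`.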
#### 2-5: Training and Evaluation

Run `dcgan_tutorial.ipynb`. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.

**Include in your report:**

- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability

---
# Combined Submission Instructions

> **Both questions • one zip file • one PDF report**

## What to Submit

You will submit everything for both Question 1 (DDPM) and Question 2 (GAN) in a single zip file. There is no separate submission per question.

### Zip File Structure

Your zip file must follow this exact folder layout:

```
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                    <-- Your implementation
│   │   └── ddpm.py                       <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py             <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py  <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                    <-- Your implementation
│   │   └── gan.py                        <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                    <-- Your implementation
│       └── dcgan.py                      <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf               <-- Combined report
```
### Combined PDF Report

Write one single PDF report named `{NAME}_{STUDENT_ID}.pdf` that covers both questions. The report **must not exceed 5 pages** (excluding references). It should contain the following sections in order:

**Section 1 – DDPM (Question 1):**
- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis

**Section 2 – GAN (Question 2):**
- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis
### What NOT to Include

**Do NOT include in your zip:**
- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (`.pth`, `.ckpt` files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)

### Naming Convention

| Item       | Format                                                 |
|------------|--------------------------------------------------------|
| Zip file   | `{NAME}_{STUDENT_ID}.zip`, e.g. `JOHN_DOE_2024001.zip` |
| PDF report | `{NAME}_{STUDENT_ID}.pdf`, e.g. `JOHN_DOE_2024001.pdf` |
---

## Academic Integrity

You may consult the following reference papers while working on this assignment:

- Ho et al. (2020). *Denoising Diffusion Probabilistic Models.*
- Zhang et al. (2023). *Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).*
- Goodfellow et al. (2014). *Generative Adversarial Networks.*
- Radford et al. (2015). *Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).*

> It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.