# GEN AI Programming Assignment

## Generative Models

---

## Submission Policy

Both questions must be submitted together in a **SINGLE zip file** named:

```
{NAME}_{STUDENT_ID}.zip
```

The zip file must contain all code folders for both questions and one combined PDF report.

**Do NOT include** datasets, model checkpoints, or large binary files.

---
# Question 1 • 25 Marks

## Denoising Diffusion Probabilistic Models (DDPM)

> Implement DDPM from scratch: forward/reverse process, training objective, and ControlNet conditioning.

### Environment Setup

Create a conda environment named `ddpm` and install PyTorch:

```bash
conda create --name ddpm python=3.10
conda activate ddpm
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
### Code Structure

```
ddpm_assignment/
├── 2d_plot_diffusion_todo/ (Task 1)
│   ├── ddpm_tutorial.ipynb           <-- Main notebook
│   ├── dataset.py                    <-- Swiss-roll, moon, gaussians
│   ├── network.py                    <-- (TODO) Noise prediction network
│   └── ddpm.py                       <-- (TODO) DDPM pipeline
│
├── task_1_controlnet/ (Task 2)
│   ├── diffusion/
│   │   ├── unets/
│   │   │   ├── unet_2d_condition.py  <-- (TODO) Integrate ControlNet into UNet
│   │   │   └── unet_2d_blocks.py     <-- Basic UNet components
│   │   ├── controlnet.py             <-- (TODO) Implement ControlNet
│   │   └── pipeline_controlnet.py    <-- Diffusion pipeline with ControlNet
│   ├── train.py                      <-- Training code
│   ├── train.sh                      <-- Hyperparameter script
│   └── inference.ipynb               <-- Inference notebook
└── requirements.txt
```
### Background

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models that learn to reverse a gradual noising process. The model is trained to predict the noise added to the data at each step, and generates new samples by iteratively denoising from pure Gaussian noise.

A typical DDPM pipeline consists of three components:

- **Forward Process**: Gradually adds Gaussian noise to a data sample over T timesteps, producing a sequence x₀ → x₁ → … → x_T
- **Reverse Process**: A learned neural network iteratively denoises x_T back to x₀, step by step
- **Training Objective**: The network is trained with a simplified noise-matching loss: predicting the noise ε added at each step

---

## Task 1: Simple DDPM Pipeline with Swiss-Roll

In this task, you will implement a DDPM to learn a 2D Swiss-Roll distribution. This toy experiment lets you understand each component of the diffusion pipeline before scaling to images.
After completing your implementation, train the model and evaluate it by running `ddpm_tutorial.ipynb` in the `2d_plot_diffusion_todo` directory.

### TODO

#### 1-1: Build a Noise Prediction Network

Implement the noise prediction network in `network.py`. The network takes a noisy data point and a timestep embedding as input, and predicts the noise ε added at that step. It should consist of `TimeLinear` layers with feature dimensions:

```
[dim_in, dim_hids[0], ..., dim_hids[-1], dim_out]
```

- Every `TimeLinear` layer except the final output layer must be followed by a ReLU activation
- The final layer has no activation; it directly outputs the predicted noise

> **Hint**
> `TimeLinear` is a linear layer that is conditioned on a sinusoidal timestep embedding. The timestep embedding is added to the hidden features before the activation at each layer.
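
To make the layer structure concrete, here is a minimal sketch of one possible implementation. The `TimeLinear` signature, the `t_dim` embedding size, and the `sinusoidal_embedding` helper are illustrative assumptions; match them to the starter code.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t, dim):
    # Transformer-style sinusoidal embedding of integer timesteps (dim even).
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeLinear(nn.Module):
    # Linear layer whose output is shifted by a projected timestep embedding,
    # so the embedding is added to the features before the activation.
    def __init__(self, dim_in, dim_out, t_dim=64):
        super().__init__()
        self.fc = nn.Linear(dim_in, dim_out)
        self.t_proj = nn.Linear(t_dim, dim_out)
        self.t_dim = t_dim

    def forward(self, x, t):
        return self.fc(x) + self.t_proj(sinusoidal_embedding(t, self.t_dim))

class NoisePredNet(nn.Module):
    def __init__(self, dim_in, dim_out, dim_hids):
        super().__init__()
        dims = [dim_in, *dim_hids, dim_out]
        self.layers = nn.ModuleList(
            TimeLinear(i, o) for i, o in zip(dims[:-1], dims[1:])
        )

    def forward(self, x, t):
        for i, layer in enumerate(self.layers):
            x = layer(x, t)
            if i < len(self.layers) - 1:  # ReLU on all but the final layer
                x = torch.relu(x)
        return x
```

For the 2D toy data this would be instantiated as, e.g., `NoisePredNet(2, 2, [128, 128, 128])` and called as `eps_pred = net(x_t, t)`.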
#### 1-2: Construct the Forward and Reverse Process

In `ddpm.py`, implement the three core functions of the DDPM pipeline (a sketch of the first two follows the note at the end of this section):

- **`q_sample(x_0, t, noise)`**: The forward process. Given a clean sample x₀ and timestep t, return the noised sample x_t using the closed-form formula:

  ```
  x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε, where ε ~ N(0, I)
  ```

- **`p_sample(x_t, t)`**: One-step reverse transition. Use the trained network to predict ε, then compute the denoised estimate of x_{t-1}

- **`p_sample_loop(shape)`**: Full reverse process. Starting from x_T ~ N(0, I), iterate `p_sample()` from t=T down to t=1 and return the final sample x₀

> **Important**
> Use the pre-computed noise schedule (α_t, ᾱ_t, β_t) provided in the starter code. Do not redefine the schedule inside these functions.
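
As a reference point, here is a minimal sketch of `q_sample` and `p_sample`. The starter code stores the schedule on the DDPM object, so the real signatures differ; this sketch passes the precomputed tensors (`alphas`, `alphas_cumprod`, `betas`) explicitly, and these names are assumptions.

```python
import torch

def q_sample(x0, t, noise, alphas_cumprod):
    # t: (B,) integer timesteps; gather ᾱ_t and broadcast over the feature dim
    a_bar = alphas_cumprod[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def p_sample(network, xt, t, alphas, alphas_cumprod, betas):
    # One reverse step x_t -> x_{t-1}; here t is a scalar int
    tb = torch.full((xt.shape[0],), t, device=xt.device, dtype=torch.long)
    eps = network(xt, tb)
    alpha, a_bar, beta = alphas[t], alphas_cumprod[t], betas[t]
    mean = (xt - beta / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
    if t > 1:
        return mean + beta.sqrt() * torch.randn_like(xt)  # σ_t² = β_t choice
    return mean  # no noise is added at the final step
```

`p_sample_loop` then draws `x_T ~ N(0, I)` with `torch.randn(shape)` and applies `p_sample` for t = T, …, 1.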
#### 1-3: Implement the Training Objective

In `ddpm.py`, implement `compute_loss()`. This function should (see the sketch after these steps):

1. Sample a random timestep t uniformly from {1, …, T} for each element in the batch
2. Sample noise ε ~ N(0, I) of the same shape as the input x₀
3. Compute the noised sample x_t using `q_sample()`
4. Pass x_t and t to the noise prediction network to obtain the predicted noise ε̂
5. Return the simplified noise-matching loss: **L = ||ε − ε̂||²**
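
A direct translation of the five steps, reusing `q_sample` from the sketch above; again, the schedule arguments are assumptions standing in for the starter code's buffers.

```python
import torch
import torch.nn.functional as F

def compute_loss(network, x0, alphas_cumprod, T):
    # 1. Uniform timesteps per batch element (check whether the starter
    #    code's schedule is 0- or 1-indexed and adjust the range)
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    # 2. Noise of the same shape as x0
    noise = torch.randn_like(x0)
    # 3. Forward-diffuse to x_t
    xt = q_sample(x0, t, noise, alphas_cumprod)
    # 4.+5. Predict the noise and return the MSE, i.e. ||ε − ε̂||²
    return F.mse_loss(network(xt, t), noise)
```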
#### 1-4: Training and Evaluation

Once your implementation is complete, open and run `ddpm_tutorial.ipynb` via Jupyter Notebook. The notebook will automatically train the diffusion model and measure the Chamfer Distance (CD) between 2D particles sampled by the model and particles from the true Swiss-Roll distribution.

**Include in your report:**

- The training loss curve
- The Chamfer Distance (CD) value reported after running the notebook
- A visualization of the sampled 2D particles vs. the real Swiss-Roll distribution

---
## Task 2: ControlNet on Fill50K Dataset

In this task, you will implement ControlNet, a method that adds spatial conditioning (e.g., edge maps) to a pretrained Stable Diffusion model by attaching a trainable copy of the encoder blocks, connected to the frozen model through zero-convolution layers.

### Prerequisites: Hugging Face Setup

Before beginning, set up Hugging Face access to download the pretrained Stable Diffusion model:

- Sign into Hugging Face at https://huggingface.co
- Obtain your Access Token at https://huggingface.co/settings/tokens
- Log in from your terminal:

```bash
$ huggingface-cli login
```

Install the ControlNet environment:

```bash
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
Verify your setup by generating a test image with Stable Diffusion:

```python
import torch
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("test.png")
```
### TODO

#### Task 0: Generate Baseline Images

Using the 5 text prompts in `./task_1_controlnet/data/test_prompts.json`, generate 5 baseline images with the pretrained Stable Diffusion model (without ControlNet). These will serve as your comparison baseline in the report.

#### 2-1: Implement Zero-Convolution

In `diffusion/controlnet.py` (TODO 1), implement the zero-convolution operation. A zero-convolution is a 1×1 convolution layer whose weights and biases are both initialized to zero at the start of training. This ensures that ControlNet begins training without disrupting the pretrained Stable Diffusion outputs.

> **Hint**
> Use `nn.Conv2d(channels, channels, kernel_size=1)` and explicitly set `weight.data` and `bias.data` to zero after initialization.
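
A minimal sketch following the hint; the helper name `zero_conv` is illustrative, not from the starter code.

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose output is exactly zero at the start of training
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```

Because the branch output starts at exactly zero, the first training steps leave the pretrained Stable Diffusion behavior untouched, and the conditioning signal is blended in gradually as the weights move away from zero.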
#### 2-2: Initialize ControlNet from Pretrained UNet

In `diffusion/controlnet.py` (TODO 2), initialize the ControlNet encoder by copying weights from the pretrained UNet encoder blocks. This transfer-learning approach allows ControlNet to start from a strong pretrained feature extractor rather than training from scratch.
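
One common way to perform the copy is to deep-copy the encoder-side submodules. The module names below (`conv_in`, `time_embedding`, `down_blocks`, `mid_block`) follow the usual diffusers UNet layout and are assumptions; verify them against the starter code.

```python
import copy

def init_controlnet_from_unet(controlnet, unet):
    # Clone the encoder path; ControlNet-only parts (conditioning embedder,
    # zero-convolutions) keep their own initialization.
    controlnet.conv_in = copy.deepcopy(unet.conv_in)
    controlnet.time_embedding = copy.deepcopy(unet.time_embedding)
    controlnet.down_blocks = copy.deepcopy(unet.down_blocks)
    controlnet.mid_block = copy.deepcopy(unet.mid_block)
    return controlnet
```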
#### 2-3: Apply Zero-Convolution to Residual Features

In `diffusion/controlnet.py` (TODO 3), apply the zero-convolution layers to the residual feature maps output by each ControlNet encoder block before they are passed to the UNet decoder. Specifically, for each block output h, compute:

```
h_out = ZeroConv(h)
```

#### 2-4: Integrate ControlNet Outputs into UNet

In `diffusion/unets/unet_2d_condition.py` (TODO 4), modify the UNet decoder to add the ControlNet residual features to the corresponding UNet decoder skip connections. Each ControlNet block output is added element-wise to the matching UNet decoder input:

```
decoder_input = decoder_input + controlnet_residual
```

> **Important**
> Do not apply any additional normalization to the ControlNet residuals before adding them to the UNet features. The zero-convolution already handles the initial scaling.
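
Put together, the data flow for 2-3 and 2-4 looks roughly like the sketch below. The names `zero_convs`, `encoder_residuals`, and `down_block_res_samples` are illustrative; the starter code's variables will differ.

```python
# Inside the ControlNet forward (TODO 3): zero-conv each encoder residual.
controlnet_residuals = [
    zc(h) for zc, h in zip(self.zero_convs, encoder_residuals)
]

# Inside the UNet forward (TODO 4): element-wise addition onto the skip
# connections before the decoder, with no extra normalization.
if controlnet_residuals is not None:
    down_block_res_samples = [
        skip + res
        for skip, res in zip(down_block_res_samples, controlnet_residuals)
    ]
```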
#### 2-5: Train and Evaluate

Train ControlNet on the Fill50K dataset (automatically downloaded by the `load_dataset()` function in `train.py`) by running:

```bash
$ sh train.sh
```

Then, run `inference.ipynb` to generate images conditioned on 5 different edge maps from `./data/test_conditions`, using the text prompts in `data/test_prompts.json`.

**Include in your report:**

- The 5 baseline images generated by Stable Diffusion (Task 0) with their text prompts
- The 5 condition inputs (edge maps), corresponding text prompts, and ControlNet-generated images
- A brief analysis of each condition: does the generated image accurately follow the edge map?

---
# Question 2 • 25 Marks

## Generative Adversarial Networks (GAN)

> Implement a Vanilla GAN on 2D Swiss-Roll data and a DCGAN on MNIST handwritten digits.

### Environment Setup

Create a conda environment named `gan_assignment` and install the required packages:

```bash
conda create --name gan_assignment python=3.10
conda activate gan_assignment
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```

The `requirements.txt` includes: `numpy`, `matplotlib`, `scipy`, `tqdm`, and `jupyter`.

### Code Structure

```
gan_assignment/
├── task_1_vanilla_gan/ (Task 1)
│   ├── gan_tutorial.ipynb   <-- Main notebook
│   ├── dataset.py           <-- 2D toy dataset definitions
│   ├── network.py           <-- (TODO) Generator & Discriminator
│   └── gan.py               <-- (TODO) GAN training pipeline
│
├── task_2_dcgan/ (Task 2)
│   ├── dcgan_tutorial.ipynb <-- Main notebook
│   ├── network.py           <-- (TODO) DCGAN architecture
│   └── dcgan.py             <-- (TODO) DCGAN training loop
└── requirements.txt
```
---

## Task 1: Vanilla GAN on 2D Swiss-Roll Data

Implement a fully-connected GAN to learn a 2D Swiss-Roll distribution. This toy experiment gives you hands-on experience with the adversarial training loop before scaling to image generation.

### TODO

#### 1-1: Build the Generator Network

Implement the `Generator` class in `network.py`. The Generator maps a noise vector z to a 2D output point:

- **Input**: noise vector z of shape `(batch_size, latent_dim)`, with `latent_dim = 16` by default
- **Architecture**: fully-connected layers with dimensions `[latent_dim, dim_hids[0], …, dim_hids[-1], 2]`
- **Activation**: ReLU after every hidden layer (except the final output layer)
- **Output**: 2D point of shape `(batch_size, 2)` with a Tanh activation on the last layer

> **Hint**
> Use `nn.Sequential` or `nn.ModuleList` to stack your layers.

#### 1-2: Build the Discriminator Network

Implement the `Discriminator` class in `network.py`. The Discriminator takes a 2D point and outputs a real/fake probability (a combined sketch of both networks follows this list):

- **Input**: a 2D point of shape `(batch_size, 2)`
- **Architecture**: fully-connected layers with dimensions `[2, dim_hids[0], …, dim_hids[-1], 1]`
- **Activation**: LeakyReLU (negative slope = 0.2) after every hidden layer
- **Output**: a scalar of shape `(batch_size, 1)` with a Sigmoid activation to produce a probability in [0, 1]
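
A minimal sketch of both networks under the stated specs; the `dim_hids` defaults are illustrative.

```python
import torch.nn as nn

def mlp(dims, hidden_act, out_act):
    # Fully-connected stack: hidden_act after every hidden layer,
    # out_act after the final layer only.
    layers = []
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        layers.append(nn.Linear(d_in, d_out))
        layers.append(out_act if i == len(dims) - 2 else hidden_act)
    return nn.Sequential(*layers)

class Generator(nn.Module):
    def __init__(self, latent_dim=16, dim_hids=(128, 128)):
        super().__init__()
        self.net = mlp([latent_dim, *dim_hids, 2], nn.ReLU(), nn.Tanh())

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, dim_hids=(128, 128)):
        super().__init__()
        self.net = mlp([2, *dim_hids, 1], nn.LeakyReLU(0.2), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)
```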
#### 1-3: Implement the GAN Training Step

In `gan.py`, implement the `train_step()` function, which performs one full update of both G and D (a sketch follows the note at the end of this section):

**1. Discriminator update:**
- Sample a real batch x from the dataset
- Sample z ~ N(0, I) and generate fake samples: `x_fake = G(z)`
- Compute the discriminator BCE loss:

```
L_D = −E[log D(x_real)] − E[log(1 − D(x_fake.detach()))]
```

- Zero grad on the D optimizer, backpropagate, and update D only

**2. Generator update:**
- Sample a new batch of z ~ N(0, I)
- Compute the non-saturating generator loss:

```
L_G = −E[log D(G(z))]
```

- Zero grad on the G optimizer, backpropagate, and update G only

> **Important**
> Always call `.detach()` on `x_fake` before passing it to D during the discriminator update. This stops gradients from flowing back into G during D's update step.
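
A sketch of `train_step()` using `nn.BCELoss` (note that `BCE(D(x), 1) = −log D(x)`, so the two losses below match the formulas above). The argument list and optimizer names are assumptions.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(G, D, x_real, opt_g, opt_d, latent_dim, device):
    b = x_real.shape[0]
    ones = torch.ones(b, 1, device=device)
    zeros = torch.zeros(b, 1, device=device)

    # --- Discriminator update ---
    z = torch.randn(b, latent_dim, device=device)
    x_fake = G(z)
    # detach() blocks gradients from flowing into G during D's update
    loss_d = bce(D(x_real), ones) + bce(D(x_fake.detach()), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator update (non-saturating loss) ---
    z = torch.randn(b, latent_dim, device=device)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    return loss_d.item(), loss_g.item()
```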
#### 1-4: Implement the Sampling Function

In `gan.py`, implement `sample(G, n_samples, latent_dim, device)` (a sketch follows this list):

- Sample `n_samples` noise vectors z from N(0, I) with shape `(n_samples, latent_dim)`
- Pass them through G to get generated 2D points
- Return the result as a NumPy array of shape `(n_samples, 2)`
- Use `torch.no_grad()` to disable gradient tracking during inference
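
A direct sketch of the four steps:

```python
import torch

@torch.no_grad()  # no gradient tracking at inference time
def sample(G, n_samples, latent_dim, device):
    z = torch.randn(n_samples, latent_dim, device=device)
    points = G(z)              # (n_samples, 2)
    return points.cpu().numpy()
```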
#### 1-5: Training and Evaluation

Run `gan_tutorial.ipynb`. The notebook trains the GAN for 5000 iterations and reports the Chamfer Distance (CD) between generated and real Swiss-Roll points.

**Include in your report:**

- G and D training loss curves (on the same plot or side-by-side)
- The Chamfer Distance (CD) value
- A scatter plot of generated 2D points vs. real Swiss-Roll data
- Brief analysis (2–3 sentences): did the GAN learn the distribution? Did you observe mode collapse or instability?

---
## Task 2: Deep Convolutional GAN (DCGAN) on MNIST

Implement a DCGAN to generate handwritten digit images. DCGAN replaces fully-connected layers with convolutional layers, significantly improving image generation quality.

### TODO

#### 2-1: Implement the DCGAN Generator

Implement `DCGenerator` in `task_2_dcgan/network.py` using transposed convolutions to upsample from noise to a full image (a sketch follows the tip below):

- **Input**: noise vector z of shape `(batch_size, latent_dim, 1, 1)`, where `latent_dim = 100`
- Use `ConvTranspose2d` layers to upsample progressively to `(1, 28, 28)`
- **Channel sequence**: `latent_dim → 256 → 128 → 64 → 1`
- Apply `BatchNorm2d + ReLU` after every `ConvTranspose2d` except the last
- Apply Tanh to the final output

> **Tip**
> `ConvTranspose2d(kernel_size=4, stride=2, padding=1)` doubles spatial resolution. Note that starting from 4×4 and doubling three times gives 32×32, not 28×28; with the channel sequence above, one way to reach 28×28 is a first layer with `kernel_size=7, stride=1, padding=0` (1×1 → 7×7), two doubling layers (7 → 14 → 28), and a final size-preserving layer (`kernel_size=3, stride=1, padding=1`).
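
A sketch of one layout that satisfies these specs; the kernel sizes are one workable choice, not mandated by the starter code.

```python
import torch.nn as nn

class DCGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # (latent_dim, 1, 1) -> (256, 7, 7)
            nn.ConvTranspose2d(latent_dim, 256, kernel_size=7, stride=1, padding=0),
            nn.BatchNorm2d(256), nn.ReLU(),
            # (256, 7, 7) -> (128, 14, 14)
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            # (128, 14, 14) -> (64, 28, 28)
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            # (64, 28, 28) -> (1, 28, 28); Tanh output in [-1, 1]
            nn.ConvTranspose2d(64, 1, kernel_size=3, stride=1, padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)
```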
#### 2-2: Implement the DCGAN Discriminator

Implement `DCDiscriminator` in `task_2_dcgan/network.py` using strided convolutions to downsample the input image (a sketch follows the note below):

- **Input**: grayscale image of shape `(batch_size, 1, 28, 28)`
- Use `Conv2d` layers to downsample to a single scalar output
- **Channel sequence**: `1 → 64 → 128 → 256 → 1`
- Apply `LeakyReLU` (slope 0.2) after every `Conv2d` except the last, with `BatchNorm2d` before each LeakyReLU except after the first `Conv2d`
- Apply Sigmoid to the final output

> **Important**
> Do NOT apply BatchNorm to the first layer of the discriminator (raw pixel input) or the last layer. This is standard DCGAN practice for training stability.
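
A matching sketch for the discriminator; again, kernel sizes are one workable choice.

```python
import torch.nn as nn

class DCDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # (1, 28, 28) -> (64, 14, 14); no BatchNorm on raw pixels
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # (64, 14, 14) -> (128, 7, 7)
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            # (128, 7, 7) -> (256, 3, 3)
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            # (256, 3, 3) -> (1, 1, 1); probability output, no BatchNorm
            nn.Conv2d(256, 1, kernel_size=3, stride=1, padding=0),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)
```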
#### 2-3: Implement the DCGAN Training Loop

In `task_2_dcgan/dcgan.py`, implement `train_one_epoch()`, which iterates over the full MNIST training set for one epoch. For each mini-batch (see the sketch after this list):

**1. Discriminator update:**
- BCE loss on real images (label = 1) → `L_D_real`
- BCE loss on fake images G(z) (label = 0) → `L_D_fake`
- `L_D = L_D_real + L_D_fake` → `zero_grad`, `backward`, `step` the D optimizer

**2. Generator update:**
- Generate new fake images and compute `L_G = BCE(D(G(z)), 1)`
- `zero_grad`, `backward`, `step` the G optimizer
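
A per-epoch sketch built on the same BCE pattern as Task 1; `loader`, `opt_g`, and `opt_d` are assumed to be provided by the surrounding training script.

```python
import torch
import torch.nn as nn

def train_one_epoch(G, D, loader, opt_g, opt_d, latent_dim, device):
    bce = nn.BCELoss()
    for x_real, _ in loader:  # class labels are unused
        x_real = x_real.to(device)
        b = x_real.shape[0]
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Discriminator: real -> 1, fake -> 0
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_d = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

        # Generator: push D to classify fresh fakes as real
        z = torch.randn(b, latent_dim, 1, 1, device=device)
        loss_g = bce(D(G(z)), ones)
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```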
#### 2-4: Weight Initialization

Implement `weights_init()` in `task_2_dcgan/network.py` and apply it via `model.apply(weights_init)` (a sketch follows the hint below):

- `Conv2d` and `ConvTranspose2d`: initialize weights from N(0, 0.02²), i.e. mean 0, std 0.02
- `BatchNorm2d`: initialize weights from N(1.0, 0.02²), bias = 0
- All other layer types: leave unchanged

> **Hint**
> Use `isinstance(m, nn.Conv2d)` to check layer types, and `torch.nn.init.normal_()` for the weight initialization.
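
A sketch following the hint:

```python
import torch.nn as nn

def weights_init(m):
    # Called once per submodule via model.apply(weights_init)
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)
        nn.init.zeros_(m.bias)
    # all other layer types are left unchanged
```

Apply it to both networks before training, e.g. `G.apply(weights_init)` and `D.apply(weights_init)`.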
#### 2-5: Training and Evaluation

Run `dcgan_tutorial.ipynb`. The notebook trains DCGAN on MNIST for 20 epochs, shows a 4×8 grid of generated digits per epoch, and reports the Fréchet Inception Distance (FID) score.

**Include in your report:**

- G and D training loss curves over all iterations
- A 4×8 grid of generated MNIST digits from your final trained model
- The FID score reported by the notebook
- Brief analysis (2–3 sentences): comment on image quality, diversity, and any observed instability

---
# Combined Submission Instructions

> **Both questions • one zip file • one PDF report**

## What to Submit

You will submit everything for both Question 1 (DDPM) and Question 2 (GAN) in a single zip file. There is no separate submission per question.

### Zip File Structure

Your zip file must follow this exact folder layout:

```
{NAME}_{STUDENT_ID}.zip
├── ddpm_assignment/
│   ├── 2d_plot_diffusion_todo/
│   │   ├── network.py                    <-- Your implementation
│   │   └── ddpm.py                       <-- Your implementation
│   └── task_1_controlnet/
│       └── diffusion/
│           ├── controlnet.py             <-- Your implementation
│           └── unets/
│               └── unet_2d_condition.py  <-- Your implementation
│
├── gan_assignment/
│   ├── task_1_vanilla_gan/
│   │   ├── network.py                    <-- Your implementation
│   │   └── gan.py                        <-- Your implementation
│   └── task_2_dcgan/
│       ├── network.py                    <-- Your implementation
│       └── dcgan.py                      <-- Your implementation
│
└── {NAME}_{STUDENT_ID}.pdf               <-- Combined report
```
### Combined PDF Report

Write one single PDF report named `{NAME}_{STUDENT_ID}.pdf` that covers both questions. The report **must not exceed 5 pages** (excluding references). It should contain the following sections in order:

**Section 1 – DDPM (Question 1):**
- Task 1: Training loss curve, CD value, particle visualization, and 2–3 sentence analysis
- Task 2: 5 baseline SD images, 5 ControlNet results (condition + generated), and per-condition analysis

**Section 2 – GAN (Question 2):**
- Task 1: G and D loss curves, CD value, scatter plot of generated vs. real 2D points, and 2–3 sentence analysis
- Task 2: G and D loss curves, 4×8 generated MNIST grid, FID score, and 2–3 sentence analysis
### What NOT to Include

**Do NOT include in your zip:**
- Datasets or downloaded data folders (MNIST, Swiss-Roll, Fill50K, etc.)
- Model checkpoints (`.pth`, `.ckpt` files)
- Generated image folders
- Pretrained model weights (e.g., the Stable Diffusion checkpoint)

### Naming Convention

| Item       | Format                                                 |
|------------|--------------------------------------------------------|
| Zip file   | `{NAME}_{STUDENT_ID}.zip`, e.g. `JOHN_DOE_2024001.zip` |
| PDF report | `{NAME}_{STUDENT_ID}.pdf`, e.g. `JOHN_DOE_2024001.pdf` |
---

## Academic Integrity

You may consult the following reference papers while working on this assignment:

- Ho et al. (2020). *Denoising Diffusion Probabilistic Models.*
- Zhang et al. (2023). *Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet).*
- Goodfellow et al. (2014). *Generative Adversarial Networks.*
- Radford et al. (2015). *Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN).*

> It is strictly forbidden to copy, reformat, or directly reproduce code from online repositories or other students. All submitted code must be your own original implementation. Violations will result in a zero for the entire assignment.