---
license: creativeml-openrail-m
language:
- en
metrics:
- accuracy
pipeline_tag: image-to-image
---

# Project Chronicle: A Journey into Virtual Try-On with Diffusion Models

This document outlines the development journey of this project, which aims to implement the "TryOnDiffusion: A Tale of Two UNets" paper. It serves as a log of the learning process, implementation steps, challenges faced, and future goals.

## Tech Stack

[Model on Hugging Face](https://huggingface.co/Aditya757864/TRY_ON)

---

## Phase 1: Foundational Learning (The Groundwork)

* **Core Concepts:** Started with the fundamentals of **Computer Vision** and the **PyTorch** framework.
* **Generative Adversarial Networks (GANs):** Implemented and trained a **POKEGAN** to gain practical experience with generative models.
* **Introduction to Diffusion Models:** Shifted focus to diffusion models, training a **Denoising Diffusion Probabilistic Model (DDPM)** on the Fashion MNIST dataset (28x28 images) using an NVIDIA RTX 3090.
* **Data Pipeline Mastery:** Revisited PyTorch's `DataLoader` and custom data-handling pipelines in depth.
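
DDPM training hinges on the closed-form forward (noising) process, `x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps`. A minimal NumPy sketch of that step (the schedule length and 28x28 size mirror the Fashion MNIST setup; all names here are illustrative, not this project's actual code):

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule, as in the original DDPM paper."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alphas_cumprod, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    abar_t = alphas_cumprod[t]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * noise

betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t, strictly decreasing in t

# A dummy 28x28 image and Gaussian noise stand in for a real sample.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((28, 28))
noise = rng.standard_normal((28, 28))

x_noisy = q_sample(x0, t=500, alphas_cumprod=alphas_cumprod, noise=noise)
```

During training, the UNet is then asked to predict `noise` from `x_noisy` and `t`.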

---

## Phase 2: Advanced Concepts & Paper Selection (Scaling Up)

* **Advanced Architectures:** Studied **Transformers** and the **attention** mechanism to understand how models capture long-range dependencies.
* **Modulation Techniques:** Explored conditioning techniques such as **Feature-wise Linear Modulation (FiLM)** for steering generative models.
* **Research & Direction:** After a thorough literature review, the **"TryOnDiffusion: A Tale of Two UNets"** paper was selected as the primary research goal for this project.
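
FiLM conditions a network by predicting a per-channel scale `gamma` and shift `beta` from the conditioning signal, then applying `gamma * x + beta` feature-wise. A framework-agnostic NumPy sketch (the dimensions and the linear projection are illustrative assumptions; in a real model `gamma`/`beta` come from a learned sub-network fed with e.g. a timestep or garment embedding):

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.
    features: (C, H, W); gamma, beta: (C,)."""
    return gamma[:, None, None] * features + beta[:, None, None]

cond_dim, channels = 16, 8

# A single linear layer maps the conditioning vector to (gamma, beta).
W = rng.standard_normal((cond_dim, 2 * channels)) * 0.01
cond = rng.standard_normal(cond_dim)
gamma_beta = cond @ W
# Initialize gamma near 1 so FiLM starts close to the identity map.
gamma, beta = 1.0 + gamma_beta[:channels], gamma_beta[channels:]

x = rng.standard_normal((channels, 32, 32))
y = film(x, gamma, beta)
```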

---

## Phase 3: Implementation, Training, and Debugging (Getting Hands-On)

* **Codebase Adaptation:** Forked and analyzed an open-source implementation by **fashnAI** as a starting point.
* **Custom Development:**
  * Engineered a **custom data mapper and `DataLoader`** to process the HR-VITON dataset.
  * Wrote a **custom trainer script** tailored to the model's specific needs, for finer control over the training loop.
* **Technical Challenges:** Debugged and resolved several breaking changes caused by library updates in the original repository.
* **Model Training:**
  * Trained on a subset of the **HR-VITON dataset (500 images)**.
  * Used an **NVIDIA RTX 4090 (24GB)** for the computationally intensive training runs.
  * Tracked metrics, losses, and logs with **Weights & Biases (`wandb`)**.
* **Evaluation:** Created a **sampling script** that generates images from checkpoints to qualitatively assess model performance.
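
A custom trainer for a setup like this follows the same skeleton regardless of model size: draw a batch, compute the loss, step the optimizer, log to `wandb`, checkpoint periodically. A toy NumPy sketch of that structure, with a linear model and synthetic data standing in for the UNet and the HR-VITON pipeline, and `wandb`/checkpoint calls shown as comments (all names here are illustrative, not this repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for (input, target-noise) pairs produced by the data mapper.
X = rng.standard_normal((500, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
Y = X @ true_w + 0.01 * rng.standard_normal(500)

w = np.zeros(4)               # stand-in for the model parameters
lr, batch_size = 0.1, 32
losses = []

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
    xb, yb = X[idx], Y[idx]
    pred = xb @ w
    loss = np.mean((pred - yb) ** 2)                 # MSE, as in noise prediction
    grad = 2.0 * xb.T @ (pred - yb) / batch_size
    w -= lr * grad                                   # plain SGD update
    losses.append(loss)
    # wandb.log({"train/loss": loss, "step": step})  # metric logging
    # if step % 50 == 0: save_checkpoint(w, step)    # hypothetical helper
```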

---

## Phase 4: The Plateau & The Path Forward (Current Status)

> **Current Challenge:** The model's loss has **stagnated**. This suggests the model is no longer learning, likely due to overfitting on the small dataset or a subtle issue in the data pipeline.

### Visual Analysis

*Sample model output after 2000 epochs.*

| Original Input | Input Features | Generated Output |
| ----- | ----- | ----- |
| <img src="./original.png" alt="Original Input Image" width="300"> | <img src="./imputs.png" alt="Input Features Image" width="80"> | <img src="./our_output.png" alt="Generated Output Image"> |

*W&B loss curve, clearly illustrating the training plateau.*

* **Immediate Goals:**
  1. **Debug the training process:** Perform sanity checks, such as overfitting on a single batch, to verify the model's capacity to learn.
  2. **Verify the data pipeline:** Visualize the inputs (warped clothes, agnostic masks, pose maps) fed to the model to confirm they are correct.
  3. **Investigate the loss function:** The current pixel-wise loss (L1 or L2) may not be optimal. Experiment with alternatives such as a perceptual loss (LPIPS, Learned Perceptual Image Patch Similarity) to better capture visual similarity.
  4. **Tune hyperparameters:** Experiment with the learning rate and other key hyperparameters.
* **Long-Term Vision:** Resolve the training plateau, scale training up to a larger dataset, and replicate the results of the TryOnDiffusion paper.
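
The single-batch sanity check in goal 1 is simple to express: repeatedly train on one fixed batch and confirm the loss drops toward zero; if it cannot memorize even that, the model or pipeline is broken. A toy NumPy version with a linear model standing in for the network (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# One fixed batch. Targets are made exactly realizable so the loss can
# reach zero; a healthy training loop should drive it there.
xb = rng.standard_normal((8, 4))
w_true = rng.standard_normal(4)
yb = xb @ w_true

w = np.zeros(4)                  # stand-in for the model parameters
lr = 0.05
first_loss = last_loss = None

for step in range(3000):
    pred = xb @ w
    last_loss = np.mean((pred - yb) ** 2)
    if first_loss is None:
        first_loss = last_loss   # record the starting loss for comparison
    grad = 2.0 * xb.T @ (pred - yb) / len(xb)
    w -= lr * grad               # plain gradient descent on the one batch
```

If the loss plateaus even here, the fault is in the model or optimizer setup, not the dataset size.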