AI Virtual Try-On System

Python 3.8+ | PyTorch 2.0+ | License: MIT

Model Description

A comprehensive deep learning-based virtual try-on system leveraging multi-modal feature fusion and Generative Adversarial Networks (GANs) to generate photorealistic garment transfer images. This implementation combines cloth-agnostic person representation, pose estimation, and human parsing for identity-preserving virtual try-on.

Key Features:

  • 41-channel multi-modal input (3 RGB + 18 pose + 20 parsing)
  • U-Net generator with self-attention (26.4M parameters)
  • Spectral-normalized PatchGAN discriminator (2.8M parameters)
  • Multi-component loss function (adversarial + perceptual + L1 + feature matching)
  • Identity-preserving garment transfer
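
The 41-channel input is a straightforward channel-wise concatenation of the three modalities. A minimal sketch (shapes and random placeholder arrays are illustrative; the repo's preprocessing produces the real tensors):

```python
import numpy as np

# Illustrative shapes at the training resolution (H=512, W=384)
H, W = 512, 384
agnostic_rgb = np.random.rand(3, H, W).astype(np.float32)    # cloth-agnostic person image
pose_heatmaps = np.random.rand(18, H, W).astype(np.float32)  # one Gaussian heatmap per joint
parsing_masks = np.random.rand(20, H, W).astype(np.float32)  # 20-class LIP segmentation

# Channel-wise concatenation yields the 41-channel generator input
fused = np.concatenate([agnostic_rgb, pose_heatmaps, parsing_masks], axis=0)
print(fused.shape)  # (41, 512, 384)
```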

Model Architecture

Person Image → [Human Parsing] → Cloth-Agnostic RGB (3 channels)
            → [Pose Estimation] → Gaussian Heatmaps (18 channels)
            → [LIP Segmentation] → Parsing Masks (20 channels)
                                        ↓
                            Multi-Modal Fusion (41 channels)
                                        ↓
                            U-Net Generator + Self-Attention
                                        ↓
                            Generated Try-On Image (3 RGB)
                                        ↓
                            PatchGAN Discriminator
                                        ↓
                                [Real/Fake Classification]

Generator Architecture:

  • Input: 41 channels (3 RGB + 18 pose heatmaps + 20 parsing masks)
  • Encoder: 4 downsampling stages (64→128→256→512 channels)
  • Bottleneck: 9 residual blocks + self-attention mechanism
  • Decoder: 4 upsampling stages with skip connections
  • Output: 3-channel RGB image (512×384 or 1024×768)
  • Normalization: Instance Normalization
  • Parameters: 26.4M
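
The bottleneck's self-attention can be sketched as a standard SAGAN-style block; the repo's exact layer may differ, and the feature-map size below assumes the 512×384 input downsampled 16×:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """SAGAN-style self-attention over spatial positions (a sketch, not the repo's exact layer)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)            # (B, HW, HW) attention over positions
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual connection

x = torch.randn(1, 512, 32, 24)  # hypothetical bottleneck feature map
y = SelfAttention(512)(x)
print(y.shape)  # torch.Size([1, 512, 32, 24])
```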

Discriminator Architecture:

  • Type: Spectral-Normalized PatchGAN
  • Input: 6 channels (3 real/fake + 3 condition)
  • Layers: 5 convolutional layers (64→128→256→512→1)
  • Receptive Field: 70×70 pixels
  • Parameters: 2.8M
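
A minimal sketch of the spectral-normalized 70×70 PatchGAN described above; kernel sizes, strides, and LeakyReLU slope follow the common pix2pix recipe and are assumptions, not confirmed repo details:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, stride):
    # 4x4 convolution wrapped in spectral normalization for training stability
    return spectral_norm(nn.Conv2d(in_ch, out_ch, 4, stride, 1))

# 64 -> 128 -> 256 -> 512 -> 1, conditioned on a 6-channel input
discriminator = nn.Sequential(
    sn_conv(6, 64, 2), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(64, 128, 2), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(128, 256, 2), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(256, 512, 1), nn.LeakyReLU(0.2, inplace=True),
    sn_conv(512, 1, 1),  # per-patch real/fake logits, not a single scalar
)

# Generated (or real) image concatenated with its 3-channel condition
pair = torch.randn(1, 6, 512, 384)
logits = discriminator(pair)
print(logits.shape)  # one logit per 70x70 receptive-field patch
```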

Performance

Metric        Value      Description
SSIM          0.6247     Structural similarity index
PSNR          15.23 dB   Peak signal-to-noise ratio
L1 Distance   0.1152     Pixel-level reconstruction error

Loss Components:

  • Adversarial Loss (LSGAN): 3.7% contribution
  • Perceptual Loss (VGG19): 57.9% contribution
  • L1 Reconstruction: 37.8% contribution
  • Feature Matching: 0.7% contribution
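
The four components combine into one weighted generator objective using the λ weights listed under Training Configuration. A sketch assuming LSGAN adversarial loss and L1 distances for the other terms (the repo's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

# Weights from the training configuration (lambda_adv=1.0, others=10.0)
LAMBDA = {"adv": 1.0, "per": 10.0, "l1": 10.0, "fm": 10.0}

def generator_loss(d_fake_logits, vgg_fake, vgg_real, fake, real, fm_fake, fm_real):
    """Weighted sum of adversarial, perceptual, L1, and feature-matching terms."""
    adv = F.mse_loss(d_fake_logits, torch.ones_like(d_fake_logits))  # LSGAN: push fakes toward 1
    per = F.l1_loss(vgg_fake, vgg_real)   # distance in VGG19 feature space
    l1 = F.l1_loss(fake, real)            # pixel-level reconstruction
    fm = F.l1_loss(fm_fake, fm_real)      # match intermediate discriminator features
    return (LAMBDA["adv"] * adv + LAMBDA["per"] * per
            + LAMBDA["l1"] * l1 + LAMBDA["fm"] * fm)

# Dummy tensors just to exercise the function; real inputs come from the networks
loss = generator_loss(torch.rand(1, 1, 62, 46),
                      torch.rand(1, 512, 32, 24), torch.rand(1, 512, 32, 24),
                      torch.rand(1, 3, 512, 384), torch.rand(1, 3, 512, 384),
                      torch.rand(1, 256, 64, 48), torch.rand(1, 256, 64, 48))
print(float(loss))
```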

Training Details

Dataset:

  • Name: VITON-HD (Zalando HD Resized)
  • Training Samples: 10,482 person-garment pairs
  • Test Samples: 2,032 pairs
  • Resolution: 768×1024 (original), 512×384 (training)

Training Configuration:

  • Framework: PyTorch 2.0+
  • Optimizer: Adam (Generator: 2e-4, Discriminator: 1e-4)
  • Epochs: 50 recommended (10 minimum)
  • Batch Size: 4-8 (GPU) / 1 (CPU)
  • Loss Weights: λ_adv=1.0, λ_per=10.0, λ_L1=10.0, λ_FM=10.0
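
The asymmetric learning rates above translate directly into two Adam optimizers. A sketch with placeholder modules; the betas are an assumption (the usual GAN choice of 0.5/0.999, not stated on this card):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the real generator/discriminator
generator = nn.Conv2d(41, 3, 3, padding=1)
discriminator = nn.Conv2d(6, 1, 3, padding=1)

# Generator learns twice as fast as the discriminator (2e-4 vs 1e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))
print(opt_g.param_groups[0]["lr"], opt_d.param_groups[0]["lr"])  # 0.0002 0.0001
```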

Hardware Requirements:

  • Minimum: CPU, 8GB RAM
  • Recommended: GPU (RTX 2070+), 8GB VRAM
  • Optimal: GPU (RTX 3090+), 16GB VRAM

Usage

Installation

git clone https://github.com/huzaifanasir95/AI-Virtual-TryOn.git
cd AI-Virtual-TryOn
pip install -r requirements.txt

Download Model

from huggingface_hub import hf_hub_download
import torch

# Download model checkpoint
model_path = hf_hub_download(
    repo_id="huzaifanasirrr/ai-virtual-tryon",
    filename="best_model.pth"
)

# Load checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(model_path, map_location=device, weights_only=False)  # full checkpoint dict, not just weights

Inference

from src.preprocessing import MultiChannelInputGenerator
from src.models import Generator
import torch
from PIL import Image

# Initialize generator
generator = Generator(input_channels=41, output_channels=3)
generator.load_state_dict(checkpoint['generator_state_dict'])
generator.eval()
generator.to(device)

# Prepare multi-modal input
input_generator = MultiChannelInputGenerator(target_size=(512, 384))
multi_channel_input = input_generator.generate_input(
    person_image_path="path/to/person.jpg",
    parsing_mask_path="path/to/parsing.png",
    keypoints_path="path/to/keypoints.json"
)

# Generate try-on image
with torch.no_grad():
    input_tensor = torch.from_numpy(multi_channel_input).unsqueeze(0).to(device)
    output = generator(input_tensor)
    
# Convert to image: map the generator's [-1, 1] output to [0, 255]
output_image = (output.squeeze(0).cpu().numpy().transpose(1, 2, 0) + 1) / 2
output_image = (output_image.clip(0, 1) * 255).astype('uint8')
Image.fromarray(output_image).save("tryon_result.jpg")

Dataset

VITON-HD Dataset Structure:

data/zalando-hd-resized/
├── train/
│   ├── image/              # Person images (11,647 files)
│   ├── cloth/              # Garment images (11,647 files)
│   ├── image-parse-v3/     # LIP parsing masks (20 classes)
│   ├── openpose_json/      # Body25 keypoints (25 joints)
│   ├── openpose_img/       # Pose visualizations
│   └── agnostic-v3.2/      # Cloth-agnostic representations
└── test/
    └── [same structure]

Modalities:

  • ✅ Person images (full-body RGB, 768×1024)
  • ✅ Garment images (isolated clothing items)
  • ✅ Human parsing masks (20-class LIP segmentation)
  • ✅ OpenPose keypoints (Body25 format, 25 keypoints)
  • ✅ Cloth-agnostic representations (pre-computed)
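
The pose modality is converted from keypoint coordinates to Gaussian heatmaps for the generator. A sketch of that rendering step; note the card's pose input uses 18 channels while the dataset ships Body25 (25-joint) keypoints, so an 18-joint subset and the sigma value are assumptions:

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, h=512, w=384, sigma=6.0):
    """Render one Gaussian heatmap per joint.

    keypoints: iterable of (x, y, confidence); joints with confidence 0 get an empty map.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(keypoints), h, w), dtype=np.float32)
    for i, (x, y, c) in enumerate(keypoints):
        if c > 0:
            # 2D Gaussian centered on the joint location
            maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps

# 18 joints, only the first marked visible for this toy example
kp = np.zeros((18, 3))
kp[0] = (192, 100, 1.0)  # (x, y, confidence)
hm = keypoints_to_heatmaps(kp)
print(hm.shape, hm[0].max())  # (18, 512, 384), peak of 1.0 at the joint
```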

Model Files

  • best_model.pth - Complete model checkpoint (457 MB)
  • training_curves.png - Training dynamics visualization
  • comparison_grid.png - Sample results (input/garment/output/ground truth)
  • loss_comparison.png - Loss component analysis
  • garment_extraction.png - Human parsing examples
  • pose_detailed_analysis.png - Pose estimation examples
  • preprocessing_config.yaml - Data preprocessing configuration
  • requirements.txt - Python dependencies
  • research_paper.pdf - Full research paper (Springer LNCS format)

Key Innovations

  1. Multi-Modal Feature Fusion: Combines RGB, pose heatmaps, and parsing masks into unified 41-channel representation
  2. Self-Attention Mechanism: Captures long-range spatial dependencies in bottleneck
  3. Spectral Normalization: Ensures discriminator training stability
  4. Multi-Component Loss: Balances adversarial, perceptual, reconstruction, and feature matching objectives
  5. Identity Preservation: Maintains person's face, hair, and body features while transferring garments

Limitations

  • Requires pre-computed human parsing and pose estimation
  • Performance depends on quality of input segmentation masks
  • Best results with frontal-facing person images
  • Garment transfer quality varies with pose complexity
  • Requires GPU for real-time inference

Citation

If you use this model in your research, please cite:

@article{nasir2025virtualtryon,
  title={Deep Learning-Based Virtual Try-On System Using Multi-Modal Feature Fusion and Generative Adversarial Networks},
  author={Nasir, Huzaifa},
  year={2025},
  publisher={Zenodo},
  doi={10.5281/zenodo.18058537},
  url={https://doi.org/10.5281/zenodo.18058537},
  note={Hugging Face: https://huggingface.co/huzaifanasirrr/ai-virtual-tryon}
}

VITON-HD Dataset:

@inproceedings{choi2021viton,
  title={VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization},
  author={Choi, Seunghwan and Park, Sunghyun and Lee, Minsoo and Choo, Jaegul},
  booktitle={CVPR},
  year={2021}
}

Author

Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
📧 nasirhuzaifa95@gmail.com
🔗 GitHub Repository
📄 Research Paper (Zenodo)

License

MIT License - See LICENSE file for details.

Acknowledgments

This project builds upon research conducted at FAST-NUCES Islamabad. Special thanks to the open-source community for PyTorch and related libraries.
