AI Virtual Try-On System
Model Description
A comprehensive deep learning-based virtual try-on system leveraging multi-modal feature fusion and Generative Adversarial Networks (GANs) to generate photorealistic garment transfer images. This implementation combines cloth-agnostic person representation, pose estimation, and human parsing for identity-preserving virtual try-on.
Key Features:
- 41-channel multi-modal input (3 RGB + 18 pose + 20 parsing)
- U-Net generator with self-attention (26.4M parameters)
- Spectral-normalized PatchGAN discriminator (2.8M parameters)
- Multi-component loss function (adversarial + perceptual + L1 + feature matching)
- Identity-preserving garment transfer
Model Architecture
Person Image → [Human Parsing]    → Cloth-Agnostic RGB (3 channels)
             → [Pose Estimation]  → Gaussian Heatmaps (18 channels)
             → [LIP Segmentation] → Parsing Masks (20 channels)
                      ↓
        Multi-Modal Fusion (41 channels)
                      ↓
        U-Net Generator + Self-Attention
                      ↓
        Generated Try-On Image (3 RGB)
                      ↓
        PatchGAN Discriminator
                      ↓
        [Real/Fake Classification]
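The fusion stage is a channel-wise concatenation of the three modalities. A minimal sketch, assuming all inputs have already been resized to the same spatial resolution (the tensor contents here are random placeholders):

```python
import torch

# The three modalities, shaped (batch, channels, H, W)
rgb = torch.randn(1, 3, 512, 384)       # cloth-agnostic person RGB
pose = torch.randn(1, 18, 512, 384)     # Gaussian pose heatmaps
parsing = torch.randn(1, 20, 512, 384)  # LIP parsing masks, one channel per class

# Channel-wise fusion into the 41-channel generator input
fused = torch.cat([rgb, pose, parsing], dim=1)
print(fused.shape)  # torch.Size([1, 41, 512, 384])
```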
Generator Architecture:
- Input: 41 channels (3 RGB + 18 pose heatmaps + 20 parsing masks)
- Encoder: 4 downsampling stages (64→128→256→512 channels)
- Bottleneck: 9 residual blocks + self-attention mechanism
- Decoder: 4 upsampling stages with skip connections
- Output: 3-channel RGB image (512×384 or 1024×768)
- Normalization: Instance Normalization
- Parameters: 26.4M
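A minimal PyTorch sketch of this encoder-bottleneck-decoder layout. Layer widths and block counts follow the list above; the SAGAN-style attention block and the exact skip-connection wiring are plausible assumptions, and the real `Generator` in `src.models` may differ in detail:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """SAGAN-style self-attention over spatial positions (assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)  # (B, HW, C//8)
        k = self.key(x).view(b, -1, h * w)                     # (B, C//8, HW)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # (B, HW, HW)
        v = self.value(x).view(b, -1, h * w)                   # (B, C, HW)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=41, out_ch=3):
        super().__init__()
        # Encoder: 4 stride-2 downsampling stages (64->128->256->512)
        self.downs = nn.ModuleList()
        prev = in_ch
        for c in (64, 128, 256, 512):
            self.downs.append(nn.Sequential(
                nn.Conv2d(prev, c, 4, stride=2, padding=1),
                nn.InstanceNorm2d(c), nn.ReLU(inplace=True)))
            prev = c
        # Bottleneck: 9 residual blocks + self-attention
        self.bottleneck = nn.Sequential(
            *[ResBlock(512) for _ in range(9)], SelfAttention(512))
        # Decoder: 4 upsampling stages; input widths double via skip concat
        self.ups = nn.ModuleList()
        for c_in, c_out in [(1024, 256), (512, 128), (256, 64), (128, 64)]:
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True)))
        self.final = nn.Sequential(nn.Conv2d(64, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        x = self.bottleneck(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(torch.cat([x, skip], dim=1))  # U-Net skip connection
        return self.final(x)
```

The `Tanh` output matches the `[-1, 1] → [0, 255]` conversion in the inference example below.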
Discriminator Architecture:
- Type: Spectral-Normalized PatchGAN
- Input: 6 channels (3 real/fake + 3 condition)
- Layers: 5 convolutional layers (64→128→256→512→1)
- Receptive Field: 70×70 pixels
- Parameters: 2.8M
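A sketch of this discriminator, mirroring the standard 70×70 PatchGAN from pix2pix with `torch.nn.utils.spectral_norm` wrapped around each convolution. Layer widths follow the list above; the exact module names in this repository may differ:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN with spectral normalization."""
    def __init__(self, in_ch=6):  # 3 real/fake image + 3 condition channels
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                spectral_norm(nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1)),
                nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(
            block(in_ch, 64, 2),
            block(64, 128, 2),
            block(128, 256, 2),
            block(256, 512, 1),  # stride 1 keeps the receptive field at 70x70
            spectral_norm(nn.Conv2d(512, 1, 4, stride=1, padding=1)))

    def forward(self, image, condition):
        # Each output element classifies one 70x70 input patch as real or fake
        return self.net(torch.cat([image, condition], dim=1))
```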
Performance
| Metric | Value | Description |
|---|---|---|
| SSIM | 0.6247 | Structural similarity index |
| PSNR | 15.23 dB | Peak signal-to-noise ratio |
| L1 Distance | 0.1152 | Pixel-level reconstruction error |
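These metrics can be reproduced with standard implementations. A minimal sketch using scikit-image, assuming `pred` and `target` are aligned H×W×3 uint8 arrays (this model's own evaluation script is not shown here):

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute SSIM, PSNR (dB), and mean L1 distance for one image pair."""
    ssim = structural_similarity(target, pred, channel_axis=2)
    psnr = peak_signal_noise_ratio(target, pred)
    # L1 on [0, 1]-normalized pixels, matching the scale of the table above
    l1 = np.abs(target.astype(np.float32) / 255 - pred.astype(np.float32) / 255).mean()
    return {"ssim": ssim, "psnr": psnr, "l1": l1}
```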
Loss Components:
- Adversarial Loss (LSGAN): 3.7% contribution
- Perceptual Loss (VGG19): 57.9% contribution
- L1 Reconstruction: 37.8% contribution
- Feature Matching: 0.7% contribution
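A hedged sketch of how these terms can combine into the generator objective, using the λ weights listed under Training Details below. The VGG19 feature lists and discriminator activations are assumed to be precomputed by the caller; the helper names are illustrative, not this repository's API:

```python
import torch
import torch.nn.functional as F

# Loss weights from the training configuration below
LAMBDA_ADV, LAMBDA_PER, LAMBDA_L1, LAMBDA_FM = 1.0, 10.0, 10.0, 10.0

def generator_loss(fake, real, disc_fake_logits, fake_feats, real_feats,
                   vgg_fake, vgg_real):
    # LSGAN adversarial term: push discriminator outputs on fakes toward 1
    adv = F.mse_loss(disc_fake_logits, torch.ones_like(disc_fake_logits))
    # Perceptual term: L1 distance between VGG19 feature maps
    per = sum(F.l1_loss(f, r) for f, r in zip(vgg_fake, vgg_real))
    # Pixel-level L1 reconstruction
    l1 = F.l1_loss(fake, real)
    # Feature matching: match the discriminator's intermediate activations
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    return (LAMBDA_ADV * adv + LAMBDA_PER * per
            + LAMBDA_L1 * l1 + LAMBDA_FM * fm)
```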
Training Details
Dataset:
- Name: VITON-HD (Zalando HD Resized)
- Training Samples: 10,482 person-garment pairs
- Test Samples: 2,032 pairs
- Resolution: 768×1024 (original), 512×384 (training)
Training Configuration:
- Framework: PyTorch 2.0+
- Optimizer: Adam (Generator: 2e-4, Discriminator: 1e-4)
- Epochs: 50 recommended (10 minimum)
- Batch Size: 4-8 (GPU) / 1 (CPU)
- Loss Weights: λ_adv=1.0, λ_per=10.0, λ_L1=10.0, λ_FM=10.0
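A sketch of the corresponding alternating update, assuming `generator`, `discriminator`, and `train_loader` are already constructed. The `betas=(0.5, 0.999)` choice follows common GAN practice and is an assumption, as is conditioning the discriminator on the cloth-agnostic RGB channels:

```python
import torch
import torch.nn.functional as F

# Learning rates from the configuration above
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optim = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))

for inputs, real in train_loader:   # (41-channel input, ground-truth image)
    condition = inputs[:, :3]       # cloth-agnostic RGB as condition (assumption)
    fake = generator(inputs)

    # Discriminator step (LSGAN targets: real -> 1, fake -> 0)
    d_optim.zero_grad()
    d_real = discriminator(real, condition)
    d_fake = discriminator(fake.detach(), condition)
    d_loss = 0.5 * (F.mse_loss(d_real, torch.ones_like(d_real)) +
                    F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    d_optim.step()

    # Generator step; shown with the adversarial term only, the full
    # weighted objective is sketched under "Loss Components" above
    g_optim.zero_grad()
    g_out = discriminator(fake, condition)
    g_loss = F.mse_loss(g_out, torch.ones_like(g_out))
    g_loss.backward()
    g_optim.step()
```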
Hardware Requirements:
- Minimum: CPU, 8GB RAM
- Recommended: GPU (RTX 2070+), 8GB VRAM
- Optimal: GPU (RTX 3090+), 16GB VRAM
Usage
Installation
git clone https://github.com/huzaifanasir95/AI-Virtual-TryOn.git
cd AI-Virtual-TryOn
pip install -r requirements.txt
Download Model
from huggingface_hub import hf_hub_download
import torch

# Download model checkpoint
model_path = hf_hub_download(
    repo_id="huzaifanasirrr/ai-virtual-tryon",
    filename="best_model.pth"
)

# Load checkpoint on GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(model_path, map_location=device)
Inference
from src.preprocessing import MultiChannelInputGenerator
from src.models import Generator
import torch
from PIL import Image

# Initialize generator and load trained weights
# (uses `checkpoint` and `device` from the download step above)
generator = Generator(input_channels=41, output_channels=3)
generator.load_state_dict(checkpoint['generator_state_dict'])
generator.eval()
generator.to(device)

# Prepare multi-modal input (41 channels: RGB + pose + parsing)
input_generator = MultiChannelInputGenerator(target_size=(512, 384))
multi_channel_input = input_generator.generate_input(
    person_image_path="path/to/person.jpg",
    parsing_mask_path="path/to/parsing.png",
    keypoints_path="path/to/keypoints.json"
)

# Generate try-on image
with torch.no_grad():
    input_tensor = torch.from_numpy(multi_channel_input).unsqueeze(0).to(device)
    output = generator(input_tensor)

# Convert from the generator's [-1, 1] range to an 8-bit image and save
output_image = (output.squeeze(0).cpu().numpy().transpose(1, 2, 0) + 1) / 2
output_image = (output_image * 255).astype('uint8')
Image.fromarray(output_image).save("tryon_result.jpg")
Dataset
VITON-HD Dataset Structure:
data/zalando-hd-resized/
├── train/
│   ├── image/            # Person images (11,647 files)
│   ├── cloth/            # Garment images (11,647 files)
│   ├── image-parse-v3/   # LIP parsing masks (20 classes)
│   ├── openpose_json/    # Body25 keypoints (25 joints)
│   ├── openpose_img/     # Pose visualizations
│   └── agnostic-v3.2/    # Cloth-agnostic representations
└── test/
    └── [same structure]
Modalities:
- ✅ Person images (full-body RGB, 768×1024)
- ✅ Garment images (isolated clothing items)
- ✅ Human parsing masks (20-class LIP segmentation)
- ✅ OpenPose keypoints (Body25 format, 25 keypoints)
- ✅ Cloth-agnostic representations (pre-computed)
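Note that the dataset ships Body25 keypoints (25 joints) while the model consumes 18 pose channels, so a conversion step is implied. A hedged sketch of rendering one Gaussian heatmap per joint; which 18 joints are kept and the σ value are assumptions not specified in this card:

```python
import json
import numpy as np

def keypoints_to_heatmaps(json_path, height=512, width=384, sigma=6.0,
                          num_joints=18):
    """Render one Gaussian heatmap per OpenPose keypoint."""
    with open(json_path) as f:
        data = json.load(f)
    # OpenPose JSON stores a flat list: [x0, y0, conf0, x1, y1, conf1, ...]
    kps = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_joints, height, width), dtype=np.float32)
    # Keep the first 18 joints (assumption); rescale coordinates first if the
    # heatmap size differs from the source image resolution
    for i, (x, y, conf) in enumerate(kps[:num_joints]):
        if conf > 0:  # skip undetected joints
            heatmaps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2)
                                 / (2 * sigma ** 2))
    return heatmaps
```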
Model Files
- `best_model.pth` - Complete model checkpoint (457 MB)
- `training_curves.png` - Training dynamics visualization
- `comparison_grid.png` - Sample results (input/garment/output/ground truth)
- `loss_comparison.png` - Loss component analysis
- `garment_extraction.png` - Human parsing examples
- `pose_detailed_analysis.png` - Pose estimation examples
- `preprocessing_config.yaml` - Data preprocessing configuration
- `requirements.txt` - Python dependencies
- `research_paper.pdf` - Full research paper (Springer LNCS format)
Key Innovations
- Multi-Modal Feature Fusion: Combines RGB, pose heatmaps, and parsing masks into unified 41-channel representation
- Self-Attention Mechanism: Captures long-range spatial dependencies in bottleneck
- Spectral Normalization: Ensures discriminator training stability
- Multi-Component Loss: Balances adversarial, perceptual, reconstruction, and feature matching objectives
- Identity Preservation: Maintains person's face, hair, and body features while transferring garments
Limitations
- Requires pre-computed human parsing and pose estimation
- Performance depends on quality of input segmentation masks
- Best results with frontal-facing person images
- Garment transfer quality varies with pose complexity
- Requires GPU for real-time inference
Citation
If you use this model in your research, please cite:
@article{nasir2025virtualtryon,
  title={Deep Learning-Based Virtual Try-On System Using Multi-Modal Feature Fusion and Generative Adversarial Networks},
  author={Nasir, Huzaifa},
  year={2025},
  publisher={Zenodo},
  doi={10.5281/zenodo.18058537},
  url={https://doi.org/10.5281/zenodo.18058537},
  note={Hugging Face: https://huggingface.co/huzaifanasirrr/ai-virtual-tryon}
}
VITON-HD Dataset:
@inproceedings{choi2021viton,
  title={VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization},
  author={Choi, Seunghwan and Park, Sunghyun and Lee, Minsoo and Choo, Jaegul},
  booktitle={CVPR},
  year={2021}
}
Author
Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
Email: nasirhuzaifa95@gmail.com
GitHub: https://github.com/huzaifanasir95/AI-Virtual-TryOn
Research Paper (Zenodo): https://doi.org/10.5281/zenodo.18058537
License
MIT License - See LICENSE file for details.
Acknowledgments
This project builds upon:
- VITON-HD: Dataset and baseline methods (Choi et al., 2021)
- OpenPose: Pose estimation (Cao et al., 2019)
- LIP Dataset: Human parsing (Gong et al., 2017)
- Pix2Pix: Image-to-image translation framework (Isola et al., 2017)
Research conducted at FAST-NUCES Islamabad. Special thanks to the open-source community for PyTorch and related libraries.