AI Virtual Try-On System
Model Description
A comprehensive deep learning-based virtual try-on system leveraging multi-modal feature fusion and Generative Adversarial Networks (GANs) to generate photorealistic garment transfer images. This implementation combines cloth-agnostic person representation, pose estimation, and human parsing for identity-preserving virtual try-on.
Key Features:
- 41-channel multi-modal input (3 RGB + 18 pose + 20 parsing)
- U-Net generator with self-attention (26.4M parameters)
- Spectral-normalized PatchGAN discriminator (2.8M parameters)
- Multi-component loss function (adversarial + perceptual + L1 + feature matching)
- Identity-preserving garment transfer
Model Architecture
Person Image → [Human Parsing]    → Cloth-Agnostic RGB (3 channels)
             → [Pose Estimation]  → Gaussian Heatmaps (18 channels)
             → [LIP Segmentation] → Parsing Masks (20 channels)
                      ↓
        Multi-Modal Fusion (41 channels)
                      ↓
        U-Net Generator + Self-Attention
                      ↓
        Generated Try-On Image (3 RGB)
                      ↓
        PatchGAN Discriminator
                      ↓
        [Real/Fake Classification]
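The fusion stage is a channel-wise concatenation of the three modalities. A minimal sketch, assuming all inputs have already been resized to the same spatial resolution (the tensor contents here are random placeholders):

```python
import torch

# The three modalities, shaped (batch, channels, H, W)
rgb = torch.randn(1, 3, 512, 384)       # cloth-agnostic person RGB
pose = torch.randn(1, 18, 512, 384)     # Gaussian pose heatmaps
parsing = torch.randn(1, 20, 512, 384)  # LIP parsing masks, one channel per class

# Channel-wise fusion into the 41-channel generator input
fused = torch.cat([rgb, pose, parsing], dim=1)
print(fused.shape)  # torch.Size([1, 41, 512, 384])
```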
Generator Architecture:
- Input: 41 channels (3 RGB + 18 pose heatmaps + 20 parsing masks)
- Encoder: 4 downsampling stages (64→128→256→512 channels)
- Bottleneck: 9 residual blocks + self-attention mechanism
- Decoder: 4 upsampling stages with skip connections
- Output: 3-channel RGB image (512×384 or 1024×768)
- Normalization: Instance Normalization
- Parameters: 26.4M
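A minimal PyTorch sketch of this encoder-bottleneck-decoder layout. Layer widths and block counts follow the list above; the SAGAN-style attention block and the exact skip-connection wiring are plausible assumptions, and the real `Generator` in `src.models` may differ in detail:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """SAGAN-style self-attention over spatial positions (assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)  # (B, HW, C//8)
        k = self.key(x).view(b, -1, h * w)                     # (B, C//8, HW)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)          # (B, HW, HW)
        v = self.value(x).view(b, -1, h * w)                   # (B, C, HW)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

class UNetGenerator(nn.Module):
    def __init__(self, in_ch=41, out_ch=3):
        super().__init__()
        # Encoder: 4 stride-2 downsampling stages (64->128->256->512)
        self.downs = nn.ModuleList()
        prev = in_ch
        for c in (64, 128, 256, 512):
            self.downs.append(nn.Sequential(
                nn.Conv2d(prev, c, 4, stride=2, padding=1),
                nn.InstanceNorm2d(c), nn.ReLU(inplace=True)))
            prev = c
        # Bottleneck: 9 residual blocks + self-attention
        self.bottleneck = nn.Sequential(
            *[ResBlock(512) for _ in range(9)], SelfAttention(512))
        # Decoder: 4 upsampling stages; input widths double via skip concat
        self.ups = nn.ModuleList()
        for c_in, c_out in [(1024, 256), (512, 128), (256, 64), (128, 64)]:
            self.ups.append(nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                nn.InstanceNorm2d(c_out), nn.ReLU(inplace=True)))
        self.final = nn.Sequential(nn.Conv2d(64, out_ch, 3, padding=1), nn.Tanh())

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        x = self.bottleneck(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(torch.cat([x, skip], dim=1))  # U-Net skip connection
        return self.final(x)
```

The `Tanh` output matches the `[-1, 1] → [0, 255]` conversion in the inference example below.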
Discriminator Architecture:
- Type: Spectral-Normalized PatchGAN
- Input: 6 channels (3 real/fake + 3 condition)
- Layers: 5 convolutional layers (64→128→256→512→1)
- Receptive Field: 70×70 pixels
- Parameters: 2.8M
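A sketch of this discriminator, mirroring the standard 70×70 PatchGAN from pix2pix with `torch.nn.utils.spectral_norm` wrapped around each convolution. Layer widths follow the list above; the exact module names in this repository may differ:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN with spectral normalization."""
    def __init__(self, in_ch=6):  # 3 real/fake image + 3 condition channels
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                spectral_norm(nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1)),
                nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(
            block(in_ch, 64, 2),
            block(64, 128, 2),
            block(128, 256, 2),
            block(256, 512, 1),  # stride 1 keeps the receptive field at 70x70
            spectral_norm(nn.Conv2d(512, 1, 4, stride=1, padding=1)))

    def forward(self, image, condition):
        # Each output element classifies one 70x70 input patch as real or fake
        return self.net(torch.cat([image, condition], dim=1))
```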
Performance
| Metric | Value | Description |
|---|---|---|
| SSIM | 0.6247 | Structural similarity index |
| PSNR | 15.23 dB | Peak signal-to-noise ratio |
| L1 Distance | 0.1152 | Pixel-level reconstruction error |
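These metrics can be reproduced with standard implementations. A minimal sketch using scikit-image, assuming `pred` and `target` are aligned H×W×3 uint8 arrays (this model's own evaluation script is not shown here):

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute SSIM, PSNR (dB), and mean L1 distance for one image pair."""
    ssim = structural_similarity(target, pred, channel_axis=2)
    psnr = peak_signal_noise_ratio(target, pred)
    # L1 on [0, 1]-normalized pixels, matching the scale of the table above
    l1 = np.abs(target.astype(np.float32) / 255 - pred.astype(np.float32) / 255).mean()
    return {"ssim": ssim, "psnr": psnr, "l1": l1}
```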
Loss Components:
- Adversarial Loss (LSGAN): 3.7% contribution
- Perceptual Loss (VGG19): 57.9% contribution
- L1 Reconstruction: 37.8% contribution
- Feature Matching: 0.7% contribution
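A hedged sketch of how these terms can combine into the generator objective, using the λ weights listed under Training Details below. The VGG19 feature lists and discriminator activations are assumed to be precomputed by the caller; the helper names are illustrative, not this repository's API:

```python
import torch
import torch.nn.functional as F

# Loss weights from the training configuration below
LAMBDA_ADV, LAMBDA_PER, LAMBDA_L1, LAMBDA_FM = 1.0, 10.0, 10.0, 10.0

def generator_loss(fake, real, disc_fake_logits, fake_feats, real_feats,
                   vgg_fake, vgg_real):
    # LSGAN adversarial term: push discriminator outputs on fakes toward 1
    adv = F.mse_loss(disc_fake_logits, torch.ones_like(disc_fake_logits))
    # Perceptual term: L1 distance between VGG19 feature maps
    per = sum(F.l1_loss(f, r) for f, r in zip(vgg_fake, vgg_real))
    # Pixel-level L1 reconstruction
    l1 = F.l1_loss(fake, real)
    # Feature matching: match the discriminator's intermediate activations
    fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(fake_feats, real_feats))
    return (LAMBDA_ADV * adv + LAMBDA_PER * per
            + LAMBDA_L1 * l1 + LAMBDA_FM * fm)
```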
Training Details
Dataset:
- Name: VITON-HD (Zalando HD Resized)
- Training Samples: 10,482 person-garment pairs
- Test Samples: 2,032 pairs
- Resolution: 768×1024 (original), 512×384 (training)
Training Configuration:
- Framework: PyTorch 2.0+
- Optimizer: Adam (Generator: 2e-4, Discriminator: 1e-4)
- Epochs: 50 recommended (10 minimum)
- Batch Size: 4-8 (GPU) / 1 (CPU)
- Loss Weights: λ_adv=1.0, λ_per=10.0, λ_L1=10.0, λ_FM=10.0
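A sketch of the corresponding alternating update, assuming `generator`, `discriminator`, and `train_loader` are already constructed. The `betas=(0.5, 0.999)` choice follows common GAN practice and is an assumption, as is conditioning the discriminator on the cloth-agnostic RGB channels:

```python
import torch
import torch.nn.functional as F

# Learning rates from the configuration above
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_optim = torch.optim.Adam(discriminator.parameters(), lr=1e-4, betas=(0.5, 0.999))

for inputs, real in train_loader:   # (41-channel input, ground-truth image)
    condition = inputs[:, :3]       # cloth-agnostic RGB as condition (assumption)
    fake = generator(inputs)

    # Discriminator step (LSGAN targets: real -> 1, fake -> 0)
    d_optim.zero_grad()
    d_real = discriminator(real, condition)
    d_fake = discriminator(fake.detach(), condition)
    d_loss = 0.5 * (F.mse_loss(d_real, torch.ones_like(d_real)) +
                    F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    d_loss.backward()
    d_optim.step()

    # Generator step; shown with the adversarial term only, the full
    # weighted objective is sketched under "Loss Components" above
    g_optim.zero_grad()
    g_out = discriminator(fake, condition)
    g_loss = F.mse_loss(g_out, torch.ones_like(g_out))
    g_loss.backward()
    g_optim.step()
```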
Hardware Requirements:
- Minimum: CPU, 8GB RAM
- Recommended: GPU (RTX 2070+), 8GB VRAM
- Optimal: GPU (RTX 3090+), 16GB VRAM
Usage
Installation
git clone https://github.com/huzaifanasir95/AI-Virtual-TryOn.git
cd AI-Virtual-TryOn
pip install -r requirements.txt
Download Model
from huggingface_hub import hf_hub_download
import torch

# Download model checkpoint
model_path = hf_hub_download(
    repo_id="huzaifanasirrr/ai-virtual-tryon",
    filename="best_model.pth"
)

# Load checkpoint on GPU if available, otherwise CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(model_path, map_location=device)
Inference
from src.preprocessing import MultiChannelInputGenerator
from src.models import Generator
import torch
from PIL import Image

# Initialize generator and load trained weights
# (uses `checkpoint` and `device` from the download step above)
generator = Generator(input_channels=41, output_channels=3)
generator.load_state_dict(checkpoint['generator_state_dict'])
generator.eval()
generator.to(device)

# Prepare multi-modal input (41 channels: RGB + pose + parsing)
input_generator = MultiChannelInputGenerator(target_size=(512, 384))
multi_channel_input = input_generator.generate_input(
    person_image_path="path/to/person.jpg",
    parsing_mask_path="path/to/parsing.png",
    keypoints_path="path/to/keypoints.json"
)

# Generate try-on image
with torch.no_grad():
    input_tensor = torch.from_numpy(multi_channel_input).unsqueeze(0).to(device)
    output = generator(input_tensor)

# Convert from the generator's [-1, 1] range to an 8-bit image and save
output_image = (output.squeeze(0).cpu().numpy().transpose(1, 2, 0) + 1) / 2
output_image = (output_image * 255).astype('uint8')
Image.fromarray(output_image).save("tryon_result.jpg")
Dataset
VITON-HD Dataset Structure:
data/zalando-hd-resized/
├── train/
│   ├── image/            # Person images (11,647 files)
│   ├── cloth/            # Garment images (11,647 files)
│   ├── image-parse-v3/   # LIP parsing masks (20 classes)
│   ├── openpose_json/    # Body25 keypoints (25 joints)
│   ├── openpose_img/     # Pose visualizations
│   └── agnostic-v3.2/    # Cloth-agnostic representations
└── test/
    └── [same structure]
Modalities:
- ✅ Person images (full-body RGB, 768×1024)
- ✅ Garment images (isolated clothing items)
- ✅ Human parsing masks (20-class LIP segmentation)
- ✅ OpenPose keypoints (Body25 format, 25 keypoints)
- ✅ Cloth-agnostic representations (pre-computed)
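Note that the dataset ships Body25 keypoints (25 joints) while the model consumes 18 pose channels, so a conversion step is implied. A hedged sketch of rendering one Gaussian heatmap per joint; which 18 joints are kept and the σ value are assumptions not specified in this card:

```python
import json
import numpy as np

def keypoints_to_heatmaps(json_path, height=512, width=384, sigma=6.0,
                          num_joints=18):
    """Render one Gaussian heatmap per OpenPose keypoint."""
    with open(json_path) as f:
        data = json.load(f)
    # OpenPose JSON stores a flat list: [x0, y0, conf0, x1, y1, conf1, ...]
    kps = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_joints, height, width), dtype=np.float32)
    # Keep the first 18 joints (assumption); rescale coordinates first if the
    # heatmap size differs from the source image resolution
    for i, (x, y, conf) in enumerate(kps[:num_joints]):
        if conf > 0:  # skip undetected joints
            heatmaps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2)
                                 / (2 * sigma ** 2))
    return heatmaps
```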
Model Files
- `best_model.pth` - Complete model checkpoint (457 MB)
- `training_curves.png` - Training dynamics visualization
- `comparison_grid.png` - Sample results (input/garment/output/ground truth)
- `loss_comparison.png` - Loss component analysis
- `garment_extraction.png` - Human parsing examples
- `pose_detailed_analysis.png` - Pose estimation examples
- `preprocessing_config.yaml` - Data preprocessing configuration
- `requirements.txt` - Python dependencies
- `research_paper.pdf` - Full research paper (Springer LNCS format)
Key Innovations
- Multi-Modal Feature Fusion: Combines RGB, pose heatmaps, and parsing masks into unified 41-channel representation
- Self-Attention Mechanism: Captures long-range spatial dependencies in bottleneck
- Spectral Normalization: Ensures discriminator training stability
- Multi-Component Loss: Balances adversarial, perceptual, reconstruction, and feature matching objectives
- Identity Preservation: Maintains person's face, hair, and body features while transferring garments
Limitations
- Requires pre-computed human parsing and pose estimation
- Performance depends on quality of input segmentation masks
- Best results with frontal-facing person images
- Garment transfer quality varies with pose complexity
- Requires GPU for real-time inference
Citation
If you use this model in your research, please cite:
@article{nasir2025virtualtryon,
  title={Deep Learning-Based Virtual Try-On System Using Multi-Modal Feature Fusion and Generative Adversarial Networks},
  author={Nasir, Huzaifa},
  year={2025},
  publisher={Zenodo},
  doi={10.5281/zenodo.18058537},
  url={https://doi.org/10.5281/zenodo.18058537},
  note={Hugging Face: https://huggingface.co/huzaifanasirrr/ai-virtual-tryon}
}
VITON-HD Dataset:
@inproceedings{choi2021viton,
  title={VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization},
  author={Choi, Seunghwan and Park, Sunghyun and Lee, Minsoo and Choo, Jaegul},
  booktitle={CVPR},
  year={2021}
}
Author
Huzaifa Nasir
National University of Computer and Emerging Sciences (FAST-NUCES), Islamabad, Pakistan
Email: nasirhuzaifa95@gmail.com
GitHub: https://github.com/huzaifanasir95/AI-Virtual-TryOn
Research Paper (Zenodo): https://doi.org/10.5281/zenodo.18058537
License
MIT License - See LICENSE file for details.
Acknowledgments
This project builds upon:
- VITON-HD: Dataset and baseline methods (Choi et al., 2021)
- OpenPose: Pose estimation (Cao et al., 2019)
- LIP Dataset: Human parsing (Gong et al., 2017)
- Pix2Pix: Image-to-image translation framework (Isola et al., 2017)
Research conducted at FAST-NUCES Islamabad. Special thanks to the open-source community for PyTorch and related libraries.