| <!DOCTYPE html> |
| <html> |
| <head> |
| <meta charset="utf-8"> |
| <meta name="description" |
| content="🐈 CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models"> |
| <meta name="keywords" content=""> |
| <meta name="viewport" content="width=device-width, initial-scale=1"> |
|
|
| <title>🐈 CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models</title> |
| <script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script> |
| <script> |
| window.dataLayer = window.dataLayer || []; |
| function gtag() { |
| dataLayer.push(arguments); |
| } |
| gtag('js', new Date()); |
| gtag('config', 'G-PYVRSFMDRL'); |
| </script> |
|
|
|
|
| <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" |
| rel="stylesheet"> |
| <link rel="stylesheet" href="resource/css/bulma.min.css"> |
| <link rel="stylesheet" href="resource/css/bulma-carousel.min.css"> |
| <link rel="stylesheet" href="resource/css/bulma-slider.min.css"> |
| <link rel="stylesheet" href="resource/css/fontawesome.all.min.css"> |
| <link rel="stylesheet" |
| href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"> |
| <link rel="stylesheet" href="resource/css/index.css"> |
| <link rel="icon" href="resource/images/favicon.svg"> |
|
|
| <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> |
| <script defer src="resource/js/fontawesome.all.min.js"></script> |
| <script src="resource/js/bulma-carousel.min.js"></script> |
| <script src="resource/js/bulma-slider.min.js"></script> |
| <script src="resource/js/index.js"></script> |
| </head> |
| <body> |
|
|
|
|
| <section class="hero"> |
| <div class="hero-body"> |
| <div class="container is-max-desktop"> |
| <div class="columns is-centered"> |
| <div class="column has-text-centered"> |
| <h1 class="title is-1 publication-title">🐈 CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models</h1> |
| <div class="is-size-5 publication-authors"> |
| <span class="author-block"> |
| <a href="">Zheng Chong</a><sup>1,3</sup>,</span> |
| <span class="author-block"> |
| <a href="">Xiao Dong</a><sup>1</sup>,</span> |
| <span class="author-block"> |
| <a href="">Haoxiang Li</a><sup>2</sup>,</span> |
| <span class="author-block"> |
| <a href="">Shiyue Zhang</a><sup>1</sup>, |
| </span> |
| <span class="author-block"> |
| <a href="">Wenqing Zhang</a><sup>1</sup>, |
| </span> |
| <span class="author-block"> |
| <a href="">Xujie Zhang</a><sup>1</sup>, |
| </span> |
| <span class="author-block"> |
| <a href="">Hanqing Zhao</a><sup>3,4</sup>, |
| </span> |
| <span class="author-block"> |
| <a href="">Xiaodan Liang</a><sup>*1,3</sup>, |
| </span> |
| </div> |
| <div class="is-size-5 publication-authors"> |
| <span class="author-block"><sup>1</sup>Sun Yat-Sen University,</span> |
| <span class="author-block"><sup>2</sup>Pixocial Technology,</span> |
| <span class="author-block"><sup>3</sup>Peng Cheng Laboratory,</span> |
| <span class="author-block"><sup>4</sup>SIAT</span> |
|
|
| </div> |
|
|
| <div class="column has-text-centered"> |
| <div class="publication-links"> |
| |
| <span class="link-block"> |
| <a href="https://arxiv.org/pdf/2407.15886" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="fas fa-file-pdf"></i> |
| </span> |
| <span>Paper</span> |
| </a> |
| </span> |
| |
| <span class="link-block"> |
| <a href="http://arxiv.org/abs/2407.15886" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="ai ai-arxiv"></i> |
| </span> |
| <span>arXiv</span> |
| </a> |
| </span> |
| |
| <span class="link-block"> |
| <a href="http://120.76.142.206:8888" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="fas fa-gamepad"></i> |
| </span> |
| <span>Demo</span> |
| </a> |
| </span> |
| |
| <span class="link-block"> |
| <a href="https://huggingface.co/spaces/zhengchong/CatVTON" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="fas fa-gamepad"></i> |
| </span> |
| <span>Space</span> |
| </a> |
| </span> |
| |
| <span class="link-block"> |
| <a href="https://huggingface.co/zhengchong/CatVTON" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="fas fa-cube"></i> |
| </span> |
| <span>Models</span> |
| </a> |
| </span> |
| |
| <span class="link-block"> |
| <a href="https://github.com/Zheng-Chong/CatVTON" |
| class="external-link button is-normal is-rounded is-dark"> |
| <span class="icon"> |
| <i class="fab fa-github"></i> |
| </span> |
| <span>Code</span> |
| </a> |
| </span> |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |
| </section> |
|
|
| <section class="hero teaser"> |
| <div class="container is-max-desktop"> |
| <div class="hero-body"> |
| <img src="resource/img/teaser.jpg" alt="teaser"> |
      <p>
        CatVTON is a simple and efficient virtual try-on diffusion model featuring 1) a lightweight network (899.06M parameters in total),
        2) parameter-efficient training (49.57M trainable parameters), and 3) simplified inference (&lt;8 GB of VRAM at 1024&times;768
        resolution).
      </p>
| </div> |
| </div> |
| </section> |
|
|
| |
| <section class="section"> |
| <div class="container is-max-desktop"> |
| |
| <div class="columns is-centered has-text-centered"> |
| <div class="column is-four-fifths"> |
| <h2 class="title is-3">Abstract</h2> |
| <div class="content has-text-justified"> |
| <p> |
| Virtual try-on methods based on diffusion models achieve realistic try-on effects but replicate the backbone network |
| as a ReferenceNet or leverage additional image encoders to process condition inputs, resulting in high training and |
| inference costs. |
        In this work, we rethink the necessity of ReferenceNet and image encoders and redesign the interaction between garment
        and person, proposing CatVTON, a simple and efficient virtual try-on diffusion model. It enables the seamless
        transfer of in-shop or worn garments of arbitrary categories to target persons by simply concatenating them along spatial
        dimensions as inputs. The efficiency of our model is demonstrated in three aspects:
|
|
        (1) Lightweight network. Only the original diffusion modules are used, without additional network modules. The text
        encoder and the cross-attention layers for text injection in the backbone are removed, further reducing the parameters by 167.02M.
|
|
        (2) Parameter-efficient training. We identify the try-on-relevant modules through experiments and achieve
        high-quality try-on effects by training only 49.57M parameters (~5.51% of the backbone network&rsquo;s parameters).
|
|
        (3) Simplified inference. CatVTON eliminates all unnecessary conditions and preprocessing steps, including
        pose estimation, human parsing, and text input, requiring only a garment reference, a target person image, and a mask for
        the virtual try-on process.
|
|
        Extensive experiments demonstrate that CatVTON achieves superior qualitative and
        quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore,
        CatVTON shows good generalization in in-the-wild scenarios despite being trained on open-source datasets with only 73K samples.
| </p> |
| </div> |
| </div> |
| </div> |
| |
| </div> |
| </section> |
|
|
|
|
| <section class="section"> |
| <div class="container is-max-desktop"> |
| |
| <div class="columns is-centered"> |
| <div class="column is-full-width"> |
| <h2 class="title is-3">Architecture</h2> |
| <div class="content has-text-justified"> |
| <img src="resource/img/architecture.jpg"> |
| <p> |
          Our method achieves high-quality try-on by simply concatenating the conditional image (garment or reference person)
          with the target person image along the spatial dimension, ensuring they remain in the same feature space throughout the
          diffusion process. Only the self-attention parameters, which provide global interaction, are learnable during training.
          The unnecessary cross-attention for text interaction is omitted, and no additional conditions, such as pose or parsing,
          are required. These factors result in a lightweight network with minimal trainable parameters and simplified inference.
| </p> |
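The concatenation idea above can be sketched in a few lines. This is a minimal NumPy illustration with assumed latent shapes, not the actual CatVTON implementation; the names `person_latent` and `garment_latent` and the latent sizes are hypothetical.

```python
import numpy as np

# Hypothetical VAE-latent shapes (channels, height, width); the real
# sizes depend on the model and are assumed here for illustration.
person_latent = np.random.randn(4, 64, 48)   # masked target-person latent
garment_latent = np.random.randn(4, 64, 48)  # garment-reference latent

# Concatenate along a spatial axis (height here) so a single diffusion
# backbone sees both images in one feature space -- no ReferenceNet or
# extra image encoder is needed.
x = np.concatenate([person_latent, garment_latent], axis=1)  # (4, 128, 48)

# ... the denoising U-Net would process `x` as one image; its
# self-attention lets the person half attend to the garment half ...

# After denoising, only the person half is decoded as the try-on result.
result = x[:, :person_latent.shape[1], :]
```

Because both halves pass through the same backbone, no feature alignment between separate networks is needed; global interaction comes for free from self-attention over the concatenated latent.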
| |
| </div> |
| </div> |
| </div> |
| |
| <div class="columns is-centered"> |
| |
| <div class="column"> |
| <div class="content"> |
| <h2 class="title is-3">Structure Comparison</h2> |
        <p>
          We illustrate a simple structural comparison of different kinds of try-on methods below. Our approach neither relies on warped garments nor
          requires a heavy ReferenceNet for additional garment encoding; it only needs a simple concatenation of the garment
          and person images as input to obtain high-quality try-on results.
        </p>
| <img src="resource/img/structure.jpg"> |
| </div> |
| </div> |
|
|
| |
| <div class="column"> |
| <h2 class="title is-3">Efficiency Comparison</h2> |
| <div class="columns is-centered"> |
| <div class="column content"> |
          <p>
            We represent each method by two concentric circles,
            where the outer circle denotes the total parameters and the inner circle denotes the trainable parameters, with the
            area proportional to the parameter count. CatVTON achieves a lower FID on the VITON-HD dataset with fewer total
            parameters, fewer trainable parameters, and lower memory usage.
          </p>
| <img src="resource/img/efficency.jpg"> |
| </div> |
|
|
| </div> |
| </div> |
| </div> |
|
|
| |
| <div class="columns is-centered"> |
| <div class="column is-full-width"> |
| <h2 class="title is-3">Online Demo</h2> |
| <div class="content has-text-justified"> |
| |
| |
          <p>
            Since GitHub Pages does not support embedded web pages, please visit our <a href="http://120.76.142.206:8888">Demo</a>.
          </p>
| </div> |
| </div> |
| </div> |
|
|
| |
| <div class="columns is-centered"> |
| <div class="column is-full-width"> |
| <h2 class="title is-3">Acknowledgement</h2> |
| <div class="content has-text-justified"> |
        <p>
          Our code is built on <a href="https://github.com/huggingface/diffusers">Diffusers</a>.
          We adopt <a href="https://huggingface.co/runwayml/stable-diffusion-inpainting">Stable Diffusion v1.5 inpainting</a> as the base model.
          We use <a href="https://github.com/GoGoDuck912/Self-Correction-Human-Parsing/tree/master">SCHP</a>
          and <a href="https://github.com/facebookresearch/DensePose">DensePose</a> to automatically generate masks in our
          <a href="https://github.com/gradio-app/gradio">Gradio</a> app.
          Thanks to all the contributors!
        </p>
| </div> |
| </div> |
| </div> |
| |
|
|
| <div class="container is-max-desktop content"> |
| <h2 class="title">BibTeX</h2> |
| <pre><code> |
| @misc{chong2024catvtonconcatenationneedvirtual, |
| title={CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models}, |
| author={Zheng Chong and Xiao Dong and Haoxiang Li and Shiyue Zhang and Wenqing Zhang and Xujie Zhang and Hanqing Zhao and Xiaodan Liang}, |
| year={2024}, |
| eprint={2407.15886}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2407.15886}, |
| } |
| </code></pre> |
| </div> |
| </div> |
| </section> |
|
|
|
|
|
|
| <footer class="footer"> |
| <div class="container"> |
| <div class="content has-text-centered"> |
| <a class="icon-link" href="http://arxiv.org/abs/2407.15886" class="external-link" disabled> |
| <i class="ai ai-arxiv"></i> |
| </a> |
| <a class="icon-link" href="https://arxiv.org/pdf/2407.15886"> |
| <i class="fas fa-file-pdf"></i> |
| </a> |
| <a class="icon-link" href="http://120.76.142.206:8888" class="external-link" disabled> |
| <i class="fas fa-gamepad"></i> |
| </a> |
| <a class="icon-link" href="https://github.com/Zheng-Chong/CatVTON" class="external-link" disabled> |
| <i class="fab fa-github"></i> |
| </a> |
|
|
| <a class="icon-link" href="https://huggingface.co/zhengchong/CatVTON" class="external-link" disabled> |
| <i class="fas fa-cube"></i> |
| </a> |
| |
| </div> |
| <div class="columns is-centered"> |
| <div class="column is-8"> |
| <div class="content"> |
| <p> |
| This website is modified from <a href="https://nerfies.github.io/">Nerfies</a>. Thanks for the great work! |
| Their source code is available on <a href="https://github.com/nerfies/nerfies.github.io">GitHub</a>. |
| </p> |
| </div> |
| </div> |
| </div> |
| </div> |
| </footer> |
|
|
| </body> |
| </html> |