Overview
JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on) is a mask-free virtual try-on framework built on MM-DiT. It targets three limitations of existing systems: rigid dependence on human body masks, limited fine-grained control over garment attributes, and poor generalization to in-the-wild scenarios.
Basic Usage
import torch
import torchvision.utils as vutils
from PIL import Image
# FluxTransformer2DModel, pipe, the transform_* helpers, and settings such as
# model_id, width, height, and seed are assumed to be defined elsewhere (not shown here).
# Load transformer with additional branches
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    subfolder="transformer",
    extra_branch_num=extra_branch_num,
    low_cpu_mem_usage=False,
).to(device)
# Load and preprocess images
person = Image.open('assets/ref.jpg').convert("RGB").resize((width, height))
cloth = Image.open('assets/upper.jpg').convert("RGB").resize((width, height))
person_tensor = transform_person(person)
cloth_tensor = transform_cloth(cloth)
prompt = "A fashion model wearing stylish clothing, high-resolution 8k, detailed textures, realistic lighting, fashion photography style."
# Generate image
with torch.inference_mode():
    generated_image = pipe(
        generator=torch.Generator(device="cpu").manual_seed(seed),
        prompt=prompt,
        num_inference_steps=n_steps,
        guidance_scale=guidance_scale,
        height=height,
        width=width,
        cloth_img=cloth_tensor,
        person_img=person_tensor,
        extra_branch_num=extra_branch_num,
        mode=mode,
        max_sequence_length=77,
    ).images[0]
# Save result
person_tensor = transform_output(person)
cloth_tensor = transform_output(cloth)
generated_tensor = transform_output(generated_image)
concatenated_tensor = torch.cat((cloth_tensor, person_tensor, generated_tensor), dim=2)
vutils.save_image(concatenated_tensor, 'output.png')
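The final `torch.cat(..., dim=2)` call places the garment, person, and generated images side by side. This only works when every dimension other than the concatenated one matches. The sketch below illustrates that shape rule in plain Python, assuming `transform_output` returns 3-D tensors in (channels, height, width) layout, so that `dim=2` is the width axis. `hconcat_shape` is a hypothetical helper for illustration and is not part of the repository.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, int, int]  # assumed (channels, height, width) layout


def hconcat_shape(shapes: Sequence[Shape]) -> Shape:
    """Return the shape that results from joining (C, H, W) tensors along dim=2.

    Mirrors the rule torch.cat enforces: all dimensions except the
    concatenated one must be equal. Hypothetical helper, illustration only.
    """
    if not shapes:
        raise ValueError("need at least one shape")
    channels, height, _ = shapes[0]
    for c, h, _ in shapes[1:]:
        if (c, h) != (channels, height):
            raise ValueError(
                f"cannot concatenate along width: expected (C, H)="
                f"({channels}, {height}), got ({c}, {h})"
            )
    # Widths add up; channels and height are unchanged.
    return (channels, height, sum(w for _, _, w in shapes))


# The sizes here are placeholders, not values taken from the repository.
print(hconcat_shape([(3, 1024, 768)] * 3))  # (3, 1024, 2304)
```

If the transforms instead return batched 4-D tensors, `dim=2` would refer to height and the images would be stacked vertically.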
Results
JCo-MVTON achieves state-of-the-art results on five of the six reported metrics; OOTDiffusion is marginally better on paired LPIPS:
| Methods | Paired SSIM ↑ | Paired FID ↓ | Paired KID ↓ | Paired LPIPS ↓ | Unpaired FID ↓ | Unpaired KID ↓ |
|---|---|---|---|---|---|---|
| MV-VTON (Wang et al., 2025b) | 0.8083 | 15.442 | 7.501 | 0.1171 | 17.900 | 3.861 |
| OOTDiffusion (Xu et al., 2024) | 0.8187 | 9.305 | 4.086 | 0.0876 | 12.408 | 4.689 |
| JCo-MVTON (Ours) | 0.8601 | 8.103 | 2.003 | 0.0891 | 9.561 | 2.700 |
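To read the table at a glance, the best method per metric can be computed directly from the reported numbers. The sketch below uses only the values in the table above and the usual conventions that higher SSIM is better while lower FID, KID, and LPIPS are better. The dictionary layout and the function name are illustrative choices, not part of the project.

```python
# Metrics for which a larger value is better; all others are "lower is better".
HIGHER_IS_BETTER = {"SSIM"}

METRICS = [
    "Paired SSIM", "Paired FID", "Paired KID",
    "Paired LPIPS", "Unpaired FID", "Unpaired KID",
]

# Scores copied from the results table, in the same column order as METRICS.
SCORES = {
    "MV-VTON": [0.8083, 15.442, 7.501, 0.1171, 17.900, 3.861],
    "OOTDiffusion": [0.8187, 9.305, 4.086, 0.0876, 12.408, 4.689],
    "JCo-MVTON": [0.8601, 8.103, 2.003, 0.0891, 9.561, 2.700],
}


def best_per_metric(scores, metrics):
    """Map each metric name to the method with the best reported score."""
    winners = {}
    for i, metric in enumerate(metrics):
        # The last word of the column name is the metric itself (e.g. "SSIM").
        pick = max if metric.split()[-1] in HIGHER_IS_BETTER else min
        winners[metric] = pick(scores, key=lambda name: scores[name][i])
    return winners


for metric, method in best_per_metric(SCORES, METRICS).items():
    print(f"{metric}: {method}")
```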
Citation
If you find our work useful, please cite:
@article{wang2024jco,
title={JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on},
author={Wang, Aowen and Li, Wei and Luo, Hao and Ao, Mengxing and Wang, Fan},
journal={arXiv preprint arXiv:xxxx.xxxxx},
year={2024}
}