# MODA-Fashion-Distilled
State-of-the-art fashion image-to-image retrieval in a single 768-d embedding.
MODA-Fashion-Distilled is a fine-tuned ViT-B-16-SigLIP model that achieves 67.63% Fine Recall@1 on LookBench, beating all published models including GR-Pro (closed) and Marqo-FashionSigLIP.
## Highlights
- +3.79 Fine R@1 over FashionSigLIP (63.84 → 67.63) on LookBench Overall
- +4.05 nDCG@5 over GR-Pro (49.80 → 53.85)
- Same architecture and embedding dimension (768-d) as FashionSigLIP — drop-in replacement
- 203M parameters, 224×224 input resolution
## LookBench Results
| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---|---|---|---|---|
| GR-Pro (closed) | — | 1024 | — | — | 49.80 |
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| FashionCLIP | 151M | 512 | 59.36 | 78.46 | 45.20 |
| MODA-Fashion-Distilled | 203M | 768 | 67.63 | 86.74 | 53.85 |
### Per-subset Fine Recall@1
| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---|---|---|---|
| RealStudioFlat | 1,011 | 66.96 | 70.23 | +3.27 |
| AIGen-Studio | 193 | 76.68 | 80.31 | +3.63 |
| RealStreetLook | 981 | 56.37 | 60.24 | +3.87 |
| AIGen-StreetLook | 160 | 74.38 | 81.25 | +6.87 |
| Overall | 2,345 | 63.84 | 67.63 | +3.79 |
## Model Spec
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (full CLIP: vision + text) |
| Parameters | 203.2M |
| Embedding Dimension | 768 |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~775 MB |
| Model Size (pytorch .bin) | ~775 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP |
| Precision | float32 |
## Inference — Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```
## Python API

```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-distilled/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```
## Image-to-Image Retrieval

```python
with torch.no_grad():
    query_emb = model.encode_image(query_tensor)       # [1, 768]
    gallery_embs = model.encode_image(gallery_tensor)  # [N, 768]

query_emb = F.normalize(query_emb, dim=-1)
gallery_embs = F.normalize(gallery_embs, dim=-1)

similarities = query_emb @ gallery_embs.T  # cosine similarities, [1, N]
top_k = similarities.topk(10, dim=-1)      # scores and indices of the 10 nearest
```
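For larger galleries, images can be embedded in batches before search. A minimal sketch, continuing from the Python API setup above; the helper name `embed_gallery`, the `paths` argument, and the batch size are illustrative assumptions, not part of this repo:

```python
def embed_gallery(paths, model, preprocess, batch_size=64, device="cpu"):
    # Embed a list of image files in batches; returns L2-normalized [N, 768].
    embs = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack(
            [preprocess(Image.open(p).convert("RGB")) for p in paths[i:i + batch_size]]
        ).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        embs.append(F.normalize(feats, dim=-1).cpu())
    return torch.cat(embs)

gallery_embs = embed_gallery(["a.jpg", "b.jpg"], model, preprocess)
```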
## Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```
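These can be installed with pip:

```bash
pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors
```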
## Training Details
- Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
- Method: Ensemble distillation from a 3-model 2048-d teacher (MODA-SigLIP-DF2 + FashionSigLIP + FashionCLIP)
- Loss: RKD-Distance (weight 25) + similarity mimicry (weight 10) + L2 weight-drift regularization (weight 0.01)
- Training data: DeepFashion2 (cross-domain shop↔consumer pairs) + DeepFashion-Multimodal + H&M — no LookBench data used
- Optimizer: AdamW, LR=5e-6, batch=128
- Epochs: 2 (best checkpoint at step 500)
- Hardware: Apple M-series (MPS)
## How It Works
- Cross-domain fine-tuning: First, the vision encoder was fine-tuned on DeepFashion2's shop-to-consumer image pairs using InfoNCE + weight drift regularization, producing a model that learns cross-domain visual similarity.
- Ensemble teacher: Three models (the DF2-finetuned SigLIP + original FashionSigLIP + FashionCLIP) were concatenated into a 2048-d ensemble that scored 67.68 Fine R@1.
- Distillation: The ensemble's ranking knowledge was distilled into a single 768-d student using relational knowledge distillation (RKD-Distance) + similarity mimicry, retaining 99.9% of ensemble performance in one forward pass (sketched below).
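A minimal sketch of this distillation objective. The exact formulation and the mean-distance normalization are assumptions (following the standard RKD-Distance loss of Park et al., 2019); only the loss weights come from Training Details, and `df2_siglip`, `fashion_siglip`, and `fashion_clip` stand in for the three teacher models:

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student, teacher):
    # Relational KD: match the pairwise-distance structure of the two
    # embedding spaces, normalized by the mean non-zero distance so the
    # 768-d student and 2048-d teacher spaces are comparable.
    def pdist(e):
        d = torch.cdist(e, e, p=2)
        return d / d[d > 0].mean()
    return F.smooth_l1_loss(pdist(student), pdist(teacher))

def similarity_mimicry_loss(student, teacher):
    # Match the in-batch cosine-similarity matrices of student and teacher.
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def teacher_embed(images, df2_siglip, fashion_siglip, fashion_clip):
    # Ensemble teacher: concatenation of three normalized embeddings,
    # 768 + 768 + 512 = 2048-d, as described above.
    with torch.no_grad():
        parts = [
            F.normalize(df2_siglip.encode_image(images), dim=-1),
            F.normalize(fashion_siglip.encode_image(images), dim=-1),
            F.normalize(fashion_clip.encode_image(images), dim=-1),
        ]
    return torch.cat(parts, dim=-1)

# Per batch: loss = 25 * rkd_distance_loss(s, t)
#                 + 10 * similarity_mimicry_loss(s, t)
#                 + 0.01 * L2 drift of the student from its base weights
```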
## Related Models
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled (this model) | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |
## License
MIT
## Citation
If you use this model, please cite:
```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```