# MODA-Fashion-Distilled
State-of-the-art fashion image-to-image retrieval in a single 768-d embedding.
MODA-Fashion-Distilled is a fine-tuned ViT-B-16-SigLIP model that achieves 67.63% Fine Recall@1 on LookBench, beating all published models including GR-Pro (closed) and Marqo-FashionSigLIP.
## Highlights
- +3.79 Fine R@1 over FashionSigLIP (63.84 → 67.63) on LookBench Overall
- +4.05 nDCG@5 over GR-Pro (49.80 → 53.85)
- Same architecture and embedding dimension (768-d) as FashionSigLIP — drop-in replacement
- 203M parameters, 224×224 input resolution
## LookBench Results
| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---|---|---|---|---|
| GR-Pro (closed) | — | 1024 | — | — | 49.80 |
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| FashionCLIP | 151M | 512 | 59.36 | 78.46 | 45.20 |
| MODA-Fashion-Distilled | 203M | 768 | 67.63 | 86.74 | 53.85 |
### Per-subset Fine Recall@1
| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---|---|---|---|
| RealStudioFlat | 1,011 | 66.96 | 70.23 | +3.27 |
| AIGen-Studio | 193 | 76.68 | 80.31 | +3.63 |
| RealStreetLook | 981 | 56.37 | 60.24 | +3.87 |
| AIGen-StreetLook | 160 | 74.38 | 81.25 | +6.87 |
| Overall | 2,345 | 63.84 | 67.63 | +3.79 |
## Model Spec
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (full CLIP: vision + text) |
| Parameters | 203.2M |
| Embedding Dimension | 768 |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~775 MB |
| Model Size (pytorch .bin) | ~775 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP |
| Precision | float32 |
## Inference — Quick Start

A standalone `inference.py` is included in this directory.

```bash
# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```
## Python API

```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-distilled/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
features = F.normalize(features, p=2, dim=-1)  # [1, 768]
```
## Image-to-Image Retrieval

```python
with torch.no_grad():
    query_emb = model.encode_image(query_tensor)       # [1, 768]
    gallery_embs = model.encode_image(gallery_tensor)  # [N, 768]

query_emb = F.normalize(query_emb, dim=-1)
gallery_embs = F.normalize(gallery_embs, dim=-1)

similarities = query_emb @ gallery_embs.T  # cosine similarities, [1, N]
top_k = similarities.topk(10, dim=-1)      # scores and indices of the 10 nearest
```
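For larger galleries, images can be embedded in batches before search. A minimal sketch, continuing from the Python API setup above; the helper name `embed_gallery`, the `paths` argument, and the batch size are illustrative assumptions, not part of this repo:

```python
def embed_gallery(paths, model, preprocess, batch_size=64, device="cpu"):
    # Embed a list of image files in batches; returns L2-normalized [N, 768].
    embs = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack(
            [preprocess(Image.open(p).convert("RGB")) for p in paths[i:i + batch_size]]
        ).to(device)
        with torch.no_grad():
            feats = model.encode_image(batch)
        embs.append(F.normalize(feats, dim=-1).cpu())
    return torch.cat(embs)

gallery_embs = embed_gallery(["a.jpg", "b.jpg"], model, preprocess)
```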
## Requirements

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```
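These can be installed with pip:

```bash
pip install "open_clip_torch>=2.20.0" "torch>=2.0" Pillow safetensors
```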
## Training Details
- Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
- Method: Ensemble distillation from a 3-model 2048-d teacher (MODA-SigLIP-DF2 + FashionSigLIP + FashionCLIP)
- Loss: RKD-Distance (weight 25) + similarity mimicry (weight 10) + L2 weight-drift regularization (weight 0.01)
- Training data: DeepFashion2 (cross-domain shop↔consumer pairs) + DeepFashion-Multimodal + H&M — no LookBench data used
- Optimizer: AdamW, LR=5e-6, batch=128
- Epochs: 2 (best checkpoint at step 500)
- Hardware: Apple M-series (MPS)
## How It Works
- Cross-domain fine-tuning: First, the vision encoder was fine-tuned on DeepFashion2's shop-to-consumer image pairs using InfoNCE + weight drift regularization, producing a model that learns cross-domain visual similarity.
- Ensemble teacher: Three models (the DF2-finetuned SigLIP + original FashionSigLIP + FashionCLIP) were concatenated into a 2048-d ensemble that scored 67.68 Fine R@1.
- Distillation: The ensemble's ranking knowledge was distilled into a single 768-d student using relational knowledge distillation (RKD-Distance) + similarity mimicry, retaining 99.9% of ensemble performance in one forward pass (sketched below).
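A minimal sketch of this distillation objective. The exact formulation and the mean-distance normalization are assumptions (following the standard RKD-Distance loss of Park et al., 2019); only the loss weights come from Training Details, and `df2_siglip`, `fashion_siglip`, and `fashion_clip` stand in for the three teacher models:

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student, teacher):
    # Relational KD: match the pairwise-distance structure of the two
    # embedding spaces, normalized by the mean non-zero distance so the
    # 768-d student and 2048-d teacher spaces are comparable.
    def pdist(e):
        d = torch.cdist(e, e, p=2)
        return d / d[d > 0].mean()
    return F.smooth_l1_loss(pdist(student), pdist(teacher))

def similarity_mimicry_loss(student, teacher):
    # Match the in-batch cosine-similarity matrices of student and teacher.
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def teacher_embed(images, df2_siglip, fashion_siglip, fashion_clip):
    # Ensemble teacher: concatenation of three normalized embeddings,
    # 768 + 768 + 512 = 2048-d, as described above.
    with torch.no_grad():
        parts = [
            F.normalize(df2_siglip.encode_image(images), dim=-1),
            F.normalize(fashion_siglip.encode_image(images), dim=-1),
            F.normalize(fashion_clip.encode_image(images), dim=-1),
        ]
    return torch.cat(parts, dim=-1)

# Per batch: loss = 25 * rkd_distance_loss(s, t)
#                 + 10 * similarity_mimicry_loss(s, t)
#                 + 0.01 * L2 drift of the student from its base weights
```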
## Related Models
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled (this model) | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |
## License
MIT
## Citation
If you use this model, please cite:
```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```