MODA-Fashion-Distilled

State-of-the-art fashion image-to-image retrieval in a single 768-d embedding.

MODA-Fashion-Distilled is a fine-tuned ViT-B-16-SigLIP model that achieves 67.63% Fine Recall@1 on LookBench, beating all published models including GR-Pro (closed) and Marqo-FashionSigLIP.

Highlights

  • +3.79 Fine R@1 over FashionSigLIP (63.84 → 67.63) on LookBench Overall
  • +4.05 nDCG@5 over GR-Pro (49.80 → 53.85)
  • Same architecture and embedding dimension (768-d) as FashionSigLIP, so it works as a drop-in replacement (see the sketch after this list)
  • 203M parameters, 224×224 input resolution
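
Because the tower architecture and 768-d output match FashionSigLIP, swapping checkpoints is a one-line change in OpenCLIP. A minimal sketch; the hf-hub id for the original FashionSigLIP weights is an assumption, not taken from this card:

import open_clip

# Before (assumed hub id for the original FashionSigLIP weights):
# model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:Marqo/marqo-fashionSigLIP")

# After: same architecture and embedding size, local MODA checkpoint swapped in
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-distilled/open_clip_model.safetensors",
)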

LookBench Results

| Model | Params | Dim | Fine R@1 | Coarse R@1 | nDCG@5 |
|---|---|---|---|---|---|
| GR-Pro (closed) | — | 1024 | — | — | 49.80 |
| FashionSigLIP | 203M | 768 | 63.84 | 83.67 | 49.63 |
| FashionCLIP | 151M | 512 | 59.36 | 78.46 | 45.20 |
| MODA-Fashion-Distilled | 203M | 768 | 67.63 | 86.74 | 53.85 |

Per-subset Fine Recall@1

| Subset | Queries | FashionSigLIP | Ours | Delta |
|---|---|---|---|---|
| RealStudioFlat | 1,011 | 66.96 | 70.23 | +3.27 |
| AIGen-Studio | 193 | 76.68 | 80.31 | +3.63 |
| RealStreetLook | 981 | 56.37 | 60.24 | +3.87 |
| AIGen-StreetLook | 160 | 74.38 | 81.25 | +6.87 |
| Overall | 2,345 | 63.84 | 67.63 | +3.79 |

Model Spec

| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (full CLIP: vision + text) |
| Parameters | 203.2M |
| Embedding Dimension | 768 |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~775 MB |
| Model Size (pytorch .bin) | ~775 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP |
| Precision | float32 |

Inference — Quick Start

A standalone inference.py is included in this directory.

# Single image → 768-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda

Python API

import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

# Load the local safetensors checkpoint into the SigLIP ViT-B/16 architecture
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-distilled/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model.encode_image(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
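
The checkpoint ships both towers (see Model Spec), so text can also be encoded, e.g. for text-to-image search. A minimal sketch reusing the model loaded above; the query string is illustrative, and text retrieval is not benchmarked on this card:

tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP")
text = tokenizer(["a red floral summer dress"])  # illustrative query, not from the card
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features = F.normalize(text_features, p=2, dim=-1)  # [1, 768]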

Image-to-Image Retrieval

# query_tensor: [1, 3, 224, 224]; gallery_tensor: [N, 3, 224, 224],
# both produced by the preprocess transform shown above
with torch.no_grad():
    query_emb = model.encode_image(query_tensor)       # [1, 768]
    gallery_embs = model.encode_image(gallery_tensor)  # [N, 768]

query_emb = F.normalize(query_emb, dim=-1)
gallery_embs = F.normalize(gallery_embs, dim=-1)

similarities = query_emb @ gallery_embs.T  # [1, N] cosine similarities
top_k = similarities.topk(10, dim=-1)      # scores and indices of the 10 nearest items
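
For larger galleries, the same normalized embeddings can be served from a nearest-neighbor index. FAISS is not among this card's requirements, so the sketch below is an assumption (pip install faiss-cpu); with L2-normalized vectors, inner-product search equals cosine similarity:

import faiss  # assumed extra dependency, not in the requirements below

index = faiss.IndexFlatIP(768)             # exact inner-product index over 768-d vectors
index.add(gallery_embs.cpu().numpy())      # FAISS expects float32, shape [N, 768]
scores, ids = index.search(query_emb.cpu().numpy(), 10)  # top-10 gallery ids per query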

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors

Training Details

  • Base model: Marqo-FashionSigLIP (ViT-B-16-SigLIP, webli pretrained)
  • Method: Ensemble distillation from a 3-model 2048-d teacher (MODA-SigLIP-DF2 + FashionSigLIP + FashionCLIP)
  • Loss: RKD-Distance (weight 25) + similarity mimicry (weight 10) + L2 weight drift regularization (weight 0.01); a sketch of this objective follows the list
  • Training data: DeepFashion2 (cross-domain shop↔consumer pairs) + DeepFashion-Multimodal + H&M — no LookBench data used
  • Optimizer: AdamW, LR=5e-6, batch=128
  • Epochs: 2 (best checkpoint at step 500)
  • Hardware: Apple M-series (MPS)
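
The training code is not published here, so the sketch below is a reconstruction of the objective from the bullets above under stated assumptions: the exact form of each term is assumed (batch-wise RKD-D with mean-scaled distances, MSE on cosine-similarity matrices, squared-L2 drift from the base weights), and z_s, z_t, model, and base_params are hypothetical names for the student/teacher batch embeddings, the student model, and a frozen copy of the base parameters. Only the weights 25 / 10 / 0.01 come from the card.

import torch
import torch.nn.functional as F

def rkd_distance(z_s, z_t):
    # RKD-D: match the pairwise distance structure of the batch,
    # each distance matrix scaled by its mean (assumed formulation).
    def pdist(z):
        d = torch.cdist(z, z, p=2)
        return d / (d[d > 0].mean() + 1e-8)
    return F.smooth_l1_loss(pdist(z_s), pdist(z_t))

def similarity_mimicry(z_s, z_t):
    # Match the student's cosine-similarity matrix to the teacher's.
    s = F.normalize(z_s, dim=-1)
    t = F.normalize(z_t, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def weight_drift(model, base_params):
    # L2 regularizer keeping the student close to its initialization.
    return sum((p - b).pow(2).sum() for p, b in zip(model.parameters(), base_params))

loss = (25.0 * rkd_distance(z_s, z_t)
        + 10.0 * similarity_mimicry(z_s, z_t)
        + 0.01 * weight_drift(model, base_params))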

How It Works

  1. Cross-domain fine-tuning: First, the vision encoder was fine-tuned on DeepFashion2's shop-to-consumer image pairs using InfoNCE + weight drift regularization, producing a model that learns cross-domain visual similarity.
  2. Ensemble teacher: Three models (the DF2-finetuned SigLIP + original FashionSigLIP + FashionCLIP) were concatenated into a 2048-d ensemble that scored 67.68 Fine R@1 (see the sketch after this list).
  3. Distillation: The ensemble's ranking knowledge was distilled into a single 768-d student using relational knowledge distillation (RKD-Distance) + similarity mimicry, retaining 99.9% of ensemble performance in one forward pass.
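
A minimal sketch of the ensemble teacher from step 2. The model handles and input batches are hypothetical, each model is assumed to use its own preprocess transform, and per-model L2 normalization before concatenation (plus renormalizing the result) is an assumption; the 768 + 768 + 512 = 2048 dimensions match the card.

import torch
import torch.nn.functional as F

# model_df2, model_fsl, model_fcl and their preprocessed batches
# (x_df2, x_fsl, x_fcl) are assumed to be loaded elsewhere.
with torch.no_grad():
    e_df2 = F.normalize(model_df2.encode_image(x_df2), dim=-1)  # 768-d
    e_fsl = F.normalize(model_fsl.encode_image(x_fsl), dim=-1)  # 768-d
    e_fcl = F.normalize(model_fcl.encode_image(x_fcl), dim=-1)  # 512-d
    # Concatenate to a single 2048-d teacher embedding per image.
    teacher = F.normalize(torch.cat([e_df2, e_fsl, e_fcl], dim=-1), dim=-1)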

Related Models

| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled (this model) | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}