MODA-Fashion-Distilled-512d

Matches the 768-d model's Fine Recall@1 on LookBench in a compact 512-d embedding.

MODA-Fashion-Distilled-512d combines the FashionSigLIP backbone with a learned 768→512 linear projection, trained via ensemble distillation. It achieves 67.63% Fine Recall@1 on LookBench — identical to the 768-d variant — while producing 33% smaller embeddings.
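
The 33% figure follows directly from the dimension ratio (512 / 768 = 2/3). As a rough illustration, the sketch below estimates raw float32 index size for a hypothetical catalogue of one million images. The catalogue size and the `index_bytes` helper are assumptions for illustration only, and any overhead from a specific vector index is ignored.

```python
BYTES_PER_FLOAT32 = 4


def index_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for a flat index of dense vectors, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value


n = 1_000_000  # hypothetical catalogue size, not a figure from this card
size_768 = index_bytes(n, 768)
size_512 = index_bytes(n, 512)
reduction = 1 - size_512 / size_768

print(f"768-d: {size_768 / 1e9:.3f} GB")
print(f"512-d: {size_512 / 1e9:.3f} GB")
print(f"reduction: {reduction:.1%}")
```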

Highlights

67.63% Fine R@1 on LookBench, tied with the 768-d distilled model
54.11 nDCG@5, the highest among the models compared below (the 768-d distilled model scores 53.85)
512-d output, giving a 33% smaller index than 768-d models
PCA-initialized projection head, refined with RKD-Distance loss
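
For readers unfamiliar with PCA initialization, here is a minimal NumPy sketch of how a 768→512 projection matrix could be seeded from the top principal directions of a sample of embeddings. It is not the repository's training code: the `pca_projection_weight` helper is hypothetical, random data stands in for real backbone embeddings, and centering the sample before the SVD is an assumption.

```python
import numpy as np


def pca_projection_weight(features, out_dim):
    """Return an (out_dim, in_dim) matrix whose rows are the top principal directions."""
    # Centering is used only to estimate the directions (an assumption here);
    # a bias-free linear layer would still be applied to uncentered features.
    centered = features.astype(np.float64) - features.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions, sorted by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:out_dim]


rng = np.random.default_rng(0)
# Stand-in for 4096 backbone embeddings of width 768.
features = rng.normal(size=(4096, 768)).astype(np.float32)

weight = pca_projection_weight(features, out_dim=512)  # same layout as Linear(768, 512).weight
projected = features @ weight.T

print(weight.shape, projected.shape)
```

A matrix built this way has orthonormal rows, so it starts training as the projection that keeps the most variance of the sampled embeddings.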

LookBench Results

Model	Params	Dim	Fine R@1	Coarse R@1	nDCG@5
FashionSigLIP	203M	768	63.84	83.67	49.63
FashionCLIP	151M	512	59.36	78.46	45.20
MODA-Fashion-Distilled	203M	768	67.63	86.74	53.85
MODA-Fashion-Distilled-512d	203M	512	67.63	86.87	54.11

Per-subset Fine Recall@1

Subset	Queries	FashionSigLIP	MODA-Fashion-Distilled-512d (ours)	Delta (points)
RealStudioFlat	1,011	66.96	70.23	+3.27
AIGen-Studio	193	76.68	78.24	+1.56
RealStreetLook	981	56.37	60.24	+3.87
AIGen-StreetLook	160	74.38	83.75	+9.37
Overall	2,345	63.84	67.63	+3.79
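
The Overall row is consistent with a query-weighted average of the four subsets. The short check below reproduces it from the table values; the assumption that Overall is computed this way comes from the arithmetic, not from a statement in the benchmark.

```python
# (queries, FashionSigLIP Fine R@1, this model's Fine R@1), copied from the table above.
subsets = {
    "RealStudioFlat": (1011, 66.96, 70.23),
    "AIGen-Studio": (193, 76.68, 78.24),
    "RealStreetLook": (981, 56.37, 60.24),
    "AIGen-StreetLook": (160, 74.38, 83.75),
}

total_queries = sum(row[0] for row in subsets.values())


def weighted_recall(column):
    """Query-weighted mean of one score column across all subsets."""
    return sum(row[0] * row[column] for row in subsets.values()) / total_queries


baseline_overall = weighted_recall(1)
ours_overall = weighted_recall(2)

print(total_queries, round(baseline_overall, 2), round(ours_overall, 2))
```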

Model Spec

Property	Value
Architecture	ViT-B/16-SigLIP + Linear(768→512) projection
Parameters	203.2M (backbone) + 393K (projection) = 203.6M
Embedding Dimension	512
Output	L2-normalized float32 vector
Model Size (safetensors)	~777 MB
Input Resolution	224 × 224
Framework	OpenCLIP + custom projection head
Precision	float32
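
The parameter and size figures above can be sanity-checked with a few lines of arithmetic. The sketch assumes the safetensors file is dominated by float32 weights (4 bytes per parameter). Under that assumption the total comes to about 814 MB in decimal units, or about 777 MiB, which suggests the "~777 MB" figure may be binary megabytes. That reading is an inference, not something the card states.

```python
backbone_params = 203.2e6      # backbone count from the table above
proj_params = 768 * 512        # Linear(768 -> 512) with no bias term
total_params = backbone_params + proj_params

bytes_per_param = 4            # float32
size_bytes = total_params * bytes_per_param
size_mb = size_bytes / 1e6     # decimal megabytes
size_mib = size_bytes / 2**20  # binary mebibytes

print(f"projection params: {proj_params:,}")
print(f"total params: {total_params / 1e6:.1f}M")
print(f"float32 size: {size_mb:.0f} MB ({size_mib:.0f} MiB)")
```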

Inference — Quick Start

A standalone inference.py is included in this directory.

# Single image → 512-d embedding
python inference.py --image query.jpg

# Two images → embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda

Python API

import open_clip
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from safetensors.torch import load_file

# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# MODA's state_dict below contains all 162 visual.* keys, so no random-init values
# leak into the visual tower. Avoids the ~775 MB Marqo checkpoint download.
backbone, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP", pretrained=None
)

state = load_file("path/to/moda-fashion-distilled-512d/model.safetensors")
proj_weight = state.pop("proj.weight")
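# strict=False skips keys the checkpoint does not provide; the returned object
# lists any missing or unexpected keys if you want to inspect them.
load_result = backbone.load_state_dict(state, strict=False)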
backbone.eval()

proj = nn.Linear(768, 512, bias=False)
proj.weight.data.copy_(proj_weight)
proj.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    features = backbone.encode_image(image)
    features = F.normalize(proj(features), p=2, dim=-1)  # [1, 512]
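
Because the embeddings are L2-normalized, cosine similarity reduces to a dot product, so ranking a gallery against a query is a single matrix-vector multiply. The sketch below illustrates this with NumPy on random stand-in vectors. The `l2_normalize` and `top_k` helpers, the gallery size, and the noise level are all hypothetical; in practice the arrays would hold embeddings produced by the model.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit L2 norm along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def top_k(query, gallery, k=5):
    """Return indices and scores of the k gallery rows most similar to the query."""
    scores = gallery @ query            # equals cosine similarity for unit-norm inputs
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]


rng = np.random.default_rng(0)
# Stand-in gallery of 1000 unit-norm 512-d embeddings.
gallery = l2_normalize(rng.normal(size=(1000, 512)).astype(np.float32))
# A query that is a slightly perturbed copy of gallery item 42.
noise = 0.01 * rng.normal(size=512).astype(np.float32)
query = l2_normalize(gallery[42] + noise)

indices, scores = top_k(query, gallery, k=5)
print(indices, scores)
```

For large catalogues the same dot-product scoring is what an inner-product vector index computes, so the embeddings can be stored as-is without further normalization.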

Requirements

open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors

Training Details

Base model: MODA-Fashion-Distilled (768-d, the distilled student)
Projection: Linear(768 → 512, no bias), PCA-initialized from 4096 training samples
Teacher: 2048-d ensemble (MODA-SigLIP-DF2 + FashionSigLIP + FashionCLIP)
Loss: RKD-Distance (weight 25) + similarity mimicry (weight 10) + L2 weight drift (weight 0.01)
Optimizer: AdamW, backbone LR=5e-6, projection head LR=5e-4
Training data: DeepFashion-InShop + DeepFashion-Multimodal + H&M (30K limit)
Epochs: 3 (best at step 800)
Hardware: Apple M-series (MPS)
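
The RKD-Distance term is what lets a 512-d student learn from a 2048-d teacher: it compares pairwise distance structure within a batch rather than the embeddings themselves, so the two dimensions never need to match. Below is a minimal NumPy sketch of the commonly used formulation from relational knowledge distillation (mean-normalized pairwise distances compared with a Huber penalty). It is an illustration under those assumptions, not the repository's training code, and the batch size and random inputs are placeholders.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x, shape (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def rkd_distance_loss(student, teacher):
    """Huber penalty between mean-normalized pairwise distances of two batches."""
    n = student.shape[0]
    off_diag = ~np.eye(n, dtype=bool)       # ignore each sample's distance to itself
    d_s = pairwise_distances(student)
    d_t = pairwise_distances(teacher)
    d_s = d_s / d_s[off_diag].mean()        # normalizing makes the loss scale-invariant
    d_t = d_t / d_t[off_diag].mean()
    err = np.abs(d_s - d_t)[off_diag]
    huber = np.where(err < 1.0, 0.5 * err ** 2, err - 0.5)
    return float(huber.mean())


rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 2048))  # stand-in for the 2048-d ensemble embeddings
student = rng.normal(size=(8, 512))   # stand-in for the 512-d student embeddings

loss = rkd_distance_loss(student, teacher)
weighted_loss = 25 * loss             # weight taken from the loss description above
print(loss, weighted_loss)
```

The teacher width of 2048 matches the sum of the three ensemble members' dimensions (768 + 768 + 512).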

Related Models

Model	Dim	Fine R@1	Best for
MODA-Fashion-Distilled	768	67.63	Best overall quality
MODA-Fashion-Matryoshka	64-768	67.42 (256d)	Flexible dim, 3x smaller index
MODA-Fashion-Vision-FP16	768	67.42	Smallest (186 MB), edge/mobile
MODA-Fashion-Distilled-512d (this model)	512	67.63	Compact index, highest nDCG@5
MODA-Fashion-DeepFashion2	768	66.52	Simplest recipe, no distillation

License

MIT

Citation

If you use this model, please cite:

@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
