Based on the paper [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147).
Flexible-dimension fashion embeddings: choose your own size, from 64 to 768.
MODA-Fashion-Matryoshka uses Matryoshka Representation Learning to produce nested embeddings in which the leading N dimensions are themselves a valid embedding. This means you can use 256-d for a 3× smaller index with no quality loss (256-d actually has the best Fine R@1 in the table below), or 64-d for 12× compression while still beating FashionSigLIP-768.
| Dim | Bytes/vec (fp32) | Fine R@1 | Δ R@1 vs FashionSigLIP-768 | Index size (1M images) |
|---|---|---|---|---|
| 768 | 3,072 | 67.21 | +3.37 | 2.93 GB |
| 512 | 2,048 | 66.75 | +2.91 | 1.95 GB |
| 384 | 1,536 | 67.07 | +3.23 | 1.46 GB |
| 256 | 1,024 | 67.42 | +3.58 | 977 MB |
| 128 | 512 | 66.23 | +2.39 | 488 MB |
| 64 | 256 | 64.05 | +0.21 | 244 MB |
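The index sizes fall straight out of float32 storage, 4 bytes per dimension per vector; a quick sanity check (the table's sizes are MiB-based):

```python
def index_size_mib(dim: int, n_vectors: int = 1_000_000) -> float:
    """Flat float32 index size in MiB: 4 bytes per dimension per vector."""
    return n_vectors * dim * 4 / 2**20

print(f"{index_size_mib(256):.0f} MiB")  # 977 -> matches the 256-d row
print(f"{index_size_mib(64):.0f} MiB")   # 244 -> matches the 64-d row
```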
All rows below use the recommended 256-d embeddings; "binary+rerank" keeps 32-byte binary codes for search plus 512-byte fp16 vectors for reranking.

| Precision | Bytes/vec | Fine R@1 | Index size (1M images) |
|---|---|---|---|
| fp32 | 1,024 | 67.16 | 977 MB |
| fp16 | 512 | 67.08 | 488 MB |
| int8 | 256 | 67.16 | 244 MB |
| binary+rerank | 32+512 | 67.29 | ~519 MB |
| binary (pure) | 32 | 63.50 | 30 MB |
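The binary+rerank row corresponds to a two-stage pipeline: sign-binarize each 256-d vector into a 32-byte code for a fast Hamming-distance shortlist, then rerank the shortlist against fp16 copies. A minimal NumPy sketch of that idea; `db` and `q` are illustrative names for L2-normalized 256-d embeddings, not part of this repo:

```python
import numpy as np

# Offline: pack the database once.
# `db`: (n, 256) float32, L2-normalized 256-d embeddings (illustrative name)
db_bits = np.packbits(db > 0, axis=-1)   # (n, 32) uint8  -> 32 bytes/vector
db_fp16 = db.astype(np.float16)          # (n, 256) fp16  -> 512 bytes/vector

def search(q: np.ndarray, shortlist: int = 100, k: int = 10) -> np.ndarray:
    """Two-stage retrieval: Hamming shortlist, then fp16 cosine rerank."""
    q_bits = np.packbits(q > 0)          # (32,) packed query code
    # Stage 1: Hamming distance = popcount(xor) over the packed codes
    ham = np.unpackbits(np.bitwise_xor(db_bits, q_bits), axis=-1).sum(axis=-1)
    cand = np.argpartition(ham, shortlist)[:shortlist]
    # Stage 2: exact cosine on the fp16 candidates (vectors are unit-norm)
    scores = db_fp16[cand].astype(np.float32) @ q.astype(np.float32)
    return cand[np.argsort(-scores)[:k]]
```

At 32 + 512 bytes per vector, 1M images come to roughly 519 MiB, which is the "~519 MB" row above.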
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (full CLIP: vision + text) |
| Parameters | 203.2M |
| Embedding Dimension | 768 (full) or 512 / 384 / 256 / 128 / 64 |
| Recommended Dimension | 256 (sweet spot: best Fine R@1, 3× smaller index) |
| Output | L2-normalized float32 vector |
| Model Size (safetensors) | ~775 MB |
| Input Resolution | 224 × 224 |
| Framework | OpenCLIP |
| Precision | float32 |
A standalone `inference.py` is included in this directory; it supports dimension selection and a full sweep across dimensions.
```bash
# Default: encode at 256-d (recommended)
python inference.py --image query.jpg

# Specify dimension
python inference.py --image query.jpg --dim 128

# Two images + cosine similarity at 256-d
python inference.py --image img1.jpg img2.jpg --dim 256 --similarity

# Sweep all dimensions (64 → 768) with index cost
python inference.py --image img1.jpg --sweep

# Run on GPU/MPS
python inference.py --image query.jpg --device cuda
```
Or use the model directly with OpenCLIP:

```python
import open_clip
import torch
import torch.nn.functional as F
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16-SigLIP",
    pretrained="path/to/moda-fashion-matryoshka/open_clip_model.safetensors",
)
model.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    full_emb = model.encode_image(image)  # [1, 768]

# Matryoshka: truncate to the leading dims, then re-normalize
dim = 256
emb_256 = F.normalize(full_emb[:, :dim], dim=-1)  # [1, 256]
```
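Because the checkpoint ships both towers (full CLIP, vision + text), text→image queries work the same way: truncate and renormalize both sides to the same dimension. A short sketch continuing from the snippet above; the query string is illustrative:

```python
tokenizer = open_clip.get_tokenizer("ViT-B-16-SigLIP")

with torch.no_grad():
    tokens = tokenizer(["red floral summer dress"])
    text_emb = model.encode_text(tokens)             # [1, 768]

# Truncate + renormalize BOTH sides to the same dimension
text_256 = F.normalize(text_emb[:, :256], dim=-1)    # [1, 256]
cos_sim = (emb_256 @ text_256.T).item()              # image-text cosine similarity
```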
To shrink the index further, quantize to 8-bit with per-vector min-max scaling:

```python
import numpy as np

# `images`: a preprocessed batch, e.g. torch.stack([preprocess(img) for img in pil_images])
with torch.no_grad():
    emb = model.encode_image(images)[:, :256]
    emb = F.normalize(emb, dim=-1).numpy()

# Per-vector min-max quantization to 8-bit (keep emb_min/emb_max to dequantize later)
emb_min = emb.min(axis=1, keepdims=True)
emb_max = emb.max(axis=1, keepdims=True)
emb_int8 = np.round((emb - emb_min) / (emb_max - emb_min + 1e-8) * 255).astype(np.uint8)
```
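To search against the quantized vectors, dequantize on the fly with the stored per-vector min/max (two extra floats per vector); a sketch:

```python
# Invert the min-max mapping above (emb_min/emb_max were kept from encoding)
emb_restored = emb_int8.astype(np.float32) / 255.0 * (emb_max - emb_min) + emb_min
# Re-normalize: rounding slightly perturbs the unit norm
emb_restored /= np.linalg.norm(emb_restored, axis=1, keepdims=True)
```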
Dependencies:

```
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
```
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka (this model) | 64–768 | 67.42 (256-d) | Flexible dimension, 3× smaller index |
| MODA-Fashion-Vision-FP16 | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |
License: MIT
If you use this model, please cite:
```bibtex
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}
```