MODA-Fashion-Vision-FP16
Compressed vision-only encoder for fast fashion image-to-image retrieval: 4.2x smaller, same quality.
This is the vision tower extracted from MODA-Fashion-Distilled and converted to FP16 half precision. It strips the unused text encoder (image-to-image retrieval never touches it) and halves the weight precision, shrinking the model from 775 MB to 186 MB at a cost of only 0.21 pp Fine R@1 on LookBench.
Key Numbers
| Property | Value |
|---|---|
| Architecture | ViT-B/16-SigLIP (vision tower only) |
| Parameters | 92.9M (vs 203M full CLIP) |
| Precision | float16 |
| Model Size | 186 MB (vs 775 MB full CLIP) |
| Embedding Dim | 768 |
| Input Resolution | 224 x 224 |
| LookBench Fine R@1 | 67.42% (full model: 67.63%) |
LookBench Results (Fine Recall@1)
| Variant | Params | Size | RealStudio | AIGenStudio | RealStreet | AIGenStreet | Overall |
|---|---|---|---|---|---|---|---|
| MODA-Distilled (full CLIP) | 203M | 775 MB | 70.23 | 80.31 | 60.24 | 81.25 | 67.63 |
| MODA-Vision-FP16 (this) | 92.9M | 186 MB | 70.13 | 80.83 | 59.73 | 81.25 | 67.42 |
| FashionSigLIP baseline | 203M | 775 MB | 66.96 | 76.68 | 56.37 | 74.38 | 63.84 |
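For context, Recall@1 is the fraction of queries whose top-ranked gallery item is a correct match. The sketch below is only a generic illustration of the metric; it assumes a single correct gallery item per query and is not the official LookBench evaluation code.

```python
import torch

# Generic Recall@1: share of queries whose highest-scoring gallery item is the
# correct one. Assumes one correct gallery index per query; LookBench's "Fine"
# protocol defines the actual matching criterion.
def recall_at_1(scores: torch.Tensor, correct_idx: torch.Tensor) -> float:
    # scores:      [num_queries, num_gallery] similarity matrix
    # correct_idx: [num_queries] index of the correct gallery item per query
    top1 = scores.argmax(dim=1)
    return (top1 == correct_idx).float().mean().item()
```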
Inference: Quick Start
A standalone inference.py is included in this directory.
# Single image -> 768-d embedding
python inference.py --image query.jpg
# Two images -> embeddings + cosine similarity
python inference.py --image img1.jpg img2.jpg --similarity
# Run on GPU (keeps FP16 precision for speed)
python inference.py --image query.jpg --device cuda
Python API
import torch
import open_clip
import torch.nn.functional as F
from safetensors.torch import load_file
from PIL import Image
# Build the ViT-B-16-SigLIP architecture without downloading any pretrained weights.
# The text tower is randomly initialized (we never use it). Only the visual tower
# is overwritten with MODA's fine-tuned weights below. Suppresses the ~775 MB
# pretrained-checkpoint download that would otherwise come with hf-hub:Marqo/marqo-fashionSigLIP.
base_model, _, preprocess = open_clip.create_model_and_transforms(
"ViT-B-16-SigLIP", pretrained=None
)
# Load MODA's vision-only fp16 weights (186 MB) and overlay onto the visual tower.
vision_sd = load_file("path/to/moda-fashion-vision-fp16/vision_encoder.safetensors")
vision_sd_fp32 = {k: v.float() for k, v in vision_sd.items()}
full_sd = base_model.state_dict()
for k, v in vision_sd_fp32.items():
    full_sd[k] = v
base_model.load_state_dict(full_sd, strict=True)
encoder = base_model.visual
encoder.eval()
image = preprocess(Image.open("fashion_item.jpg")).unsqueeze(0)
with torch.no_grad():
    features = encoder(image)
    features = F.normalize(features, p=2, dim=-1)  # [1, 768]
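For retrieval, the same `encoder` and `preprocess` objects can embed a small gallery and rank it against a query by cosine similarity. A minimal sketch, reusing the variables defined above (file names are placeholders for your own images):

```python
# Reuses `encoder`, `preprocess`, `torch`, `F`, and `Image` from the snippet above.
# File names below are placeholders.
gallery_paths = ["dress_01.jpg", "dress_02.jpg", "sneaker_01.jpg"]

def embed(path):
    # preprocess -> [1, 3, 224, 224]; encoder -> [1, 768]; then L2-normalize.
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = encoder(x)
    return F.normalize(feats, p=2, dim=-1)

query = embed("query.jpg")
gallery = torch.cat([embed(p) for p in gallery_paths], dim=0)  # [N, 768]

# With unit-norm embeddings, cosine similarity is just a dot product.
scores = (query @ gallery.T).squeeze(0)  # [N]
for idx in scores.argsort(descending=True):
    print(f"{gallery_paths[idx]}: {scores[idx].item():.4f}")
```

For larger catalogs, the gallery embeddings would normally be computed once and stored in a vector index rather than re-embedded per query.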
Requirements
open_clip_torch>=2.20.0
torch>=2.0
Pillow
safetensors
Why Vision-Only?
For image-to-image retrieval (the primary use case), the text encoder is never used. Stripping it provides:
- 54% fewer parameters (92.9M vs 203M) -> faster forward pass (see the FP16 GPU sketch below)
- 4.2x smaller on disk (186 MB vs 775 MB) -> cheaper to deploy
- Near-zero quality loss (-0.21 pp Fine R@1) -> within noise margin
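The speed benefit is most pronounced on GPU, where the CLI's --device cuda path keeps the weights in FP16. The Python API snippet earlier upcasts to FP32 for CPU use; below is a minimal sketch of FP16 GPU inference, assuming a CUDA device and the `encoder`/`preprocess` objects from that snippet (the bundled inference.py's internals may differ).

```python
# Sketch of FP16 inference on GPU, analogous to `inference.py --device cuda`.
# Assumes `encoder`, `preprocess`, `torch`, `F`, and `Image` from the Python API section.
device = "cuda"
encoder_fp16 = encoder.to(device).half().eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device).half()
with torch.no_grad():
    feats = encoder_fp16(image)                   # [1, 768], float16
feats = F.normalize(feats.float(), p=2, dim=-1)   # upcast before downstream math
```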
Training
This model inherits its weights from MODA-Fashion-Distilled, which was trained via:
- Knowledge distillation from a 10-model ensemble (SigLIP, CLIP, EVA-CLIP, MetaCLIP, DFN variants)
- Stratified contrastive learning on Google Shopping data (10K image pairs, category-balanced)
- Vision-only export -> text tower removed
- FP16 conversion -> weights cast from float32 to float16 (a sketch of these two steps follows this list)
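The last two steps are mechanical. Below is a hedged sketch of how such an export could look; it is an illustration, not the actual MODA export tooling, the checkpoint filename is a placeholder, and the key layout is assumed to follow open_clip's `visual.` prefix (consistent with the loading code above).

```python
import torch
import open_clip
from safetensors.torch import save_file

# Illustration of a vision-only FP16 export, not the actual MODA script.
# "moda_fashion_distilled_full.pt" is a placeholder for the full FP32 checkpoint.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-16-SigLIP", pretrained=None)
model.load_state_dict(torch.load("moda_fashion_distilled_full.pt", map_location="cpu"))

# Keep only the visual tower's tensors and cast floating-point ones to FP16.
vision_fp16 = {
    k: (v.half() if v.is_floating_point() else v)
    for k, v in model.state_dict().items()
    if k.startswith("visual.")
}
save_file(vision_fp16, "vision_encoder.safetensors")
```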
Related Models
| Model | Dim | Fine R@1 | Best for |
|---|---|---|---|
| MODA-Fashion-Distilled | 768 | 67.63 | Best overall quality |
| MODA-Fashion-Matryoshka | 64-768 | 67.42 (256d) | Flexible dim, 3x smaller index |
| MODA-Fashion-Vision-FP16 (this model) | 768 | 67.42 | Smallest (186 MB), edge/mobile |
| MODA-Fashion-Distilled-512d | 512 | 67.63 | Compact index, highest nDCG@5 |
| MODA-Fashion-DeepFashion2 | 768 | 66.52 | Simplest recipe, no distillation |
License
MIT
Citation
If you use this model, please cite:
@software{moda2026,
  title  = {MODA: Open-source benchmark and models for fashion search},
  author = {Hopit AI},
  year   = {2026},
  url    = {https://github.com/hopit-ai/Moda}
}