Model Card - CLIP Zero-Shot

Overview

The CLIP (ViT-B/32) model is used off the shelf, with no fine-tuning, for zero-shot vibe matching.
It embeds user-entered movie-review text and outfit images in a shared embedding space and ranks outfits by their cosine similarity to the text, which serves as the vibe-alignment score.
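
The embed-and-rank flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`cosine`, `rank_by_cosine`, `embed_with_clip`) are hypothetical, and the sketch assumes `torch`, `Pillow`, and `transformers` are installed for the embedding step. The ranking helpers are plain Python and work on any vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(text_vec, image_vecs):
    """Return (image_index, score) pairs sorted from best to worst match."""
    scores = [(i, cosine(text_vec, v)) for i, v in enumerate(image_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def embed_with_clip(text, image_paths, model_id="openai/clip-vit-base-patch32"):
    """Embed one text and several images with CLIP (hypothetical helper).

    Downloads the checkpoint on first use. Returns the text embedding and
    a list of image embeddings as plain Python lists.
    """
    # Imported lazily so the ranking helpers above run without these packages.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)
    images = [Image.open(p).convert("RGB") for p in image_paths]
    # Reviews can exceed CLIP's 77-token text limit, so truncate them.
    inputs = processor(text=[text], images=images, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.text_embeds[0].tolist(), out.image_embeds.tolist()
```

With real data, `rank_by_cosine(*embed_with_clip(review, paths))` would return outfit indices ordered by similarity to the review; the first few entries form the recommendation list.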

Model Details

Field	Description
Developed by	Bareethul Kader & Nada Khan
Framework	Hugging Face Transformers
Base Model	openai/clip-vit-base-patch32
Repository	bareethulk/Forma
License	MIT (OpenAI CLIP)

Intended Use

Direct Use

Zero-shot text–image matching for outfit recommendations.
Core engine of the Gradio demo app.

Out-of-Scope Use

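The model is not fine-tuned for specific fashion styles, so it should not be relied on for fine-grained style distinctions.
It may inherit biases from the large-scale web data CLIP was trained on.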

Dataset

Evaluated on nadakandrew/closet_multimodal_v1, which provides paired image–text inputs for vibe ranking.

Evaluation Setup

Mode: Zero-shot classification + ranking
Metric Space: Cosine similarity (512-D)
Results:
- Accuracy: 91%
- Precision@5: 1.00
- NDCG@5: 0.96
- MRR: 0.95
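
The card does not say how these ranking metrics were computed. As a reference point, the sketch below implements the standard textbook definitions of Precision@k, NDCG@k, and MRR. It assumes each query's results are given as a list of relevance labels in ranked order (1 = relevant, 0 = not, or graded values for NDCG). All function names are illustrative.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked items that are relevant (0/1 labels)."""
    return sum(relevance[:k]) / k


def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the top-k items (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance, k):
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of per-query relevance lists."""
    return sum(reciprocal_rank(r) for r in queries) / len(queries)
```

Under these definitions, a Precision@5 of 1.00 means every item in the top five was relevant for every query. An NDCG@5 below 1.0 alongside it would then indicate graded relevance labels whose ordering was not always ideal.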

Interpretation: Zero-shot CLIP reaches 91% accuracy against 48% for the trained ResNet18, which suggests that pre-trained vision–language models are well suited to vibe alignment.
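
For the classification side of the evaluation, zero-shot prediction usually means choosing the label whose text prompt is most similar to the image. The sketch below shows that argmax step and the accuracy computation. It is an assumption about the general technique rather than the evaluation script behind the numbers above, and the label names in the usage note are made up for illustration.

```python
def zero_shot_predict(similarities, labels):
    """For each image, pick the label whose prompt has the highest similarity.

    `similarities` is a list of rows, one per image, each holding the cosine
    similarity between that image and every label prompt (same order as `labels`).
    """
    predictions = []
    for row in similarities:
        best = max(range(len(labels)), key=lambda j: row[j])
        predictions.append(labels[best])
    return predictions


def accuracy(predictions, targets):
    """Share of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

For example, with hypothetical labels `["cozy", "edgy"]` and one similarity row per image, `accuracy(zero_shot_predict(sims, labels), targets)` yields the fraction of images assigned their annotated vibe.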

Limitations / Ethical Notes

May reproduce biases from web data.
Does not capture deep emotional context behind reviews.
Research / educational use only.
