---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- image-text-retrieval
- multimodal
---
# DatologyAI CLIP Retrieval Optimized ViT-B/32
**DatologyAI CLIP Retrieval** is a state-of-the-art contrastive vision-language model optimized for image-text retrieval through advanced data curation. This retrieval-optimized ViT-B/32 model matches SigLIP2's MSCOCO retrieval performance while requiring roughly half the training samples.
## Model Description
DatologyAI's retrieval-optimized CLIP model demonstrates superior performance on retrieval benchmarks through targeted data curation strategies:
- **State-of-the-art MSCOCO performance** for ViT-B/32 models
- **2x training efficiency** compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation
## Intended Uses
This model is optimized for image-text retrieval tasks, cross-modal search, and multimodal understanding applications.
### Image-to-Text Retrieval
```python
import torch
from PIL import Image
import open_clip

# Load the model, preprocessing transforms, and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# Load and preprocess the image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate captions
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city",
]
text_tokens = tokenizer(texts)

# Encode both modalities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Normalize features to unit length
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Cosine similarities, scaled by 100
similarity = 100.0 * image_features @ text_features.T

# Rank captions by similarity
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
```
### Text-to-Image Retrieval
```python
import torch
import open_clip

def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5):
    """Retrieve the top-k images for a text query.

    Args:
        query: Text description to search for.
        image_features: Pre-computed normalized image features, shape [N, 512].
        top_k: Number of images to retrieve.
    """
    # Encode and normalize the text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarities against all cached image features
    similarities = (100.0 * text_features @ image_features.T).squeeze()

    # Return the top-k indices and scores
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()

# Example usage
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# Pre-compute normalized image features for your dataset
# image_features = ...  # Shape: [num_images, 512]

# Search for images once image_features is populated
# indices, scores = retrieve_images("a red sports car", image_features)
```
## Training Procedure
DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:
1. **Text-aligned distribution matching** - Prioritizes alignment along text representations for retrieval tasks
2. **Retrieval-specific synthetic data** - Optimized caption generation for cross-modal understanding
3. **Balanced multimodal representation** - Ensures strong performance in both directions
The model uses standard CLIP contrastive objectives without architectural modifications.
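The standard CLIP objective referenced above is a symmetric InfoNCE loss over cosine-similarity logits. A minimal NumPy sketch for illustration (not the training code; the temperature value is a common default, not necessarily what was used here):

```python
import numpy as np

def clip_contrastive_loss(image_feats: np.ndarray,
                          text_feats: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over cosine-similarity logits.

    image_feats, text_feats: L2-normalized arrays of shape [batch, dim];
    row i of each is a matched image-text pair.
    """
    logits = image_feats @ text_feats.T / temperature  # [batch, batch]
    labels = np.arange(len(logits))                    # matched pairs lie on the diagonal

    def cross_entropy(l, y):
        # log-softmax with max-subtraction for numerical stability
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Data curation changes which pairs fill each batch; the loss itself is unchanged, which is why the model remains drop-in compatible with standard CLIP tooling.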
## Training Data
The model was trained on image-text pairs curated from the **DataComp-XL** dataset using DatologyAI's retrieval-optimized curation pipeline, selecting high-quality pairs that enhance cross-modal alignment.
## Evaluation Results
### Retrieval Performance
| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|--------|------------|---------|----------|
| **MSCOCO** | Retrieval@1 | 55.53% | 55.45% | 46.6% |
| **Flickr30K** | Retrieval@1 | 79.7% | 82.4% | 72.9% |
### Training Efficiency
- Matches SigLIP2 MSCOCO performance with **50% fewer samples** (20B vs 40B)
- Exceeds MetaCLIP by >5% absolute on both benchmarks
## Model Details
- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
- **Optimization focus:** Image-text retrieval
## Technical Specifications
### Model Architecture
- **Vision Encoder:** ViT-B/32 (86M parameters)
- Patch size: 32×32
- Image size: 224×224
- Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
- Context length: 77 tokens
- Vocabulary size: 49,408 (BPE tokenizer)
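The architecture numbers above fix the vision encoder's sequence length: a 224×224 image cut into 32×32 patches gives a 7×7 grid, i.e. 49 patch tokens plus one class token. A quick arithmetic check:

```python
image_size, patch_size = 224, 32

patches_per_side = image_size // patch_size  # 224 / 32 = 7
num_patches = patches_per_side ** 2          # 7 * 7 = 49 patch tokens
seq_len = num_patches + 1                    # plus one class token

print(patches_per_side, num_patches, seq_len)
```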
### Training Configuration
- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 1e-3 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training approach:** Retrieval-optimized data curation
- **Hardware:** Distributed training on H100 GPUs
## Usage Tips
1. **Feature Caching**: For large-scale retrieval, pre-compute and cache image features
2. **Batch Processing**: Process multiple queries simultaneously for efficiency
3. **Normalization**: Always normalize features before computing similarities
4. **Temperature Scaling**: Adjust similarity temperature for different use cases
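Tips 1-3 combine naturally into a cached, batched retrieval loop. A minimal NumPy sketch (the feature arrays and 8-dim toy size are placeholders; in practice the cached features come from `model.encode_image` and are 512-dimensional):

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def batched_topk(query_feats: np.ndarray, cached_image_feats: np.ndarray, k: int = 5):
    """Top-k image indices for every query in a single matrix multiply.

    Both inputs must already be L2-normalized: shapes [Q, D] and [N, D].
    """
    sims = query_feats @ cached_image_feats.T  # [Q, N] cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of the best matches
    return topk, np.take_along_axis(sims, topk, axis=1)

# Toy example: 4 cached "images", 2 queries, 8-dim features
rng = np.random.default_rng(0)
cached = normalize(rng.standard_normal((4, 8)))
queries = normalize(cached[[2, 0]] + 0.01 * rng.standard_normal((2, 8)))
indices, scores = batched_topk(queries, cached, k=2)
print(indices[:, 0])  # best-matching cached index per query
```

Caching the normalized image features once means each incoming query costs a single [Q, N] matrix multiply rather than re-encoding the image collection.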
## Citation
If you use this model, please cite:
```bibtex
@article{datologyai2025clip,
  title   = {CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author  = {DatologyAI Team},
  journal = {DatologyAI Blog},
  year    = {2025},
  url     = {https://datologyai.com/blog/clip-data-upgrade}
}
```
## Additional Information
For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).
## Model Card Contact
DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)