---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- image-text-retrieval
- multimodal
---

# DatologyAI CLIP Retrieval Optimized ViT-B/32

**DatologyAI CLIP Retrieval** is a state-of-the-art contrastive vision-language model optimized for image-text retrieval tasks through advanced data curation. This retrieval-optimized ViT-B/32 model achieves competitive performance with SigLIP2 while training on roughly half as many samples.

## Model Description

DatologyAI's retrieval-optimized CLIP model demonstrates superior performance on retrieval benchmarks through targeted data curation strategies:

- **State-of-the-art MSCOCO performance** for ViT-B/32 models
- **2x training efficiency** compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation

## Intended Uses

This model is optimized for image-text retrieval, cross-modal search, and multimodal understanding applications.

### Image-to-Text Retrieval

```python
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')

# Load and process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define text candidates
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city"
]
text_tokens = tokenizer(texts)

# Compute features
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

# Normalize features
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)

# Calculate similarity
similarity = 100.0 * image_features @ text_features.T

# Get top matches
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
```

### Text-to-Image Retrieval

```python
import torch
import open_clip
from typing import List, Tuple

def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5) -> Tuple[List[int], List[float]]:
    """
    Retrieve the top-k images for a text query.

    Args:
        query: Text description to search for
        image_features: Pre-computed normalized image features [N, 512]
        top_k: Number of images to retrieve
    """
    # Encode text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarities
    similarities = (100.0 * text_features @ image_features.T).squeeze(0)

    # Get top-k matches
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()

# Example usage
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')

# Pre-compute image features for your dataset (see the sketch below)
# image_features = ...  # Shape: [num_images, 512]

# Search for images
indices, scores = retrieve_images("a red sports car", image_features)
```
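
The `image_features` tensor above is assumed to be pre-computed. Below is a minimal sketch of how such a feature bank could be built with the `model` and `preprocess` objects from the example; the folder path and batch size are placeholders, not part of the model card.

```python
import glob

import torch
from PIL import Image

# Hypothetical image folder; replace with your own dataset
image_paths = sorted(glob.glob("path/to/images/*.jpg"))
batch_size = 64

feature_chunks = []
with torch.no_grad():
    for start in range(0, len(image_paths), batch_size):
        batch = torch.stack([
            preprocess(Image.open(p).convert("RGB"))
            for p in image_paths[start:start + batch_size]
        ])
        feats = model.encode_image(batch)
        feats /= feats.norm(dim=-1, keepdim=True)  # normalize before caching
        feature_chunks.append(feats)

image_features = torch.cat(feature_chunks)  # shape: [num_images, 512]
```

Cached features can be saved to disk (for example with `torch.save`) so they only need to be computed once per image collection.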

## Training Procedure

DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:

1. **Text-aligned distribution matching** - Prioritizes alignment along text representations for retrieval tasks
2. **Retrieval-specific synthetic data** - Optimized caption generation for cross-modal understanding
3. **Balanced multimodal representation** - Ensures strong performance in both retrieval directions (image-to-text and text-to-image)

The model uses the standard CLIP contrastive objective without architectural modifications.
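
For reference, the standard CLIP objective referred to here is the symmetric InfoNCE loss over a batch of paired image and text embeddings. A simplified single-device sketch (not the actual training code; `logit_scale` is the exponentiated learned temperature):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss; assumes features are already L2-normalized."""
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    # Each image should match its own caption within the batch, and vice versa
    labels = torch.arange(image_features.shape[0], device=image_features.device)
    return (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
```

In practice OpenCLIP computes this loss over features gathered across all devices, which is what makes the large global batch size effective.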

## Training Data

The model was trained on image-text pairs curated from the **DataComp-XL** dataset using DatologyAI's retrieval-optimized curation pipeline, which selects high-quality pairs that enhance cross-modal alignment.

## Evaluation Results

### Retrieval Performance

| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|--------|------------|---------|----------|
| **MSCOCO** | Retrieval@1 | 55.53% | 55.45% | 46.6% |
| **Flickr30K** | Retrieval@1 | 79.7% | 82.4% | 72.9% |

### Training Efficiency

- Matches SigLIP2 MSCOCO performance with **50% fewer samples** (20B vs. 40B)
- Exceeds MetaCLIP by more than 5 points absolute on both benchmarks
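
Retrieval@1 here is standard Recall@1: the fraction of queries whose top-ranked candidate is a correct match. A minimal sketch of how text-to-image Recall@1 could be computed from cached, normalized features, assuming caption `i` is paired with image `i` (benchmarks such as MSCOCO, which have multiple captions per image, require the corresponding index mapping):

```python
import torch

def text_to_image_recall_at_1(text_features: torch.Tensor,
                              image_features: torch.Tensor) -> float:
    """Recall@1 assuming caption i matches image i; features are L2-normalized."""
    sims = text_features @ image_features.T    # [num_texts, num_images]
    top1 = sims.argmax(dim=-1)                 # best-ranked image per caption
    targets = torch.arange(text_features.shape[0], device=text_features.device)
    return (top1 == targets).float().mean().item()
```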

## Model Details

- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
- **Optimization focus:** Image-text retrieval

## Technical Specifications

### Model Architecture

- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
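
These figures can be sanity-checked directly against the loaded checkpoint. A small sketch using the `model` object from the usage examples above:

```python
import torch

# Parameter counts of the vision tower and the full model
vision_params = sum(p.numel() for p in model.visual.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"Vision encoder parameters: {vision_params / 1e6:.0f}M")
print(f"Total parameters: {total_params / 1e6:.0f}M")

# Dimension of the joint image-text embedding space
with torch.no_grad():
    dummy = torch.zeros(1, 3, 224, 224)
    print("Embedding dimension:", model.encode_image(dummy).shape[-1])  # expected: 512
```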

### Training Configuration

- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 1e-3 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training approach:** Retrieval-optimized data curation
- **Hardware:** Distributed training on H100 GPUs
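
As a rough illustration, the optimizer settings above correspond to the following PyTorch configuration (a sketch only, assuming a single parameter group; OpenCLIP's training script additionally excludes bias and gain parameters from weight decay):

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,              # peak learning rate, decayed with a cosine schedule
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.1,
)
```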

## Usage Tips

1. **Feature Caching**: For large-scale retrieval, pre-compute and cache image features (see the feature-bank sketch in the Text-to-Image Retrieval section above)
2. **Batch Processing**: Process multiple queries simultaneously for efficiency
3. **Normalization**: Always normalize features before computing similarities
4. **Temperature Scaling**: Adjust the similarity temperature for different use cases (see the sketch below this list)
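
For tip 4, a small sketch of turning raw cosine similarities into a probability distribution with an adjustable temperature (the default value below is illustrative; lower temperatures sharpen the distribution):

```python
import torch

def similarity_probs(image_features: torch.Tensor,
                     text_features: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Softmax over cosine similarities of L2-normalized features."""
    sims = image_features @ text_features.T   # [num_images, num_texts]
    return (sims / temperature).softmax(dim=-1)
```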

## Citation

If you use this model, please cite:

```bibtex
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
```

## Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

**Contact:** [team@datologyai.com](mailto:team@datologyai.com)

## Model Card Contact

DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)