---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- image-text-retrieval
- multimodal
---

# DatologyAI CLIP Retrieval Optimized ViT-B/32

**DatologyAI CLIP Retrieval** is a state-of-the-art contrastive vision-language model optimized for image-text retrieval through advanced data curation. This retrieval-optimized ViT-B/32 model matches SigLIP2 on MSCOCO retrieval while requiring roughly half the training compute.

## Model Description

DatologyAI's retrieval-optimized CLIP model achieves strong results on retrieval benchmarks through targeted data curation alone:

- **State-of-the-art MSCOCO performance** among ViT-B/32 models
- **2x training efficiency** compared to SigLIP2
- Optimized for text-based distribution alignment
- Standard CLIP architecture with retrieval-focused data curation

## Intended Uses

This model is optimized for image-text retrieval, cross-modal search, and multimodal understanding applications.

### Image-to-Text Retrieval

```python
import torch
from PIL import Image
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

# Load and process image
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define text candidates
texts = [
    "a photo of a cat",
    "a dog playing in the park",
    "a beautiful sunset over the ocean",
    "people walking in a city"
]
text_tokens = tokenizer(texts)

# Compute similarities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Normalize features
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Calculate similarity
    similarity = (100.0 * image_features @ text_features.T)

# Get top matches
values, indices = similarity[0].topk(len(texts))
for idx, score in zip(indices, values):
    print(f"{texts[idx]}: {score.item():.2f}")
```

### Text-to-Image Retrieval

```python
import torch
import open_clip

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

def retrieve_images(query: str, image_features: torch.Tensor, top_k: int = 5):
    """
    Retrieve the top-k images for a text query.

    Args:
        query: Text description to search for
        image_features: Pre-computed normalized image features [N, 512]
        top_k: Number of images to retrieve

    Returns:
        Indices and similarity scores of the top-k matches.
    """
    # Encode text query
    text_tokens = tokenizer([query])
    with torch.no_grad():
        text_features = model.encode_text(text_tokens)
        text_features /= text_features.norm(dim=-1, keepdim=True)

    # Compute similarities against all cached image features
    similarities = (100.0 * text_features @ image_features.T).squeeze(0)

    # Get top-k matches
    values, indices = similarities.topk(top_k)
    return indices.tolist(), values.tolist()

# Pre-compute image features for your dataset (a batching sketch follows below)
# image_features = ...  # Shape: [num_images, 512]

# Search for images
# indices, scores = retrieve_images("a red sports car", image_features)
```
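The example above assumes `image_features` has already been computed for your dataset. A minimal sketch of that precomputation step is shown below; the `precompute_image_features` helper, the batch size, and the placeholder image paths are illustrative assumptions, not part of the released API.

```python
import torch
from PIL import Image
import open_clip

# Load the same checkpoint as above
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/retr-opt-vit-b-32')
model.eval()

def precompute_image_features(image_paths, batch_size=64):
    """Encode image files into L2-normalized feature vectors of shape [N, 512]."""
    all_features = []
    for start in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[start:start + batch_size]
        # Preprocess each image to a [3, 224, 224] tensor and stack into a batch
        batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in batch_paths])
        with torch.no_grad():
            features = model.encode_image(batch)
            features /= features.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
        all_features.append(features)
    return torch.cat(all_features)

# Example usage (paths are placeholders)
# image_features = precompute_image_features(["img1.jpg", "img2.jpg"])
```

Caching the resulting tensor to disk (e.g., with `torch.save`) lets you serve many queries without re-encoding the image corpus.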
## Training Procedure

DatologyAI's retrieval-optimized pipeline employs specialized curation techniques:

1. **Text-aligned distribution matching** - Prioritizes alignment along text representations for retrieval tasks
2. **Retrieval-specific synthetic data** - Optimized caption generation for cross-modal understanding
3. **Balanced multimodal representation** - Ensures strong performance in both retrieval directions (image-to-text and text-to-image)

The model uses standard CLIP contrastive objectives without architectural modifications.

## Training Data

The model was trained on image-text pairs curated from the **DataComp-XL** dataset using DatologyAI's retrieval-optimized curation pipeline, selecting high-quality pairs that enhance cross-modal alignment.

## Evaluation Results

### Retrieval Performance

| Benchmark | Metric | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|--------|------------|---------|----------|
| **MSCOCO** | Retrieval@1 | 55.53% | 55.45% | 46.6% |
| **Flickr30K** | Retrieval@1 | 79.7% | 82.4% | 72.9% |

### Training Efficiency

- Matches SigLIP2 MSCOCO performance with **50% fewer samples** (20B vs. 40B)
- Exceeds MetaCLIP by more than 5 points absolute on both benchmarks

## Model Details

- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
- **Optimization focus:** Image-text retrieval

## Technical Specifications

### Model Architecture

- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)

### Training Configuration

- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 1e-3 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training approach:** Retrieval-optimized data curation
- **Hardware:** Distributed training on H100 GPUs

## Usage Tips

1. **Feature Caching**: For large-scale retrieval, pre-compute and cache image features
2. **Batch Processing**: Process multiple queries simultaneously for efficiency
3. **Normalization**: Always normalize features before computing similarities
4. **Temperature Scaling**: Adjust the similarity temperature for different use cases (see the sketch at the end of this card)

## Citation

If you use this model, please cite:

```bibtex
@article{datologyai2025clip,
  title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
  author={DatologyAI Team},
  journal={DatologyAI Blog},
  year={2025},
  url={https://datologyai.com/blog/clip-data-upgrade}
}
```

## Additional Information

For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

## Model Card Contact

DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)
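## Appendix: Temperature Scaling Sketch

As a companion to tip 4 in the Usage Tips, here is a minimal sketch of temperature scaling at inference time. The `scaled_probabilities` helper and its default `logit_scale` of 100.0 (matching the `100.0` multiplier in the examples above) are illustrative assumptions, not part of the released API.

```python
import torch

def scaled_probabilities(text_features: torch.Tensor,
                         image_features: torch.Tensor,
                         logit_scale: float = 100.0) -> torch.Tensor:
    """Turn cosine similarities into a probability distribution over images.

    Assumes both feature tensors are already L2-normalized. `logit_scale`
    is the inverse temperature: CLIP learns a value of roughly 100
    (temperature ~0.01). Raising it sharpens the softmax toward the top
    match; lowering it flattens the distribution for broader ranking.
    """
    logits = logit_scale * text_features @ image_features.T  # [1, N]
    return logits.softmax(dim=-1)
```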