---
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- datology
- clip
- vision
- OpenCLIP
- datacomp
- zero-shot-classification
---
# DatologyAI CLIP Classification Optimized ViT-B/32
**DatologyAI CLIP** is a contrastive vision-language model that reaches state-of-the-art zero-shot classification accuracy through data curation alone, without any architectural or training modifications. This classification-optimized ViT-B/32 model outperforms SigLIP2, MetaCLIP, and DFN on zero-shot classification benchmarks.
## Model Description
DatologyAI's CLIP model demonstrates that careful data curation can drive state-of-the-art performance without modifications to model architecture or training paradigms. Key achievements include:
- **76.91% ImageNet1k accuracy** (vs 74.0% for SigLIP2)
- **8x training efficiency** compared to standard approaches
- Trained on 13B curated image-text pairs from DataComp
- Standard CLIP architecture and training procedure
## Intended Uses
You can use this model for zero-shot image classification or as a vision encoder for VLMs and other vision tasks.
### Zero-shot Image Classification
```python
import torch
from PIL import Image
import open_clip
# Load model, preprocessing, and tokenizer
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()  # inference mode

# Load image and add a batch dimension
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Define candidate labels
labels = ["a dog", "a cat", "a bird"]
text = tokenizer(labels)

# Run inference
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize features to unit length
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarity -> probabilities over the labels
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

# Get predictions, highest probability first
values, indices = similarity[0].topk(3)
for value, index in zip(values, indices):
    print(f"{labels[index]}: {value.item():.2%}")
```
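The scoring step in the snippet above (normalize, scale by 100, softmax) can be tried without downloading the model. The following is a minimal, dependency-free sketch of that arithmetic only; the 3-dimensional vectors are made-up stand-ins for the real 512-dimensional embeddings, and the helper names are hypothetical, not part of OpenCLIP.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarity of one image against each label, then softmax."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy vectors for illustration only (assumption: not real model outputs).
toy_image = [0.9, 0.1, 0.0]
toy_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(toy_image, toy_texts)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"best label index: {best}")
```

Because the cosine similarities are multiplied by 100 before the softmax, small differences in similarity turn into very peaked probabilities, which is why the top label often prints close to 100%.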
### Image Encoding
```python
import torch
from PIL import Image
import open_clip
# Load model
model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
model.eval()

# Process image and add a batch dimension
image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

# Extract features
with torch.no_grad():
    image_features = model.encode_image(image)

print(f"Feature shape: {image_features.shape}")  # [1, 512]
```
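One possible downstream use of the extracted features is nearest-neighbor image retrieval by cosine similarity. The sketch below illustrates the idea under stated assumptions: the gallery file names and 3-dimensional vectors are hypothetical placeholders for embeddings returned by `model.encode_image`, and `nearest_images` is an illustrative helper rather than an OpenCLIP function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_images(query, gallery, k=2):
    """Return the k gallery names most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in gallery.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical gallery of precomputed embeddings (toy values, not model outputs).
gallery = {
    "dog_1.jpg": [0.9, 0.1, 0.0],
    "dog_2.jpg": [0.8, 0.2, 0.1],
    "cat_1.jpg": [0.1, 0.9, 0.2],
}
query = [1.0, 0.0, 0.0]

top = nearest_images(query, gallery)
print(top)
```

For a large gallery, the same ranking would normally be computed as a single matrix product over normalized features instead of a Python loop.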
## Training Procedure
DatologyAI's training pipeline focuses on sophisticated data curation techniques including:
1. **Improved target distribution matching** - Task-specific alignment of image features for classification
2. **Enhanced synthetic data generation** - Optimized caption generation for classification tasks
3. **Predictive metrics for curation quality** - Rapid iteration without full model training
The model uses standard CLIP training objectives with no architectural modifications.
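For readers unfamiliar with it, the standard CLIP objective is a symmetric cross-entropy over the image-text similarity matrix of a batch, where pair `i` is the only positive for row `i` and column `i`. The following is a minimal sketch of that loss in plain Python, not DatologyAI's training code. It assumes unit-length features and fixes `logit_scale` at 100 to match the inference snippet; in CLIP training the scale is a learned parameter.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def clip_contrastive_loss(image_feats, text_feats, logit_scale=100.0):
    """Symmetric cross-entropy; image i and text i form the matching pair."""
    n = len(image_feats)
    # logits[i][j] = scaled dot product of image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_feats]
        for img in image_feats
    ]
    # Image -> text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image direction: same computation over columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy unit vectors (assumption: stand-ins for real embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < mismatched_loss)
```

The loss is near zero when each image is most similar to its own caption and grows when captions are swapped, which is the signal that pulls matching pairs together during training.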
## Training Data
The model was trained on 13B image-text pairs (counted over multiple epochs) curated from the **DataComp-XL** dataset using DatologyAI's proprietary curation pipeline. The curation process selected high-quality, classification-relevant subsets from the 10B available pairs in DataComp-XL.
## Evaluation Results
### Zero-shot Classification Performance
| Benchmark | DatologyAI | SigLIP2 | MetaCLIP |
|-----------|------------|---------|----------|
| ImageNet1k | **76.91%** | 74.0% | 67.7% |
| ImageNetv2 | **70.2%** | 67.1% | 60.4% |
### Training Efficiency
- Matches SigLIP2 performance with only **5B samples** (87.5% compute reduction)
- Matches MetaCLIP performance with only **1B samples** (92% compute reduction)
For full details, see the [blog post](https://datologyai.com/blog/clip-data-upgrade).
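The reductions quoted above imply a baseline sample count for each comparison model. The sketch below back-solves those baselines from the stated numbers only, under the assumption that compute scales linearly with samples seen; the function name is hypothetical.

```python
def implied_baseline_samples(samples_used_b, reduction):
    """Baseline sample count (billions) implied by a fractional compute reduction."""
    return samples_used_b / (1.0 - reduction)


# Numbers taken from the bullets above.
siglip2_baseline_b = implied_baseline_samples(5.0, 0.875)
metaclip_baseline_b = implied_baseline_samples(1.0, 0.92)

print(f"Implied SigLIP2 baseline:  {siglip2_baseline_b:.1f}B samples")
print(f"Implied MetaCLIP baseline: {metaclip_baseline_b:.1f}B samples")
print(f"Speed-up vs SigLIP2:       {siglip2_baseline_b / 5.0:.0f}x")
```

Under that assumption, the SigLIP2 comparison works out to roughly 40B baseline samples, which is consistent with the 8x training efficiency figure listed under Model Description.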
## Model Details
- **Developed by:** DatologyAI
- **Model type:** CLIP (Contrastive Language-Image Pre-training)
- **Architecture:** Vision Transformer B/32
- **License:** Apache 2.0
- **Training framework:** OpenCLIP 2.24.0
## Technical Specifications
### Model Architecture
- **Vision Encoder:** ViT-B/32 (86M parameters)
  - Patch size: 32×32
  - Image size: 224×224
  - Embedding dimension: 512
- **Text Encoder:** 12-layer Transformer
  - Context length: 77 tokens
  - Vocabulary size: 49,408 (BPE tokenizer)
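
The patch and image sizes above determine how many tokens the vision encoder processes per image. A small sketch of that arithmetic follows; it assumes a single class token is prepended to the patch sequence, as in the standard ViT design, and the function name is illustrative.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + (1 if class_token else 0)


# Values from the specification above: 224x224 images, 32x32 patches.
num_patches = vit_sequence_length(224, 32, class_token=False)
sequence_length = vit_sequence_length(224, 32)
print(num_patches, sequence_length)
```

A 7×7 grid gives 49 patches, so the sequence is short compared with smaller-patch variants such as B/16, which is part of why B/32 models are cheap to run.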
### Training Configuration
- **Optimizer:** AdamW (β1=0.9, β2=0.98, ε=1e-6)
- **Learning rate:** 5.0e-04 with cosine schedule
- **Weight decay:** 0.1
- **Batch size:** 32,768
- **Training samples:** 13B image-text pairs
- **Hardware:** Distributed training on H100 GPUs
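
The batch size and sample count above give an approximate number of optimizer steps, and the cosine schedule can be sketched from the base learning rate. The code below is an illustration under stated assumptions: "13B" is treated as exactly 13,000,000,000, the schedule decays to zero, and warmup is left at zero because the configuration above does not list one. `cosine_lr` is a hypothetical helper, not the OpenCLIP scheduler.

```python
import math

TOTAL_SAMPLES = 13_000_000_000  # "13B", treated as exact for illustration
BATCH_SIZE = 32_768
BASE_LR = 5.0e-4

# Number of optimizer steps needed to see every sample once per batch slot.
total_steps = math.ceil(TOTAL_SAMPLES / BATCH_SIZE)


def cosine_lr(step, total_steps, base_lr=BASE_LR, warmup_steps=0):
    """Linear warmup (optional) followed by cosine decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


print(f"total steps: {total_steps:,}")
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    step = int(fraction * total_steps)
    print(f"step {step:>7,}: lr = {cosine_lr(step, total_steps):.2e}")
```

With these numbers the run is on the order of 400,000 steps, and the learning rate falls to about half its base value at the midpoint of training.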
## Citation
If you use this model, please cite:
```bibtex
@article{datologyai2025clip,
title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
author={DatologyAI Team},
journal={DatologyAI Blog},
year={2025},
url={https://datologyai.com/blog/clip-data-upgrade}
}
```
## Additional Information
For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).
**Contact:** [team@datologyai.com](mailto:team@datologyai.com)
## Model Card Contact
DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)