	---
	license: apache-2.0
	pipeline_tag: zero-shot-image-classification
	tags:
	- datology
	- clip
	- vision
	- OpenCLIP
	- datacomp
	- zero-shot-classification
	---

	# DatologyAI CLIP Classification Optimized ViT-B/32

	DatologyAI CLIP is a contrastive vision-language model whose gains come from data curation alone, with no changes to the architecture or training procedure. This classification-optimized ViT-B/32 model outperforms SigLIP2, MetaCLIP, and DFN on zero-shot classification benchmarks.

	## Model Description

	DatologyAI's CLIP model demonstrates that careful data curation can drive state-of-the-art performance without modifications to model architecture or training paradigms. Key achievements include:

	- 76.91% ImageNet1k accuracy (vs 74.0% for SigLIP2)
	- 8x training efficiency compared to standard approaches
	- Trained on 13B curated image-text pairs from DataComp
	- Standard CLIP architecture and training procedure

	## Intended Uses

	You can use this model for zero-shot image classification or as a vision encoder for VLMs and other vision tasks.

	### Zero-shot Image Classification

	```python
	import torch
	from PIL import Image
	import open_clip

	# Load model and preprocessing
	model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
	model.eval()  # switch to inference mode
	tokenizer = open_clip.get_tokenizer('hf-hub:DatologyAI/cls-opt-vit-b-32')

	# Load image
	image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

	# Define candidate labels
	labels = ["a dog", "a cat", "a bird"]
	text = tokenizer(labels)

	# Run inference
	with torch.no_grad():
	    image_features = model.encode_image(image)
	    text_features = model.encode_text(text)

	    # Normalize features
	    image_features /= image_features.norm(dim=-1, keepdim=True)
	    text_features /= text_features.norm(dim=-1, keepdim=True)

	    # Calculate similarity
	    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

	# Get predictions
	values, indices = similarity[0].topk(3)
	for value, index in zip(values, indices):
	    print(f"{labels[index]}: {value.item():.2%}")
	```
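To make the scoring step concrete, here is a minimal pure-Python sketch of what the normalize, scale, and softmax lines compute. The three-dimensional vectors are made-up stand-ins for real 512-dimensional embeddings, and `l2_normalize` and `zero_shot_probs` are hypothetical helpers, not part of OpenCLIP.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label."""
    img = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_vecs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (illustrative values only, not model outputs)
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = zero_shot_probs(image_vec, text_vecs)
print(probs)
```

The factor of 100 plays the role of an inverse temperature: larger values push the softmax toward the top-scoring label.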

	### Image Encoding

	```python
	import torch
	from PIL import Image
	import open_clip

	# Load model
	model, _, preprocess = open_clip.create_model_and_transforms('hf-hub:DatologyAI/cls-opt-vit-b-32')
	model.eval()

	# Process image
	image = preprocess(Image.open("path/to/image.jpg")).unsqueeze(0)

	# Extract features
	with torch.no_grad():
	    image_features = model.encode_image(image)

	print(f"Feature shape: {image_features.shape}") # [1, 512]
	```
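One possible downstream use of the extracted features is similarity search. The sketch below ranks a small gallery of embeddings against a query by cosine similarity. The vectors are toy values standing in for 512-dimensional features, and `cosine_similarity`, `rank_gallery`, and the gallery keys are hypothetical names chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (illustrative values only, not model outputs)
gallery = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.6, 0.8, 0.0],
    "img_c": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]
ranking = rank_gallery(query, gallery)
print(ranking)
```

In practice the gallery would hold outputs of `model.encode_image`, converted to lists or kept as tensors and compared with a single matrix product.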

	## Training Procedure

	DatologyAI's training pipeline focuses on sophisticated data curation techniques including:

	1. Improved target distribution matching - Task-specific alignment of image features for classification
	2. Enhanced synthetic data generation - Optimized caption generation for classification tasks
	3. Predictive metrics for curation quality - Rapid iteration without full model training

	The model uses standard CLIP training objectives with no architectural modifications.

	## Training Data

	The model was trained on 13B image-text pairs (seen over multiple epochs) curated from the DataComp-XL dataset using DatologyAI's proprietary curation pipeline. The curation process selected high-quality, classification-relevant subsets from the 10B available pairs in DataComp-XL.

	## Evaluation Results

	### Zero-shot Classification Performance

	| Benchmark | DatologyAI | SigLIP2 | MetaCLIP |
	|-----------|------------|---------|----------|
	| ImageNet1k | 76.91% | 74.0% | 67.7% |
	| ImageNetv2 | 70.2% | 67.1% | 60.4% |

	### Training Efficiency
	- Matches SigLIP2 performance with only 5B samples (87.5% compute reduction)
	- Matches MetaCLIP performance with only 1B samples (92% compute reduction)
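The percentages above can be related to speedup factors with simple ratios. The sketch below assumes compute scales linearly with the number of samples seen, which is an assumption of this example and is not stated in the model card. Both function names are hypothetical.

```python
def compute_reduction(speedup):
    """Fraction of compute saved when a result needs 1/speedup of the samples."""
    return 1.0 - 1.0 / speedup

def speedup_from_reduction(reduction):
    """Inverse mapping: speedup factor implied by a fractional compute saving."""
    return 1.0 / (1.0 - reduction)

# The 8x efficiency figure from the model description maps to the 87.5% reduction
siglip2_reduction = compute_reduction(8.0)

# The 92% reduction quoted for MetaCLIP corresponds to roughly a 12.5x speedup
metaclip_speedup = speedup_from_reduction(0.92)

print(f"{siglip2_reduction:.1%}", f"{metaclip_speedup:.1f}x")
```

Under the same linear assumption, 5B samples at 8x implies a baseline of about 40B samples, and 1B samples at 12.5x implies about 12.5B.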

	For full details, see the [blog post](https://datologyai.com/blog/clip-data-upgrade).

	## Model Details

	- Developed by: DatologyAI
	- Model type: CLIP (Contrastive Language-Image Pre-training)
	- Architecture: Vision Transformer B/32
	- License: Apache 2.0
	- Training framework: OpenCLIP 2.24.0

	## Technical Specifications

	### Model Architecture
	- Vision Encoder: ViT-B/32 (86M parameters)
	- Patch size: 32×32
	- Image size: 224×224
	- Embedding dimension: 512
	- Text Encoder: 12-layer Transformer
	- Context length: 77 tokens
	- Vocabulary size: 49,408 (BPE tokenizer)
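The patch and image sizes above determine how many tokens the vision encoder processes per image. The arithmetic below is a small sketch. It assumes RGB input and a single prepended class token, as in the standard CLIP ViT, neither of which the list above states explicitly.

```python
IMAGE_SIZE = 224   # pixels per side, from the specification above
PATCH_SIZE = 32    # pixels per patch side, from the specification above
CHANNELS = 3       # assumption: RGB input

patches_per_side = IMAGE_SIZE // PATCH_SIZE          # 224 / 32 = 7
num_patches = patches_per_side ** 2                  # 7 * 7 = 49
sequence_length = num_patches + 1                    # assumption: one class token
values_per_patch = PATCH_SIZE * PATCH_SIZE * CHANNELS  # raw pixel values per patch

print(patches_per_side, num_patches, sequence_length, values_per_patch)
```

The short sequence is what makes a /32 model cheap to run compared with /16 variants, which would split the same image into 196 patches.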

	### Training Configuration
	- Optimizer: AdamW (β1=0.9, β2=0.98, ε=1e-6)
	- Learning rate: 5.0e-04 with cosine schedule
	- Weight decay: 0.1
	- Batch size: 32,768
	- Training samples: 13B image-text pairs
	- Hardware: Distributed training on H100 GPUs
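The sample count and batch size above imply a step count, and the cosine schedule can be sketched from the peak learning rate. The warmup length and the decay to zero are assumptions of this example, since the configuration lists only the peak rate and the schedule type. `lr_at` is a hypothetical helper, not OpenCLIP's scheduler.

```python
import math

TOTAL_SAMPLES = 13_000_000_000  # from the configuration above
BATCH_SIZE = 32_768             # from the configuration above
PEAK_LR = 5.0e-4                # from the configuration above
WARMUP_STEPS = 2_000            # hypothetical: not given in the model card

# Number of optimizer steps needed to see all samples once at this batch size
total_steps = math.ceil(TOTAL_SAMPLES / BATCH_SIZE)

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

print(total_steps, lr_at(0), lr_at(WARMUP_STEPS), lr_at(total_steps))
```

At this batch size, 13B samples works out to a little under 400,000 optimizer steps.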

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{datologyai2025clip,
	title={CLIP Gets a Data Upgrade: Outperforming SoTA with Improved Data Curation Only},
	author={DatologyAI Team},
	journal={DatologyAI Blog},
	year={2025},
	url={https://datologyai.com/blog/clip-data-upgrade}
	}
	```

	## Additional Information

	For more details on our data curation methodology and comprehensive benchmark results, please visit our [blog post](https://datologyai.com/blog/clip-data-upgrade).

	Contact: [team@datologyai.com](mailto:team@datologyai.com)

	## Model Card Contact

	DatologyAI Team - [team@datologyai.com](mailto:team@datologyai.com)