---
pipeline_tag: zero-shot-image-classification
tags:
- open_clip
- clip
- vision
- image-text-retrieval
- laion400m
---

# ViT-B-32 OpenCLIP Model on LAION-400M

This is a ViT-B-32 model trained using [OpenCLIP](https://github.com/mlfoundations/open_clip) on the LAION-400M dataset.

## Training Details

The model was trained with the following configuration:

- **Model Architecture**: ViT-B-32
- **Dataset**: LAION-400M
- **Number of Samples**: 400M nominal (~268,836,185 samples used after filtering)
- **Hardware**: 2 nodes, each with 4 H200 141GB GPUs (8 GPUs total)
- **Batch Size (per GPU)**: 4096
- **Precision**: `amp_bfloat16`
- **Total Epochs**: 32
- **Warmup Steps**: 2000

Additional performance-related flags enabled during training: `--torchcompile`, `--local-loss`, and `--gather-with-grad`.

## Evaluation

- **Eval Epoch**: 0
- **imagenet-zeroshot-val-top1**: 0.6086
- **imagenet-zeroshot-val-top5**: 0.8632

## Usage

```python
import torch
import open_clip
from PIL import Image

# Load the model directly from Hugging Face
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='hf-hub:lingkai/open-clip')
model.eval()  # open_clip models are in train mode by default
tokenizer = open_clip.get_tokenizer('ViT-B-32')

# Example inference: one image scored against three candidate captions
image = preprocess(Image.open("astronaut.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale similarities by 100, then softmax over the candidate captions
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
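
As a rough sanity check on the training configuration listed under Training Details, the sketch below derives the global batch size and an approximate step count from the stated numbers. It assumes one optimizer step per global batch (no gradient accumulation), that each epoch covers the ~268.8M filtered samples, and that the final partial batch of each epoch is dropped. None of these are confirmed by the card, so treat the step counts as estimates only.

```python
# Values copied from the training configuration in this card.
num_gpus = 2 * 4                  # 2 nodes x 4 GPUs
per_gpu_batch = 4096
samples_per_epoch = 268_836_185   # filtered sample count
epochs = 32
warmup_steps = 2000

# Global batch size: every GPU contributes one per-GPU batch per step.
global_batch = num_gpus * per_gpu_batch

# Assumption: one optimizer step per global batch, partial batch dropped.
steps_per_epoch = samples_per_epoch // global_batch
total_steps = steps_per_epoch * epochs

# Share of the (estimated) training run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps

print(f"global batch size: {global_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"warmup fraction:   {warmup_fraction:.2%}")
```

Under these assumptions the global batch is 32,768 samples, and warmup covers under 1% of the run.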