Tags: zero-shot-classification · onnx · clip · multimodal · visual-search · inference4j

CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of openai/clip-vit-base-patch32 split into separate vision and text encoder models for independent use.

Converted for use with inference4j, an inference-only AI library for Java.

Usage with inference4j

Visual search (image-text similarity)

```java
import java.nio.file.Path;
import javax.imageio.ImageIO;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    // Encode image and text into the shared 512-dim embedding space
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    // With L2-normalized embeddings, the dot product is the cosine similarity
    float similarity = dot(imageEmb, textEmb);
}
```

Zero-shot classification

```java
// Reuses imageEncoder, textEncoder, and dot(...) from the example above
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

// Pick the label whose text embedding is most similar to the image
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
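The `dot` helper used in the snippets is not part of the library API shown above; a minimal sketch follows (the class name is illustrative). If the encoders return raw, unnormalized embeddings, normalize first so the dot product equals cosine similarity:

```java
public class ClipMath {

    // Dot product of two equal-length embedding vectors
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // L2-normalize an embedding; after normalization,
    // dot(a, b) is the cosine similarity of a and b
    public static float[] l2Normalize(float[] v) {
        float norm = 0f;
        for (float x : v) {
            norm += x * x;
        }
        norm = (float) Math.sqrt(norm);
        float[] out = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] / norm;
        }
        return out;
    }
}
```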

Files

| File | Description | Size |
|---|---|---|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |

Model Details

| Property | Value |
|---|---|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]`, CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |

Preprocessing

Vision

  1. Resize to 224×224 (bicubic)
  2. CLIP normalization: mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711]
  3. NCHW layout: [1, 3, 224, 224]
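inference4j's `ClipImageEncoder` presumably performs these steps internally; if you run `vision_model.onnx` directly (e.g. with ONNX Runtime), the three steps above can be sketched as follows. The class and method names are illustrative, not part of any API:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ClipPreprocess {
    private static final int SIZE = 224;
    private static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    private static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    // Resize to 224x224 (bicubic), apply CLIP normalization,
    // and lay the pixels out as an NCHW [1, 3, 224, 224] buffer
    public static float[] toNchw(BufferedImage src) {
        BufferedImage img = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, SIZE, SIZE, null);
        g.dispose();

        float[] out = new float[3 * SIZE * SIZE];
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int rgb = img.getRGB(x, y);
                // Channel order is R, G, B; each channel plane holds SIZE*SIZE values
                for (int c = 0; c < 3; c++) {
                    int value = (rgb >> (16 - 8 * c)) & 0xFF;
                    out[c * SIZE * SIZE + y * SIZE + x] =
                        (value / 255f - MEAN[c]) / STD[c];
                }
            }
        }
        return out;
    }
}
```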

Text

  1. Byte-level BPE tokenization using vocab.json + merges.txt
  2. Add <|startoftext|> (49406) and <|endoftext|> (49407)
  3. Pad/truncate to 77 tokens
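Steps 2 and 3 (special tokens plus padding) can be sketched as below; the BPE tokenization itself (step 1) is omitted, and the class and method names are illustrative:

```java
public class ClipTextPad {
    private static final int BOS = 49406;   // <|startoftext|>
    private static final int EOS = 49407;   // <|endoftext|>
    private static final int CONTEXT = 77;  // CLIP's fixed text length

    // Wrap BPE token ids with BOS/EOS, truncate to fit, zero-pad to 77
    public static long[] padToContext(int[] bpeIds) {
        long[] ids = new long[CONTEXT];               // zeros act as padding
        ids[0] = BOS;
        int n = Math.min(bpeIds.length, CONTEXT - 2); // leave room for BOS and EOS
        for (int i = 0; i < n; i++) {
            ids[i + 1] = bpeIds[i];
        }
        ids[n + 1] = EOS;
        return ids;
    }

    // attention_mask: 1 over real tokens (including BOS/EOS), 0 over padding
    public static long[] attentionMask(int[] bpeIds) {
        long[] mask = new long[CONTEXT];
        int used = Math.min(bpeIds.length, CONTEXT - 2) + 2;
        for (int i = 0; i < used; i++) {
            mask[i] = 1;
        }
        return mask;
    }
}
```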

Original Paper

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020

License

The original CLIP model is released under the MIT License by OpenAI.
