---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---
# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)
ONNX export of [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32), split into separate vision and text encoder models for independent use.

Converted for use with inference4j, an inference-only AI library for Java.
## Usage with inference4j

### Visual search (image-text similarity)
```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");
    float similarity = dot(imageEmb, textEmb);
}
```
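The `dot` helper is not shown in the snippet above; a minimal sketch follows (the class name is illustrative, and it assumes both encoders return equal-length `float[]` embeddings). For L2-normalized CLIP embeddings the dot product equals cosine similarity; a `cosine` variant is included for unnormalized vectors:

```java
public class Similarity {
    // Dot product of two equal-length embeddings. For L2-normalized
    // CLIP embeddings this is the cosine similarity.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Cosine similarity for vectors that are not yet normalized.
    static float cosine(float[] a, float[] b) {
        float dot = 0f, na = 0f, nb = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(na) * Math.sqrt(nb)));
    }
}
```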
### Zero-shot classification
```java
float[] imageEmb = imageEncoder.encode(photo);

String[] labels = {"cat", "dog", "bird", "car"};
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
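The loop above only picks the best label. To turn the raw similarity scores into class probabilities, CLIP multiplies cosine similarities by its learned logit scale (approximately 100 for the released checkpoint) and applies a softmax. A sketch of that step (the class and method names are illustrative):

```java
public class ZeroShot {
    // Temperature-scaled softmax over cosine similarities, as in CLIP:
    // probs = softmax(logitScale * similarities). Subtracting the max
    // before exponentiating keeps the computation numerically stable.
    static float[] softmax(float[] scores, float logitScale) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : scores) max = Math.max(max, s * logitScale);
        float[] probs = new float[scores.length];
        float sum = 0f;
        for (int i = 0; i < scores.length; i++) {
            probs[i] = (float) Math.exp(scores[i] * logitScale - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }
}
```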
## Files
| File | Description | Size |
|---|---|---|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |
## Model Details
| Property | Value |
|---|---|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |
## Preprocessing

### Vision
- Resize to 224×224 (bicubic)
- CLIP normalization: mean = `[0.48145466, 0.4578275, 0.40821073]`, std = `[0.26862954, 0.26130258, 0.27577711]`
- NCHW layout: `[1, 3, 224, 224]`
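The steps above can be sketched in plain Java. This is an illustrative stand-in for whatever preprocessing inference4j performs internally; the class and method names are assumptions, and only the constants come from the card:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ClipImagePreprocess {
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    // Resize to 224x224 (bicubic), scale pixel values to [0, 1], apply
    // CLIP normalization, and return the flattened NCHW [1, 3, 224, 224]
    // tensor data.
    static float[] preprocess(BufferedImage src) {
        BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, 224, 224, null);
        g.dispose();

        float[] chw = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = resized.getRGB(x, y);
                float r  = ((rgb >> 16) & 0xFF) / 255f;
                float gr = ((rgb >> 8) & 0xFF) / 255f;
                float b  = (rgb & 0xFF) / 255f;
                // Channel-major (CHW) layout: one 224x224 plane per channel.
                chw[0 * 224 * 224 + y * 224 + x] = (r  - MEAN[0]) / STD[0];
                chw[1 * 224 * 224 + y * 224 + x] = (gr - MEAN[1]) / STD[1];
                chw[2 * 224 * 224 + y * 224 + x] = (b  - MEAN[2]) / STD[2];
            }
        }
        return chw;
    }
}
```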
### Text
- Byte-level BPE tokenization using `vocab.json` + `merges.txt`
- Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
- Pad/truncate to 77 tokens
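The special-token and padding steps above can be sketched as follows. BPE encoding itself is omitted (the library handles it via `vocab.json` and `merges.txt`); padding with ID 0 follows the original OpenAI implementation (Hugging Face's tokenizer pads differently), and the class name is illustrative:

```java
import java.util.Arrays;

public class ClipTextPad {
    static final int BOS = 49406;  // <|startoftext|>
    static final int EOS = 49407;  // <|endoftext|>
    static final int MAX_LEN = 77;

    // Wrap BPE token IDs with BOS/EOS, truncate so the sequence fits in
    // 77 positions, and pad the remainder with 0. Returns {input_ids,
    // attention_mask}; the mask is 1 for real tokens, 0 for padding.
    static int[][] encode(int[] bpeIds) {
        int body = Math.min(bpeIds.length, MAX_LEN - 2); // leave room for BOS/EOS
        int[] ids = new int[MAX_LEN];
        int[] mask = new int[MAX_LEN];
        ids[0] = BOS;
        for (int i = 0; i < body; i++) ids[i + 1] = bpeIds[i];
        ids[body + 1] = EOS;
        Arrays.fill(mask, 0, body + 2, 1);
        return new int[][]{ids, mask};
    }
}
```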
## Original Paper

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)
## License
The original CLIP model is released under the MIT License by OpenAI.