---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---
# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)
ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
split into separate vision and text encoder models for independent use.
Converted for use with [inference4j](https://github.com/inference4j/inference4j),
an inference-only AI library for Java.
## Usage with inference4j
### Visual search (image-text similarity)
```java
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");
    float similarity = dot(imageEmb, textEmb);
}
```
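The `dot` helper is left undefined in the snippet above. Assuming the encoders return L2-normalized embeddings (as CLIP models conventionally do), a plain dot product equals cosine similarity. A minimal version:

```java
// Dot product of two equal-length embedding vectors.
// With L2-normalized CLIP embeddings this is the cosine similarity.
float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```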
### Zero-shot classification
```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
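Raw similarity scores are fine for picking the best label, but they are not probabilities. To report class probabilities, one common approach (a sketch, not an inference4j API) is a temperature-scaled softmax; the scale of 100 approximates the exponentiated logit scale of the released CLIP model and is an assumption here:

```java
// Turn cosine-similarity scores into class probabilities with a
// temperature-scaled softmax. logitScale ~= 100 mirrors CLIP's learned
// logit scale; adjust or drop it depending on how peaked you want the
// distribution. Subtracting the max keeps exp() numerically stable.
float[] softmax(float[] scores, float logitScale) {
    float max = Float.NEGATIVE_INFINITY;
    for (float s : scores) max = Math.max(max, s * logitScale);
    float[] out = new float[scores.length];
    float sum = 0f;
    for (int i = 0; i < scores.length; i++) {
        out[i] = (float) Math.exp(scores[i] * logitScale - max);
        sum += out[i];
    }
    for (int i = 0; i < out.length; i++) out[i] /= sum;
    return out;
}
```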
## Files
| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48894 merges) | ~1.7 MB |
## Model Details
| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask` `[batch, 77]` |
| ONNX opset | 17 |
## Preprocessing
### Vision
1. Resize to 224×224 (bicubic interpolation)
2. Scale pixel values to `[0, 1]`, then apply CLIP normalization:
   mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. Arrange in NCHW layout: `[1, 3, 224, 224]`
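If you feed `vision_model.onnx` directly instead of going through inference4j (which presumably performs this preprocessing internally), the steps above can be sketched in plain Java:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Resize to 224x224 (bicubic), scale channels to [0, 1], normalize with
// CLIP mean/std, and flatten to an NCHW [1, 3, 224, 224] float buffer.
// Sketch only; the exact resize/crop policy of this export may differ.
float[] preprocess(BufferedImage src) {
    final int size = 224;
    float[] mean = {0.48145466f, 0.4578275f, 0.40821073f};
    float[] std  = {0.26862954f, 0.26130258f, 0.27577711f};

    BufferedImage resized = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    g.drawImage(src, 0, 0, size, size, null);
    g.dispose();

    float[] chw = new float[3 * size * size];
    int plane = size * size;
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++) {
            int rgb = resized.getRGB(x, y);
            int idx = y * size + x;
            chw[idx]             = (((rgb >> 16) & 0xFF) / 255f - mean[0]) / std[0]; // R
            chw[plane + idx]     = (((rgb >> 8) & 0xFF) / 255f - mean[1]) / std[1];  // G
            chw[2 * plane + idx] = ((rgb & 0xFF) / 255f - mean[2]) / std[2];         // B
        }
    }
    return chw;
}
```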
### Text
1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens
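Steps 2 and 3 can be sketched as follows. The BPE tokenization itself is omitted, and zero-padding follows the original OpenAI implementation; check the library's tokenizer for the pad id it actually uses (the hypothetical `padTokens` below is not an inference4j API):

```java
// Wrap BPE token ids with <|startoftext|>/<|endoftext|> and pad or
// truncate to the fixed context length of 77 for text_model.onnx.
long[] padTokens(long[] tokenIds) {
    final int contextLen = 77;
    final long bos = 49406L, eos = 49407L;
    long[] out = new long[contextLen];          // zero-padded by default
    out[0] = bos;
    int n = Math.min(tokenIds.length, contextLen - 2); // room for BOS + EOS
    for (int i = 0; i < n; i++) out[i + 1] = tokenIds[i];
    out[n + 1] = eos;
    return out;
}
```

The matching `attention_mask` is 1 for the `n + 2` real tokens and 0 for the padding positions.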
## Original Paper
> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)
## License
The original CLIP model is released under the MIT License by OpenAI.