---
library_name: onnx
tags:
  - clip
  - multimodal
  - visual-search
  - zero-shot-classification
  - onnx
  - inference4j
license: mit
datasets:
  - openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32)
split into separate vision and text encoder models for independent use.

Converted for use with [inference4j](https://github.com/inference4j/inference4j),
an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    float similarity = dot(imageEmb, textEmb);
}
```
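The `dot` helper above is not part of the library snippet; a minimal sketch is a plain dot product. If the encoders return L2-normalized embeddings this equals cosine similarity; otherwise divide by the product of the two vector norms.

```java
// Plain dot product of two equal-length embedding vectors.
// Equals cosine similarity when both inputs are L2-normalized.
static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```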

### Zero-shot classification

```java
// reuses imageEncoder/textEncoder from the example above
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```

## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask` `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision
1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`
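The three steps above can be sketched in plain Java (the method name is illustrative, not the inference4j API): bicubic resize via `Graphics2D`, per-channel CLIP normalization, and channel-planar NCHW output.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Resize to 224x224 (bicubic), normalize with CLIP mean/std, emit NCHW floats.
static float[] preprocess(BufferedImage src) {
    float[] mean = {0.48145466f, 0.4578275f, 0.40821073f};
    float[] std  = {0.26862954f, 0.26130258f, 0.27577711f};

    BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    g.drawImage(src, 0, 0, 224, 224, null);
    g.dispose();

    // NCHW layout with an implicit batch dimension of 1: [3, 224, 224] planes.
    float[] out = new float[3 * 224 * 224];
    for (int y = 0; y < 224; y++) {
        for (int x = 0; x < 224; x++) {
            int rgb = resized.getRGB(x, y);
            out[0 * 224 * 224 + y * 224 + x] = (((rgb >> 16) & 0xFF) / 255f - mean[0]) / std[0];
            out[1 * 224 * 224 + y * 224 + x] = (((rgb >>  8) & 0xFF) / 255f - mean[1]) / std[1];
            out[2 * 224 * 224 + y * 224 + x] = (((rgb      ) & 0xFF) / 255f - mean[2]) / std[2];
        }
    }
    return out;
}
```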

### Text
1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens
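Steps 2–3 can be sketched as a token-packing routine. The special-token ids come from the list above; the pad id is an assumption here (the original OpenAI implementation pads with 0 — check your tokenizer's convention before relying on it).

```java
// Wrap BPE token ids with <|startoftext|>/<|endoftext|> and pad/truncate to 77,
// returning { input_ids, attention_mask }. PAD id 0 is an assumption.
static int[][] pack(int[] bpeTokens) {
    final int BOS = 49406, EOS = 49407, LEN = 77, PAD = 0;
    int body = Math.min(bpeTokens.length, LEN - 2); // leave room for BOS/EOS
    int[] ids = new int[LEN];
    int[] mask = new int[LEN];
    ids[0] = BOS;
    for (int i = 0; i < body; i++) ids[i + 1] = bpeTokens[i];
    ids[body + 1] = EOS;
    for (int i = 0; i <= body + 1; i++) mask[i] = 1; // 1 = real token, 0 = padding
    for (int i = body + 2; i < LEN; i++) ids[i] = PAD;
    return new int[][] { ids, mask };
}
```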

## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.