vccarvalho11 committed (verified) · Commit 88fd91c · 1 parent: 56399b7

Export CLIP ViT-B/32 vision and text encoders as separate ONNX models

Files changed (3):
  1. README.md +96 -0
  2. text_model.onnx +3 -0
  3. vision_model.onnx +3 -0
README.md ADDED
@@ -0,0 +1,96 @@
---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
split into separate vision and text encoder models so each can be used independently.

Converted for use with [inference4j](https://github.com/inference4j/inference4j),
an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    float similarity = dot(imageEmb, textEmb);
}
```

### Zero-shot classification

```java
// imageEncoder, textEncoder and photo as in the example above
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
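
Both snippets call a `dot` helper that the card does not define. Assuming the encoders return L2-normalized embeddings (an assumption; if they are not normalized, divide by the product of the norms), a plain dot product gives the cosine similarity. A minimal sketch:

```java
public class Similarity {
    // Minimal sketch: plain dot product over two equal-length vectors.
    // If both embeddings are L2-normalized, this equals their cosine similarity.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```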

## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision
1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`
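
The three vision steps can be sketched in plain Java with only AWT (`ClipPreprocess` and its method name are illustrative, not an inference4j API):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ClipPreprocess {
    // CLIP per-channel normalization constants (RGB order)
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    /** Resize to 224x224 (bicubic), normalize, and lay out as NCHW [1*3*224*224]. */
    static float[] preprocess(BufferedImage src) {
        BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, 224, 224, null);
        g.dispose();

        float[] out = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = resized.getRGB(x, y);
                // Channel-first layout: plane c, then row y, then column x
                for (int c = 0; c < 3; c++) {
                    int v = (rgb >> (16 - 8 * c)) & 0xFF; // c=0: R, c=1: G, c=2: B
                    out[c * 224 * 224 + y * 224 + x] = (v / 255f - MEAN[c]) / STD[c];
                }
            }
        }
        return out;
    }
}
```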

### Text
1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens
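
Steps 2–3 can be sketched as follows (`ClipContext` is illustrative; the zero pad value follows the original OpenAI tokenizer and is an assumption, since the card does not specify it):

```java
public class ClipContext {
    static final long BOS = 49406;   // <|startoftext|>
    static final long EOS = 49407;   // <|endoftext|>
    static final int LENGTH = 77;

    /** Wrap BPE ids with BOS/EOS, truncate to fit, pad ids with 0 (assumed pad value).
     *  Returns {input_ids, attention_mask}, each of length 77. */
    static long[][] toContext(long[] bpeIds) {
        long[] ids = new long[LENGTH];   // zero-initialized: padding stays 0
        long[] mask = new long[LENGTH];
        int body = Math.min(bpeIds.length, LENGTH - 2); // leave room for BOS/EOS
        ids[0] = BOS;
        for (int i = 0; i < body; i++) {
            ids[i + 1] = bpeIds[i];
        }
        ids[body + 1] = EOS;
        for (int i = 0; i <= body + 1; i++) {
            mask[i] = 1;                 // 1 over real tokens, 0 over padding
        }
        return new long[][] {ids, mask};
    }
}
```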

## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.
text_model.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ec9cfa29fc10a5c6dd5e7efd3b4aab7351a1961a134721622e5eaa57dd44981f
size 253812304
vision_model.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e8518f8b64e5abdd0bdb44e336a4e1367f48803c5a60462baea32dc0d18a2fd7
size 351484834