Export CLIP ViT-B/32 vision and text encoders as separate ONNX models

Files changed: README.md (+96), text_model.onnx (+3), vision_model.onnx (+3)

README.md (added):
---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), split into separate vision and text encoder models for independent use.

Converted for use with [inference4j](https://github.com/inference4j/inference4j), an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    // dot product of the two embeddings; for L2-normalized CLIP
    // embeddings this is the cosine similarity
    float similarity = dot(imageEmb, textEmb);
}
```

### Zero-shot classification

```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
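Both snippets above call a `dot` helper that the examples leave undefined. A minimal sketch, assuming the encoders return L2-normalized embeddings (in which case the dot product equals the cosine similarity):

```java
public final class Similarity {

    /** Dot product of two equal-length vectors; for L2-normalized CLIP
     *  embeddings this equals the cosine similarity. */
    public static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

If the encoders do not normalize their outputs, divide the result by the product of the two vectors' norms to recover cosine similarity.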

## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~351 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~254 MB |
| `vocab.json` | BPE vocabulary (49408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask` `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision
1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`

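The three vision steps above can be sketched in plain Java (a sketch, not inference4j's implementation; `Image.SCALE_SMOOTH` stands in for true bicubic resampling):

```java
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;

public final class ClipImagePreprocess {
    // CLIP normalization constants from the model card
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    /** Resize to 224x224 and produce a CLIP-normalized NCHW float buffer
     *  of shape [1, 3, 224, 224] (flattened). */
    public static float[] toModelInput(BufferedImage src) {
        BufferedImage img = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.drawImage(src.getScaledInstance(224, 224, Image.SCALE_SMOOTH), 0, 0, null);
        g.dispose();

        float[] out = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = img.getRGB(x, y);
                int[] ch = {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
                for (int c = 0; c < 3; c++) {
                    // channel-major (NCHW): all R values, then G, then B
                    out[c * 224 * 224 + y * 224 + x] = (ch[c] / 255f - MEAN[c]) / STD[c];
                }
            }
        }
        return out;
    }
}
```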
### Text
1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens

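Steps 2–3 can be sketched as follows (a sketch only; note the padding token varies by implementation: the Hugging Face CLIP tokenizer pads with `<|endoftext|>`, the original OpenAI code with 0):

```java
import java.util.Arrays;

public final class ClipTextPrep {
    static final int BOS = 49406; // <|startoftext|>
    static final int EOS = 49407; // <|endoftext|>
    static final int CONTEXT = 77;

    /** Wrap BPE token ids with BOS/EOS and pad/truncate to the
     *  77-token context window. EOS is used as the padding id here. */
    public static int[] toInputIds(int[] bpeIds) {
        int[] out = new int[CONTEXT];
        Arrays.fill(out, EOS);
        out[0] = BOS;
        int body = Math.min(bpeIds.length, CONTEXT - 2); // leave room for BOS/EOS
        for (int i = 0; i < body; i++) {
            out[i + 1] = bpeIds[i];
        }
        out[body + 1] = EOS;
        return out;
    }

    /** Attention mask: 1 over BOS + tokens + EOS, 0 over padding. */
    public static int[] toAttentionMask(int[] bpeIds) {
        int[] mask = new int[CONTEXT];
        int used = Math.min(bpeIds.length, CONTEXT - 2) + 2;
        for (int i = 0; i < used; i++) mask[i] = 1;
        return mask;
    }
}
```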
## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.

text_model.onnx (added): Git LFS pointer

version https://git-lfs.github.com/spec/v1
oid sha256:ec9cfa29fc10a5c6dd5e7efd3b4aab7351a1961a134721622e5eaa57dd44981f
size 253812304

vision_model.onnx (added): Git LFS pointer

version https://git-lfs.github.com/spec/v1
oid sha256:e8518f8b64e5abdd0bdb44e336a4e1367f48803c5a60462baea32dc0d18a2fd7
size 351484834