Export CLIP ViT-B/32 vision and text encoders as separate ONNX models

Files changed: README.md (+96), text_model.onnx (+3), vision_model.onnx (+3)

README.md (added):
---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), split into separate vision and text encoder models for independent use.

Converted for use with [inference4j](https://github.com/inference4j/inference4j), an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    // dot product of the two embeddings; for L2-normalized CLIP
    // embeddings this is the cosine similarity
    float similarity = dot(imageEmb, textEmb);
}
```

### Zero-shot classification

```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
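Both snippets above call a `dot` helper that the examples leave undefined. A minimal sketch, assuming the encoders return L2-normalized embeddings (in which case the dot product equals the cosine similarity):

```java
public final class Similarity {

    /** Dot product of two equal-length vectors; for L2-normalized CLIP
     *  embeddings this equals the cosine similarity. */
    public static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

If the encoders do not normalize their outputs, divide the result by the product of the two vectors' norms to recover cosine similarity.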

## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~351 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~254 MB |
| `vocab.json` | BPE vocabulary (49408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask` `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision
1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`

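The three vision steps above can be sketched in plain Java (a sketch, not inference4j's implementation; `Image.SCALE_SMOOTH` stands in for true bicubic resampling):

```java
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;

public final class ClipImagePreprocess {
    // CLIP normalization constants from the model card
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    /** Resize to 224x224 and produce a CLIP-normalized NCHW float buffer
     *  of shape [1, 3, 224, 224] (flattened). */
    public static float[] toModelInput(BufferedImage src) {
        BufferedImage img = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.drawImage(src.getScaledInstance(224, 224, Image.SCALE_SMOOTH), 0, 0, null);
        g.dispose();

        float[] out = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = img.getRGB(x, y);
                int[] ch = {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
                for (int c = 0; c < 3; c++) {
                    // channel-major (NCHW): all R values, then G, then B
                    out[c * 224 * 224 + y * 224 + x] = (ch[c] / 255f - MEAN[c]) / STD[c];
                }
            }
        }
        return out;
    }
}
```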
### Text
1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens

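Steps 2–3 can be sketched as follows (a sketch only; note the padding token varies by implementation: the Hugging Face CLIP tokenizer pads with `<|endoftext|>`, the original OpenAI code with 0):

```java
import java.util.Arrays;

public final class ClipTextPrep {
    static final int BOS = 49406; // <|startoftext|>
    static final int EOS = 49407; // <|endoftext|>
    static final int CONTEXT = 77;

    /** Wrap BPE token ids with BOS/EOS and pad/truncate to the
     *  77-token context window. EOS is used as the padding id here. */
    public static int[] toInputIds(int[] bpeIds) {
        int[] out = new int[CONTEXT];
        Arrays.fill(out, EOS);
        out[0] = BOS;
        int body = Math.min(bpeIds.length, CONTEXT - 2); // leave room for BOS/EOS
        for (int i = 0; i < body; i++) {
            out[i + 1] = bpeIds[i];
        }
        out[body + 1] = EOS;
        return out;
    }

    /** Attention mask: 1 over BOS + tokens + EOS, 0 over padding. */
    public static int[] toAttentionMask(int[] bpeIds) {
        int[] mask = new int[CONTEXT];
        int used = Math.min(bpeIds.length, CONTEXT - 2) + 2;
        for (int i = 0; i < used; i++) mask[i] = 1;
        return mask;
    }
}
```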
## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.

text_model.onnx (added): Git LFS pointer

version https://git-lfs.github.com/spec/v1
oid sha256:ec9cfa29fc10a5c6dd5e7efd3b4aab7351a1961a134721622e5eaa57dd44981f
size 253812304

vision_model.onnx (added): Git LFS pointer

version https://git-lfs.github.com/spec/v1
oid sha256:e8518f8b64e5abdd0bdb44e336a4e1367f48803c5a60462baea32dc0d18a2fd7
size 351484834