---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
split into separate vision and text encoder models for independent use.

Converted for use with [inference4j](https://github.com/inference4j/inference4j),
an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    // dot() computes the dot product of the two 512-dim embeddings
    float similarity = dot(imageEmb, textEmb);
}
```

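The `dot` helper used in the snippets is not part of the API shown above; a minimal sketch, assuming the encoders return plain `float[]` embeddings:

```java
// Dot product of two embeddings. If the encoders return L2-normalized
// vectors, this equals cosine similarity; otherwise use cosine() below.
static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}

// Cosine similarity, safe to use whether or not the vectors are normalized.
static float cosine(float[] a, float[] b) {
    return dot(a, b) / (float) (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
```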
### Zero-shot classification

```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

// Score each label prompt against the image and keep the best match.
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```

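If label probabilities are needed rather than just the best match, CLIP applies a softmax over the cosine similarities scaled by a learned logit scale (roughly 100 for the released ViT-B/32 checkpoint). A sketch, assuming the scores are cosine similarities:

```java
// Convert similarity scores into a probability distribution over labels.
// logitScale is CLIP's learned temperature (about 100 for ViT-B/32).
static float[] softmax(float[] scores, float logitScale) {
    float[] probs = new float[scores.length];
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < scores.length; i++) {
        probs[i] = scores[i] * logitScale;
        max = Math.max(max, probs[i]);
    }
    float sum = 0f;
    for (int i = 0; i < probs.length; i++) {
        probs[i] = (float) Math.exp(probs[i] - max);  // subtract max for numerical stability
        sum += probs[i];
    }
    for (int i = 0; i < probs.length; i++) probs[i] /= sum;
    return probs;
}
```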
## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]`, CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision

1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`

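The vision steps above can be sketched in plain Java. This is a minimal sketch using `java.awt` for resizing; the AWT bicubic kernel may differ slightly from the original PIL implementation, so small numeric deviations from the reference pipeline are possible:

```java
import java.awt.*;
import java.awt.image.BufferedImage;

// Resize to 224x224, apply CLIP mean/std normalization, emit CHW floats.
static float[] preprocess(BufferedImage src) {
    float[] mean = {0.48145466f, 0.4578275f, 0.40821073f};
    float[] std  = {0.26862954f, 0.26130258f, 0.27577711f};

    BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    g.drawImage(src, 0, 0, 224, 224, null);
    g.dispose();

    float[] chw = new float[3 * 224 * 224];
    for (int y = 0; y < 224; y++) {
        for (int x = 0; x < 224; x++) {
            int rgb = resized.getRGB(x, y);
            for (int c = 0; c < 3; c++) {
                int v = (rgb >> (16 - 8 * c)) & 0xFF;  // c = 0,1,2 -> R,G,B
                chw[c * 224 * 224 + y * 224 + x] =
                        (v / 255f - mean[c]) / std[c];  // scale to [0,1], then normalize
            }
        }
    }
    return chw;  // [3, 224, 224]; add the batch dimension when building the tensor
}
```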
### Text

1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens

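Text steps 2–3 can be sketched as follows. This assumes `bpeIds` holds the ids from step 1 and that the export expects zero padding; some CLIP tokenizer variants pad with the end-of-text id instead, so check what this export was traced with:

```java
// Build the two text-encoder inputs: input_ids and attention_mask, each length 77.
static long[][] buildInputs(int[] bpeIds) {
    final int BOS = 49406;   // <|startoftext|>
    final int EOS = 49407;   // <|endoftext|>
    final int CONTEXT = 77;  // fixed CLIP text context length

    long[] inputIds = new long[CONTEXT];       // zero-initialized = zero padding
    long[] attentionMask = new long[CONTEXT];

    // Truncate so BOS + tokens + EOS fit within the context window.
    int n = Math.min(bpeIds.length, CONTEXT - 2);
    inputIds[0] = BOS;
    for (int i = 0; i < n; i++) inputIds[i + 1] = bpeIds[i];
    inputIds[n + 1] = EOS;
    for (int i = 0; i <= n + 1; i++) attentionMask[i] = 1;

    return new long[][] {inputIds, attentionMask};
}
```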
## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.