---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32), split into separate vision and text encoder models so each can be used independently. Converted for use with [inference4j](https://github.com/inference4j/inference4j), an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");
    float similarity = dot(imageEmb, textEmb);
}
```

Here `dot` denotes a caller-supplied dot-product helper; assuming the encoders return L2-normalized embeddings, the dot product equals cosine similarity.

### Zero-shot classification

```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    // Score each candidate caption against the image embedding
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```

## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision

1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`, std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`

### Text

1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (id 49406) and `<|endoftext|>` (id 49407)
3. Pad/truncate to 77 tokens

## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI; this ONNX export is distributed under the same license.
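The usage snippets above call a `dot` helper that is not shown and is not part of the inference4j API. A minimal sketch (assuming the encoders return L2-normalized 512-dim embeddings, so the dot product equals cosine similarity) could look like:

```java
// Hypothetical helper for the usage snippets above; not part of inference4j.
// For L2-normalized CLIP embeddings, the dot product equals cosine similarity.
public final class Similarity {
    public static float dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimensions differ");
        }
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

If the embeddings are not already normalized, divide each vector by its L2 norm first; otherwise raw dot products are not comparable across images.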
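The three vision preprocessing steps can be sketched in plain Java using only the AWT standard library. This is an illustrative helper under the stated mean/std constants, not inference4j's built-in pipeline, which may differ in implementation details:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Sketch of the vision preprocessing above (hypothetical helper, not inference4j):
// resize to 224x224 (bicubic), CLIP-normalize, and emit a flat NCHW buffer.
public final class ClipPreprocess {
    private static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    private static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    public static float[] toClipTensor(BufferedImage src) {
        // 1. Resize to 224x224 with bicubic interpolation
        BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, 224, 224, null);
        g.dispose();

        // 2 + 3. Scale to [0,1], normalize per channel, write channel-major (NCHW)
        float[] tensor = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = resized.getRGB(x, y);
                float r  = ((rgb >> 16) & 0xFF) / 255f;
                float gr = ((rgb >> 8) & 0xFF) / 255f;
                float b  = (rgb & 0xFF) / 255f;
                tensor[0 * 224 * 224 + y * 224 + x] = (r  - MEAN[0]) / STD[0];
                tensor[1 * 224 * 224 + y * 224 + x] = (gr - MEAN[1]) / STD[1];
                tensor[2 * 224 * 224 + y * 224 + x] = (b  - MEAN[2]) / STD[2];
            }
        }
        return tensor; // reshape to [1, 3, 224, 224] when building the ONNX input
    }
}
```

The returned array is the flattened `[1, 3, 224, 224]` tensor expected by `vision_model.onnx`.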