---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---
# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)
ONNX export of [`openai/clip-vit-base-patch32`](https://huggingface.co/openai/clip-vit-base-patch32), split into separate vision and text encoder models for independent use.

Converted for use with inference4j, an inference-only AI library for Java.
## Usage with inference4j

### Visual search (image-text similarity)
```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");
    float similarity = dot(imageEmb, textEmb);
}
```
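The `dot` helper is not shown in the snippet above; a minimal sketch follows (the class name is illustrative, and it assumes both encoders return equal-length `float[]` embeddings). For L2-normalized CLIP embeddings the dot product equals cosine similarity; a `cosine` variant is included for unnormalized vectors:

```java
public class Similarity {
    // Dot product of two equal-length embeddings. For L2-normalized
    // CLIP embeddings this is the cosine similarity.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Cosine similarity for vectors that are not yet normalized.
    static float cosine(float[] a, float[] b) {
        float dot = 0f, na = 0f, nb = 0f;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (float) (dot / (Math.sqrt(na) * Math.sqrt(nb)));
    }
}
```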
### Zero-shot classification
```java
float[] imageEmb = imageEncoder.encode(photo);

String[] labels = {"cat", "dog", "bird", "car"};
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
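The loop above only picks the best label. To turn the raw similarity scores into class probabilities, CLIP multiplies cosine similarities by its learned logit scale (approximately 100 for the released checkpoint) and applies a softmax. A sketch of that step (the class and method names are illustrative):

```java
public class ZeroShot {
    // Temperature-scaled softmax over cosine similarities, as in CLIP:
    // probs = softmax(logitScale * similarities). Subtracting the max
    // before exponentiating keeps the computation numerically stable.
    static float[] softmax(float[] scores, float logitScale) {
        float max = Float.NEGATIVE_INFINITY;
        for (float s : scores) max = Math.max(max, s * logitScale);
        float[] probs = new float[scores.length];
        float sum = 0f;
        for (int i = 0; i < scores.length; i++) {
            probs[i] = (float) Math.exp(scores[i] * logitScale - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }
}
```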
## Files
| File | Description | Size |
|---|---|---|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |
## Model Details
| Property | Value |
|---|---|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]` — CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |
## Preprocessing

### Vision
- Resize to 224×224 (bicubic)
- CLIP normalization: mean = `[0.48145466, 0.4578275, 0.40821073]`, std = `[0.26862954, 0.26130258, 0.27577711]`
- NCHW layout: `[1, 3, 224, 224]`
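The steps above can be sketched in plain Java. This is an illustrative stand-in for whatever preprocessing inference4j performs internally; the class and method names are assumptions, and only the constants come from the card:

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class ClipImagePreprocess {
    static final float[] MEAN = {0.48145466f, 0.4578275f, 0.40821073f};
    static final float[] STD  = {0.26862954f, 0.26130258f, 0.27577711f};

    // Resize to 224x224 (bicubic), scale pixel values to [0, 1], apply
    // CLIP normalization, and return the flattened NCHW [1, 3, 224, 224]
    // tensor data.
    static float[] preprocess(BufferedImage src) {
        BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                           RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, 224, 224, null);
        g.dispose();

        float[] chw = new float[3 * 224 * 224];
        for (int y = 0; y < 224; y++) {
            for (int x = 0; x < 224; x++) {
                int rgb = resized.getRGB(x, y);
                float r  = ((rgb >> 16) & 0xFF) / 255f;
                float gr = ((rgb >> 8) & 0xFF) / 255f;
                float b  = (rgb & 0xFF) / 255f;
                // Channel-major (CHW) layout: one 224x224 plane per channel.
                chw[0 * 224 * 224 + y * 224 + x] = (r  - MEAN[0]) / STD[0];
                chw[1 * 224 * 224 + y * 224 + x] = (gr - MEAN[1]) / STD[1];
                chw[2 * 224 * 224 + y * 224 + x] = (b  - MEAN[2]) / STD[2];
            }
        }
        return chw;
    }
}
```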
### Text
- Byte-level BPE tokenization using `vocab.json` + `merges.txt`
- Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
- Pad/truncate to 77 tokens
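The special-token and padding steps above can be sketched as follows. BPE encoding itself is omitted (the library handles it via `vocab.json` and `merges.txt`); padding with ID 0 follows the original OpenAI implementation (Hugging Face's tokenizer pads differently), and the class name is illustrative:

```java
import java.util.Arrays;

public class ClipTextPad {
    static final int BOS = 49406;  // <|startoftext|>
    static final int EOS = 49407;  // <|endoftext|>
    static final int MAX_LEN = 77;

    // Wrap BPE token IDs with BOS/EOS, truncate so the sequence fits in
    // 77 positions, and pad the remainder with 0. Returns {input_ids,
    // attention_mask}; the mask is 1 for real tokens, 0 for padding.
    static int[][] encode(int[] bpeIds) {
        int body = Math.min(bpeIds.length, MAX_LEN - 2); // leave room for BOS/EOS
        int[] ids = new int[MAX_LEN];
        int[] mask = new int[MAX_LEN];
        ids[0] = BOS;
        for (int i = 0; i < body; i++) ids[i + 1] = bpeIds[i];
        ids[body + 1] = EOS;
        Arrays.fill(mask, 0, body + 2, 1);
        return new int[][]{ids, mask};
    }
}
```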
## Original Paper

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)
## License
The original CLIP model is released under the MIT License by OpenAI.