ONNX export of openai/clip-vit-base-patch32 split into separate vision and text encoder models for independent use.
Converted for use with inference4j, an inference-only AI library for Java.
```java
import javax.imageio.ImageIO;
import java.nio.file.Path;

try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {
    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");
    float similarity = dot(imageEmb, textEmb); // see the dot sketch below
}
```
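The `dot` helper is not defined by the snippet above; a minimal sketch is below. It computes cosine similarity, which reduces to a plain dot product when the encoder outputs are already L2-normalized (an assumption here; the defensive normalization makes it safe either way):

```java
// Hypothetical helper, not part of inference4j: cosine similarity between
// two embeddings. If the encoders return L2-normalized vectors, this
// equals the plain dot product.
static float dot(float[] a, float[] b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
}
```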
Zero-shot classification compares one image embedding against a prompt embedding per candidate label and picks the highest-scoring label:

```java
float[] imageEmb = imageEncoder.encode(photo); // photo: a BufferedImage, loaded as in the previous example
String[] labels = {"cat", "dog", "bird", "car"};
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    // Embed a prompt per label and score it against the image embedding.
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```
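To turn raw similarities into a probability distribution over the labels, the common recipe is to multiply the cosine similarities by CLIP's learned logit scale (roughly 100 for the released checkpoints, an assumption here) and take a softmax; a sketch:

```java
// Hypothetical helper: softmax over scaled similarity scores.
static float[] softmaxProbs(float[] similarities) {
    float scale = 100f; // assumption: approximate logit scale of released CLIP models
    double max = Double.NEGATIVE_INFINITY, sum = 0;
    double[] exp = new double[similarities.length];
    for (float s : similarities) max = Math.max(max, s * scale);
    for (int i = 0; i < similarities.length; i++) {
        exp[i] = Math.exp(similarities[i] * scale - max); // subtract max for numerical stability
        sum += exp[i];
    }
    float[] probs = new float[similarities.length];
    for (int i = 0; i < probs.length; i++) probs[i] = (float) (exp[i] / sum);
    return probs;
}
```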
| File | Description | Size |
|---|---|---|
| vision_model.onnx | Vision encoder (ViT-B/32) | ~340 MB |
| text_model.onnx | Text encoder (Transformer) | ~255 MB |
| vocab.json | BPE vocabulary (49408 tokens) | ~1.6 MB |
| merges.txt | BPE merge rules (48894 merges) | ~1.7 MB |
| Property | Value |
|---|---|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | [batch, 3, 224, 224], CLIP-normalized |
| Text input | input_ids + attention_mask, each [batch, 77] |
| ONNX opset | 17 |
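For use outside inference4j, the text model can also be run directly with ONNX Runtime's Java API. The sketch below assumes the input names input_ids and attention_mask from the table above as int64 tensors of shape [1, 77], and that the first output is the projected 512-dim embedding; verify the names and output order against the actual graph before relying on it:

```java
import ai.onnxruntime.OnnxTensor;
import ai.onnxruntime.OrtEnvironment;
import ai.onnxruntime.OrtException;
import ai.onnxruntime.OrtSession;
import java.nio.LongBuffer;
import java.util.Map;

// Sketch: run text_model.onnx directly. Input names and output order are
// assumptions; check them with a tool such as Netron.
static float[] encodeText(long[] inputIds, long[] attentionMask) throws OrtException {
    OrtEnvironment env = OrtEnvironment.getEnvironment();
    try (OrtSession session = env.createSession("text_model.onnx", new OrtSession.SessionOptions());
         OnnxTensor ids = OnnxTensor.createTensor(env, LongBuffer.wrap(inputIds), new long[]{1, 77});
         OnnxTensor mask = OnnxTensor.createTensor(env, LongBuffer.wrap(attentionMask), new long[]{1, 77});
         OrtSession.Result result = session.run(Map.of("input_ids", ids, "attention_mask", mask))) {
        float[][] embeds = (float[][]) result.get(0).getValue(); // assumed shape [1, 512]
        return embeds[0];
    }
}
```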
Images are expected as [1, 3, 224, 224] float tensors, scaled to [0, 1] and normalized per channel with mean=[0.48145466, 0.4578275, 0.40821073] and std=[0.26862954, 0.26130258, 0.27577711].
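As a concrete illustration of these constants, a minimal normalization sketch in plain Java (the reference CLIP pipeline uses bicubic resize plus center crop; the simple rescale here is an approximation):

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Sketch: resize to 224x224, scale to [0, 1], apply CLIP mean/std per
// channel, and emit a flat CHW buffer for the [1, 3, 224, 224] input.
static float[] preprocess(BufferedImage src) {
    float[] mean = {0.48145466f, 0.4578275f, 0.40821073f};
    float[] std  = {0.26862954f, 0.26130258f, 0.27577711f};
    BufferedImage img = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = img.createGraphics();
    g.drawImage(src, 0, 0, 224, 224, null); // plain resize, not bicubic + center crop
    g.dispose();
    float[] chw = new float[3 * 224 * 224];
    for (int y = 0; y < 224; y++) {
        for (int x = 0; x < 224; x++) {
            int rgb = img.getRGB(x, y);
            for (int c = 0; c < 3; c++) {
                int v = (rgb >> (16 - 8 * c)) & 0xFF; // channel order R, G, B
                chw[c * 224 * 224 + y * 224 + x] = (v / 255f - mean[c]) / std[c];
            }
        }
    }
    return chw;
}
```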
Text is tokenized with byte-pair encoding using vocab.json + merges.txt; sequences are wrapped in <|startoftext|> (ID 49406) and <|endoftext|> (ID 49407) and padded to 77 tokens.

Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. arXiv:2103.00020.
The original CLIP model is released under the MIT License by OpenAI.