---
library_name: onnx
tags:
- clip
- multimodal
- visual-search
- zero-shot-classification
- onnx
- inference4j
license: mit
datasets:
- openai/clip-training-data
---

# CLIP ViT-B/32 — ONNX (Vision + Text Encoders)

ONNX export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
split into separate vision and text encoder models for independent use.

Converted for use with [inference4j](https://github.com/inference4j/inference4j),
an inference-only AI library for Java.

## Usage with inference4j

### Visual search (image-text similarity)

```java
try (ClipImageEncoder imageEncoder = ClipImageEncoder.builder().build();
     ClipTextEncoder textEncoder = ClipTextEncoder.builder().build()) {

    float[] imageEmb = imageEncoder.encode(ImageIO.read(Path.of("photo.jpg").toFile()));
    float[] textEmb = textEncoder.encode("a photo of a cat");

    // dot() computes the dot product of the two 512-dim embeddings
    float similarity = dot(imageEmb, textEmb);
}
```

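The `dot` helper used in the snippets is not part of the API shown above; a minimal sketch, assuming the encoders return plain `float[]` embeddings:

```java
// Dot product of two embeddings. If the encoders return L2-normalized
// vectors, this equals cosine similarity; otherwise use cosine() below.
static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
}

// Cosine similarity, safe to use whether or not the vectors are normalized.
static float cosine(float[] a, float[] b) {
    return dot(a, b) / (float) (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}
```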
### Zero-shot classification

```java
float[] imageEmb = imageEncoder.encode(photo);
String[] labels = {"cat", "dog", "bird", "car"};

// Score each label prompt against the image and keep the best match.
float bestScore = Float.NEGATIVE_INFINITY;
String bestLabel = null;
for (String label : labels) {
    float score = dot(imageEmb, textEncoder.encode("a photo of a " + label));
    if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
    }
}
```

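If label probabilities are needed rather than just the best match, CLIP applies a softmax over the cosine similarities scaled by a learned logit scale (roughly 100 for the released ViT-B/32 checkpoint). A sketch, assuming the scores are cosine similarities:

```java
// Convert similarity scores into a probability distribution over labels.
// logitScale is CLIP's learned temperature (about 100 for ViT-B/32).
static float[] softmax(float[] scores, float logitScale) {
    float[] probs = new float[scores.length];
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < scores.length; i++) {
        probs[i] = scores[i] * logitScale;
        max = Math.max(max, probs[i]);
    }
    float sum = 0f;
    for (int i = 0; i < probs.length; i++) {
        probs[i] = (float) Math.exp(probs[i] - max);  // subtract max for numerical stability
        sum += probs[i];
    }
    for (int i = 0; i < probs.length; i++) probs[i] /= sum;
    return probs;
}
```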
## Files

| File | Description | Size |
|------|-------------|------|
| `vision_model.onnx` | Vision encoder (ViT-B/32) | ~340 MB |
| `text_model.onnx` | Text encoder (Transformer) | ~255 MB |
| `vocab.json` | BPE vocabulary (49,408 tokens) | ~1.6 MB |
| `merges.txt` | BPE merge rules (48,894 merges) | ~1.7 MB |

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ViT-B/32 (vision) + Transformer (text) |
| Embedding dim | 512 |
| Max text length | 77 tokens |
| Image input | `[batch, 3, 224, 224]`, CLIP-normalized |
| Text input | `input_ids` + `attention_mask`, each `[batch, 77]` |
| ONNX opset | 17 |

## Preprocessing

### Vision

1. Resize to 224×224 (bicubic)
2. CLIP normalization: mean=`[0.48145466, 0.4578275, 0.40821073]`,
   std=`[0.26862954, 0.26130258, 0.27577711]`
3. NCHW layout: `[1, 3, 224, 224]`

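The vision steps above can be sketched in plain Java. This is a minimal sketch using `java.awt` for resizing; the AWT bicubic kernel may differ slightly from the original PIL implementation, so small numeric deviations from the reference pipeline are possible:

```java
import java.awt.*;
import java.awt.image.BufferedImage;

// Resize to 224x224, apply CLIP mean/std normalization, emit CHW floats.
static float[] preprocess(BufferedImage src) {
    float[] mean = {0.48145466f, 0.4578275f, 0.40821073f};
    float[] std  = {0.26862954f, 0.26130258f, 0.27577711f};

    BufferedImage resized = new BufferedImage(224, 224, BufferedImage.TYPE_INT_RGB);
    Graphics2D g = resized.createGraphics();
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                       RenderingHints.VALUE_INTERPOLATION_BICUBIC);
    g.drawImage(src, 0, 0, 224, 224, null);
    g.dispose();

    float[] chw = new float[3 * 224 * 224];
    for (int y = 0; y < 224; y++) {
        for (int x = 0; x < 224; x++) {
            int rgb = resized.getRGB(x, y);
            for (int c = 0; c < 3; c++) {
                int v = (rgb >> (16 - 8 * c)) & 0xFF;  // c = 0,1,2 -> R,G,B
                chw[c * 224 * 224 + y * 224 + x] =
                        (v / 255f - mean[c]) / std[c];  // scale to [0,1], then normalize
            }
        }
    }
    return chw;  // [3, 224, 224]; add the batch dimension when building the tensor
}
```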
### Text

1. Byte-level BPE tokenization using `vocab.json` + `merges.txt`
2. Add `<|startoftext|>` (49406) and `<|endoftext|>` (49407)
3. Pad/truncate to 77 tokens

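Text steps 2–3 can be sketched as follows. This assumes `bpeIds` holds the ids from step 1 and that the export expects zero padding; some CLIP tokenizer variants pad with the end-of-text id instead, so check what this export was traced with:

```java
// Build the two text-encoder inputs: input_ids and attention_mask, each length 77.
static long[][] buildInputs(int[] bpeIds) {
    final int BOS = 49406;   // <|startoftext|>
    final int EOS = 49407;   // <|endoftext|>
    final int CONTEXT = 77;  // fixed CLIP text context length

    long[] inputIds = new long[CONTEXT];       // zero-initialized = zero padding
    long[] attentionMask = new long[CONTEXT];

    // Truncate so BOS + tokens + EOS fit within the context window.
    int n = Math.min(bpeIds.length, CONTEXT - 2);
    inputIds[0] = BOS;
    for (int i = 0; i < n; i++) inputIds[i + 1] = bpeIds[i];
    inputIds[n + 1] = EOS;
    for (int i = 0; i <= n + 1; i++) attentionMask[i] = 1;

    return new long[][] {inputIds, attentionMask};
}
```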
## Original Paper

> Radford, A., Kim, J. W., Hallacy, C., et al. (2021).
> Learning Transferable Visual Models From Natural Language Supervision.
> ICML 2021. [arXiv:2103.00020](https://arxiv.org/abs/2103.00020)

## License

The original CLIP model is released under the MIT License by OpenAI.