ruclip-vit-large-patch14-336

RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model for computing similarities between images and texts and for ranking captions and pictures. RuCLIP builds on a large body of work in zero-shot transfer, computer vision, natural language processing, and multimodal learning.

The model was trained by the Sber AI and SberDevices teams.

  • Task: text ranking, image ranking, zero-shot image classification
  • Type: encoder
  • Num Parameters: 430M
  • Training Data Volume: 240 million text-image pairs
  • Language: Russian
  • Context Length: 77
  • Transformer Layers: 12
  • Transformer Width: 768
  • Transformer Heads: 12
  • Image Size: 336
  • Vision Layers: 24
  • Vision Width: 1024
  • Vision Patch Size: 14

Files

| File | Purpose |
| --- | --- |
| visual.onnx | Vision encoder — input [N,3,336,336] float32, output [N,768] |
| textual.onnx | Text encoder — input [N,77] int64, output [N,768] |
| visual_int8.onnx | Vision encoder INT8, full quantization (max speed) |
| textual_int8.onnx | Text encoder INT8, full quantization (max speed) |
| visual_int8_no_conv.onnx | Vision INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| textual_int8_no_conv.onnx | Text INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| bpe.model | Original tokenizer (YouTokenToMe). Use with Python/ruclip. |
| tokenizer.json | Same vocab in Hugging Face format — for runtimes without YouTokenToMe support |
| config.json | Hyperparameters (mean/std, resolution) |
| preprocessing.md | Image and text preprocessing specification |
| inference_spec.json | Machine-readable model spec (CI/CD, tooling) |

The *_no_conv.onnx variants avoid the ConvInteger op for runtimes that do not implement it (e.g. ort in Rust). Same I/O as other INT8 models.
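The image preprocessing referenced above can be sketched in NumPy. The mean/std constants below are the standard CLIP values and are an assumption here — the authoritative values for this model are stored in config.json; the sketch also assumes the image is already resized/cropped to 336×336 RGB.

```python
import numpy as np

# Assumed CLIP-style normalization constants; the actual values
# for this model live in config.json.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Turn an already-resized 336x336 RGB uint8 image into the
    [1, 3, 336, 336] float32 tensor expected by visual.onnx."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - MEAN) / STD                            # per-channel normalization
    x = x.transpose(2, 0, 1)                        # HWC -> CHW
    return x[np.newaxis]                            # add batch dim -> NCHW
```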

Usage

ONNX

Image: normalized [N,3,336,336] (see preprocessing.md). Text: [N,77] int64 (BOS + token_ids + EOS, pad to 77). Special IDs: pad=0, unk=1, bos=2, eos=3.
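Building the [N,77] text input from BPE token ids follows directly from the special IDs above; a minimal sketch (the ids themselves would come from bpe.model or tokenizer.json):

```python
import numpy as np

# Special token IDs from the model card: pad=0, unk=1, bos=2, eos=3.
PAD, BOS, EOS = 0, 2, 3
CONTEXT_LENGTH = 77

def build_text_input(token_ids: list[int]) -> np.ndarray:
    """Wrap BPE token ids in BOS/EOS, pad to 77, return a [1, 77] int64 array."""
    ids = [BOS] + token_ids[: CONTEXT_LENGTH - 2] + [EOS]  # truncate to fit
    ids += [PAD] * (CONTEXT_LENGTH - len(ids))             # right-pad with zeros
    return np.array([ids], dtype=np.int64)
```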

import onnxruntime as ort

# Load the FP32 encoders on CPU; swap in the *_int8.onnx files for speed.
v = ort.InferenceSession("visual.onnx", providers=["CPUExecutionProvider"])
t = ort.InferenceSession("textual.onnx", providers=["CPUExecutionProvider"])

# image: float32 [N,3,336,336]; text: int64 [N,77]
img_emb = v.run(None, {v.get_inputs()[0].name: image})[0]
txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]

Tokenization: use bpe.model (Python/YouTokenToMe) or tokenizer.json (Hugging Face tokenizers). INT8: same I/O; use *_int8.onnx for max speed, or *_int8_no_conv.onnx where ConvInteger isn’t supported (e.g. ort in Rust).
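Ranking with the resulting embeddings is a cosine-similarity softmax. A minimal sketch — the logit scale of 100 is a typical CLIP value and an assumption here, since the learned value is baked into training, not shipped with the ONNX encoders:

```python
import numpy as np

def rank(img_emb: np.ndarray, txt_emb: np.ndarray,
         logit_scale: float = 100.0) -> np.ndarray:
    """Return a [num_images, num_texts] matrix of caption probabilities.
    logit_scale ~ 100 is a common CLIP value, assumed here."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)  # L2-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T                  # scaled cosine similarities
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)
```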

For contributors (advanced)

Scripts to regenerate or modify models. Not needed for inference.

| Script | Purpose |
| --- | --- |
| convert_to_onnx.py | PyTorch → ONNX export |
| verify_onnx.py | Sanity-check ONNX outputs |
| export_tokenizer_json.py | bpe.model → tokenizer.json |
| compare_tokenizers.py | Verify tokenizer parity |
| quantize_visual_model.py | visual.onnx → visual_int8.onnx + visual_int8_no_conv.onnx |
| quantize_text_model.py | textual.onnx → textual_int8.onnx + textual_int8_no_conv.onnx |

Use *_int8.onnx where ConvInteger is supported; use *_int8_no_conv.onnx for runtimes that don’t implement it (e.g. ort in Rust).

Requires: pytorch_model.bin (for conversion), requirements-convert.txt. See script docstrings for usage.

Performance

Zero-shot metrics from the original RuCLIP paper (PyTorch). ONNX outputs are numerically equivalent.

| Dataset | Metric | Result |
| --- | --- | --- |
| Food101 | acc | 0.712 |
| CIFAR10 | acc | 0.906 |
| CIFAR100 | acc | 0.591 |
| Birdsnap | acc | 0.213 |
| SUN397 | acc | 0.523 |
| Stanford Cars | acc | 0.659 |
| DTD | acc | 0.408 |
| MNIST | acc | 0.242 |
| STL10 | acc | 0.956 |
| PCam | acc | 0.554 |
| CLEVR | acc | 0.142 |
| Rendered SST2 | acc | 0.539 |
| ImageNet | acc | 0.488 |
| FGVC Aircraft | mean-per-class | 0.075 |
| Oxford Pets | mean-per-class | 0.546 |
| Caltech101 | mean-per-class | 0.835 |
| Flowers102 | mean-per-class | 0.517 |
| HatefulMemes | roc-auc | 0.519 |

Authors

RuCLIP: Alex Shonenkov, Daniil Chesakov, Denis Dimitrov, Igor Pavlov — ai-forever/ru-clip.
ONNX conversion: Tim Tkachev.
