ruclip-vit-large-patch14-336

RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model for computing similarities between images and texts and for ranking captions and pictures. RuCLIP builds on a large body of work in zero-shot transfer, computer vision, natural language processing, and multimodal learning.

The model was trained by the Sber AI and SberDevices teams.

  • Task: text ranking, image ranking, zero-shot image classification
  • Type: encoder
  • Num Parameters: 430M
  • Training Data Volume: 240 million text-image pairs
  • Language: Russian
  • Context Length: 77
  • Transformer Layers: 12
  • Transformer Width: 768
  • Transformer Heads: 12
  • Image Size: 336
  • Vision Layers: 24
  • Vision Width: 1024
  • Vision Patch Size: 14

Files

| File | Purpose |
| --- | --- |
| visual.onnx | Vision encoder — input [N,3,336,336] float32, output [N,768] |
| textual.onnx | Text encoder — input [N,77] int64, output [N,768] |
| visual_int8.onnx | Vision encoder INT8, full quantization (max speed) |
| textual_int8.onnx | Text encoder INT8, full quantization (max speed) |
| visual_int8_no_conv.onnx | Vision INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| textual_int8_no_conv.onnx | Text INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| bpe.model | Original tokenizer (YouTokenToMe). Use with Python/ruclip. |
| tokenizer.json | Same vocab in Hugging Face format — for runtimes without YouTokenToMe support |
| config.json | Hyperparameters (mean/std, resolution) |
| preprocessing.md | Image and text preprocessing specification |
| inference_spec.json | Machine-readable model spec (CI/CD, tooling) |

The *_no_conv.onnx variants avoid the ConvInteger op for runtimes that do not implement it (e.g. ort in Rust). Same I/O as other INT8 models.
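The image preprocessing referenced above can be sketched in NumPy. The mean/std constants below are the standard CLIP values and are an assumption here — the authoritative values for this model are stored in config.json; the sketch also assumes the image is already resized/cropped to 336×336 RGB.

```python
import numpy as np

# Assumed CLIP-style normalization constants; the actual values
# for this model live in config.json.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Turn an already-resized 336x336 RGB uint8 image into the
    [1, 3, 336, 336] float32 tensor expected by visual.onnx."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - MEAN) / STD                            # per-channel normalization
    x = x.transpose(2, 0, 1)                        # HWC -> CHW
    return x[np.newaxis]                            # add batch dim -> NCHW
```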

Usage

ONNX

Image: normalized [N,3,336,336] (see preprocessing.md). Text: [N,77] int64 (BOS + token_ids + EOS, pad to 77). Special IDs: pad=0, unk=1, bos=2, eos=3.
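Building the [N,77] text input from BPE token ids follows directly from the special IDs above; a minimal sketch (the ids themselves would come from bpe.model or tokenizer.json):

```python
import numpy as np

# Special token IDs from the model card: pad=0, unk=1, bos=2, eos=3.
PAD, BOS, EOS = 0, 2, 3
CONTEXT_LENGTH = 77

def build_text_input(token_ids: list[int]) -> np.ndarray:
    """Wrap BPE token ids in BOS/EOS, pad to 77, return a [1, 77] int64 array."""
    ids = [BOS] + token_ids[: CONTEXT_LENGTH - 2] + [EOS]  # truncate to fit
    ids += [PAD] * (CONTEXT_LENGTH - len(ids))             # right-pad with zeros
    return np.array([ids], dtype=np.int64)
```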

import onnxruntime as ort

# Load the FP32 encoders on CPU; swap in the *_int8.onnx files for speed.
v = ort.InferenceSession("visual.onnx", providers=["CPUExecutionProvider"])
t = ort.InferenceSession("textual.onnx", providers=["CPUExecutionProvider"])

# image: float32 [N,3,336,336]; text: int64 [N,77]
img_emb = v.run(None, {v.get_inputs()[0].name: image})[0]
txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]

Tokenization: use bpe.model (Python/YouTokenToMe) or tokenizer.json (Hugging Face tokenizers). INT8: same I/O; use *_int8.onnx for max speed, or *_int8_no_conv.onnx where ConvInteger isn’t supported (e.g. ort in Rust).
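Ranking with the resulting embeddings is a cosine-similarity softmax. A minimal sketch — the logit scale of 100 is a typical CLIP value and an assumption here, since the learned value is baked into training, not shipped with the ONNX encoders:

```python
import numpy as np

def rank(img_emb: np.ndarray, txt_emb: np.ndarray,
         logit_scale: float = 100.0) -> np.ndarray:
    """Return a [num_images, num_texts] matrix of caption probabilities.
    logit_scale ~ 100 is a common CLIP value, assumed here."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)  # L2-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T                  # scaled cosine similarities
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)
```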

For contributors (advanced)

Scripts to regenerate or modify models. Not needed for inference.

| Script | Purpose |
| --- | --- |
| convert_to_onnx.py | PyTorch → ONNX export |
| verify_onnx.py | Sanity-check ONNX outputs |
| export_tokenizer_json.py | bpe.model → tokenizer.json |
| compare_tokenizers.py | Verify tokenizer parity |
| quantize_visual_model.py | visual.onnx → visual_int8.onnx + visual_int8_no_conv.onnx |
| quantize_text_model.py | textual.onnx → textual_int8.onnx + textual_int8_no_conv.onnx |

Use *_int8.onnx where ConvInteger is supported; use *_int8_no_conv.onnx for runtimes that don’t implement it (e.g. ort in Rust).

Requires: pytorch_model.bin (for conversion), requirements-convert.txt. See script docstrings for usage.

Performance

Zero-shot metrics from the original RuCLIP paper (PyTorch). ONNX outputs are numerically equivalent.

| Dataset | Metric | Result |
| --- | --- | --- |
| Food101 | acc | 0.712 |
| CIFAR10 | acc | 0.906 |
| CIFAR100 | acc | 0.591 |
| Birdsnap | acc | 0.213 |
| SUN397 | acc | 0.523 |
| Stanford Cars | acc | 0.659 |
| DTD | acc | 0.408 |
| MNIST | acc | 0.242 |
| STL10 | acc | 0.956 |
| PCam | acc | 0.554 |
| CLEVR | acc | 0.142 |
| Rendered SST2 | acc | 0.539 |
| ImageNet | acc | 0.488 |
| FGVC Aircraft | mean-per-class | 0.075 |
| Oxford Pets | mean-per-class | 0.546 |
| Caltech101 | mean-per-class | 0.835 |
| Flowers102 | mean-per-class | 0.517 |
| HatefulMemes | roc-auc | 0.519 |

Authors

RuCLIP: Alex Shonenkov, Daniil Chesakov, Denis Dimitrov, Igor Pavlov — ai-forever/ru-clip.
ONNX conversion: Tim Tkachev.
