# ruclip-vit-large-patch14-336
RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model for computing image–text similarity and for ranking captions against images (and vice versa). RuCLIP builds on a large body of work on zero-shot transfer, computer vision, natural language processing and multimodal learning.

The model was trained by the Sber AI and SberDevices teams.
- Task: text ranking; image ranking; zero-shot image classification
- Type: encoder
- Num Parameters: 430M
- Training Data Volume: 240 million text–image pairs
- Language: Russian
- Context Length: 77
- Transformer Layers: 12
- Transformer Width: 768
- Transformer Heads: 12
- Image Size: 336
- Vision Layers: 24
- Vision Width: 1024
- Vision Patch Size: 14
## Files

| File | Purpose |
|---|---|
| `visual.onnx` | Vision encoder — input `[N,3,336,336]` float32, output `[N,768]` |
| `textual.onnx` | Text encoder — input `[N,77]` int64, output `[N,768]` |
| `visual_int8.onnx` | Vision encoder INT8, full quantization (max speed) |
| `textual_int8.onnx` | Text encoder INT8, full quantization (max speed) |
| `visual_int8_no_conv.onnx` | Vision INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| `textual_int8_no_conv.onnx` | Text INT8 without ConvInteger (e.g. ONNX Runtime in Rust) |
| `bpe.model` | Original tokenizer (YouTokenToMe). Use with Python/ruclip. |
| `tokenizer.json` | Same vocab in Hugging Face format — for runtimes without YouTokenToMe support |
| `config.json` | Hyperparameters (mean/std, resolution) |
| `preprocessing.md` | Image and text preprocessing specification |
| `inference_spec.json` | Machine-readable model spec (CI/CD, tooling) |
The `*_no_conv.onnx` variants avoid the `ConvInteger` op for runtimes that do not implement it (e.g. `ort` in Rust). They have the same I/O as the other INT8 models.
## Usage

### ONNX

Image input: normalized float32 `[N,3,336,336]` (see `preprocessing.md`). Text input: `[N,77]` int64 (BOS + token IDs + EOS, padded to 77). Special IDs: pad=0, unk=1, bos=2, eos=3.
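The text layout above (BOS + token IDs + EOS, zero-padded to length 77) can be sketched in numpy. This is a minimal illustration, not part of the repository; `token_ids_batch` is a placeholder for raw BPE IDs produced by `bpe.model` or `tokenizer.json`:

```python
import numpy as np

PAD, BOS, EOS = 0, 2, 3   # special IDs from the spec above
CONTEXT_LEN = 77

def build_text_tensor(token_ids_batch):
    """Wrap raw BPE token IDs as BOS + ids + EOS, zero-padded to 77."""
    out = np.full((len(token_ids_batch), CONTEXT_LEN), PAD, dtype=np.int64)
    for i, ids in enumerate(token_ids_batch):
        ids = list(ids)[: CONTEXT_LEN - 2]   # truncate to leave room for BOS/EOS
        row = [BOS] + ids + [EOS]
        out[i, : len(row)] = row
    return out
```

The resulting array can be fed directly as the `[N,77]` int64 input of `textual.onnx`.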
```python
import onnxruntime as ort

# Load the FP32 encoders on CPU
v = ort.InferenceSession("visual.onnx", providers=["CPUExecutionProvider"])
t = ort.InferenceSession("textual.onnx", providers=["CPUExecutionProvider"])

# image: float32 [N,3,336,336]; text: int64 [N,77]
img_emb = v.run(None, {v.get_inputs()[0].name: image})[0]
txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]
```
Tokenization: use `bpe.model` (Python/YouTokenToMe) or `tokenizer.json` (Hugging Face `tokenizers`). INT8 models have the same I/O; use `*_int8.onnx` for max speed, or `*_int8_no_conv.onnx` where `ConvInteger` isn't supported (e.g. `ort` in Rust).
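With both embeddings in hand, caption–image ranking reduces to cosine similarity between L2-normalized vectors, as in the standard CLIP recipe. A minimal numpy sketch (the learned softmax temperature is omitted, since it does not change the ranking order):

```python
import numpy as np

def rank(img_emb, txt_emb):
    """Cosine similarity matrix between image and text embeddings.

    img_emb: [N_img, 768], txt_emb: [N_txt, 768] from the encoders.
    Returns [N_img, N_txt] scores; argmax over axis 1 picks the best caption.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T
```

For zero-shot classification, encode one caption per class and take the argmax over captions for each image.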
## For contributors (advanced)

Scripts to regenerate or modify the models. Not needed for inference.
| Script | Purpose |
|---|---|
| `convert_to_onnx.py` | PyTorch → ONNX export |
| `verify_onnx.py` | Sanity-check ONNX outputs |
| `export_tokenizer_json.py` | `bpe.model` → `tokenizer.json` |
| `compare_tokenizers.py` | Verify tokenizer parity |
| `quantize_visual_model.py` | `visual.onnx` → `visual_int8.onnx` + `visual_int8_no_conv.onnx` |
| `quantize_text_model.py` | `textual.onnx` → `textual_int8.onnx` + `textual_int8_no_conv.onnx` |
Use `*_int8.onnx` where `ConvInteger` is supported; use `*_int8_no_conv.onnx` for runtimes that don't support it (e.g. `ort` in Rust).
Requires `pytorch_model.bin` (for conversion) and `requirements-convert.txt`. See the script docstrings for usage.
## Performance

Zero-shot metrics from the original RuCLIP paper (PyTorch). ONNX outputs are numerically equivalent.
| Dataset | Metric Name | Metric Result |
|---|---|---|
| Food101 | acc | 0.712 |
| CIFAR10 | acc | 0.906 |
| CIFAR100 | acc | 0.591 |
| Birdsnap | acc | 0.213 |
| SUN397 | acc | 0.523 |
| Stanford Cars | acc | 0.659 |
| DTD | acc | 0.408 |
| MNIST | acc | 0.242 |
| STL10 | acc | 0.956 |
| PCam | acc | 0.554 |
| CLEVR | acc | 0.142 |
| Rendered SST2 | acc | 0.539 |
| ImageNet | acc | 0.488 |
| FGVC Aircraft | mean-per-class | 0.075 |
| Oxford Pets | mean-per-class | 0.546 |
| Caltech101 | mean-per-class | 0.835 |
| Flowers102 | mean-per-class | 0.517 |
| HatefulMemes | roc-auc | 0.519 |
## Authors
RuCLIP: Alex Shonenkov, Daniil Chesakov, Denis Dimitrov, Igor Pavlov — ai-forever/ru-clip.
ONNX conversion: Tim Tkachev.