Upload folder using huggingface_hub

- README.md +26 -3
- compare_tokenizers.py +63 -0
- export_tokenizer_json.py +124 -0
- inference_spec.json +17 -0
- preprocessing.md +24 -0
- tokenizer.json +0 -0
README.md
CHANGED

````diff
@@ -1,3 +1,17 @@
+---
+language:
+- ru
+license: mit
+tags:
+- clip
+- onnx
+- vision
+- zero-shot-image-classification
+- image-text-similarity
+- text-ranking
+- image-ranking
+---
+
 # ruclip-vit-large-patch14-336
 
 **RuCLIP** (**Ru**ssian **C**ontrastive **L**anguage–**I**mage **P**retraining) is a multimodal model
@@ -24,14 +38,19 @@ Model was trained by [Sber AI](https://github.com/sberbank-ai) and [SberDevices]
 
 | File | Purpose |
 |------|---------|
-| `config.json` | Hyperparameters
-| `
+| `config.json` | Hyperparameters |
+| `inference_spec.json` | Model I/O spec: shapes, dtype, mean/std |
+| `preprocessing.md` | Image and text preprocessing specification |
+| `bpe.model` | BPE tokenizer (YouTokenToMe) |
+| `tokenizer.json` | Same vocab in HF format for Rust (`tokenizers` crate) |
 | `pytorch_model.bin` | PyTorch weights |
 | `visual.onnx` | Vision encoder (ONNX) — input `[N,3,336,336]` float32, output `[N,768]` |
 | `textual.onnx` | Text encoder (ONNX) — input `[N,77]` int64, output `[N,768]` |
 | `convert_to_onnx.py` | Export PyTorch → ONNX |
 | `verify_onnx.py` | Sanity check ONNX; `--compare` checks against PyTorch (needs `pytorch_model.bin`) |
 | `requirements-convert.txt` | Dependencies for conversion |
+| `export_tokenizer_json.py` | Export `bpe.model` → `tokenizer.json` (with merges) |
+| `compare_tokenizers.py` | Verify YouTokenToMe vs tokenizer.json output match |
 
 ## Usage
 
@@ -47,7 +66,9 @@ clip, processor = ruclip.load("ruclip-vit-large-patch14-336", device="cuda")
 
 ### ONNX
 
-Image: normalized `[N,3,336,336]` (config mean/std). Text:
+Image: normalized `[N,3,336,336]` (config mean/std). Text: `[N,77]` int64 (BOS + token_ids + EOS, pad to 77).
+
+**Rust**: use `tokenizer.json` with the [tokenizers](https://github.com/huggingface/tokenizers) crate. Generate with `python export_tokenizer_json.py` (includes merges). Verify parity: `python compare_tokenizers.py`. Special IDs: pad=0, unk=1, bos=2, eos=3.
 
 ```python
 import onnxruntime as ort
@@ -66,6 +87,8 @@ txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]
 4. `python convert_to_onnx.py --output-dir .`
 5. `python verify_onnx.py` (optionally `--compare` if `pytorch_model.bin` is present)
 
+**Tokenizer for Rust:** `pip install youtokentome && python export_tokenizer_json.py` (extracts merges via `yttm vocab --verbose`). Verify: `python compare_tokenizers.py`.
+
 ## Performance
 We have evaluated the performance on the following datasets:
 
````
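The text-side contract described above (lowercase, `BOS + token_ids + EOS`, pad with 0 to length 77) can be sketched as a small helper. This is an illustrative sketch, not code from the repo; the token IDs in the example are hypothetical placeholders standing in for real tokenizer output:

```python
# Build the [N,77] int64 text input for textual.onnx.
# Special IDs follow the repo convention: pad=0, bos=2, eos=3.
CONTEXT_LENGTH = 77
PAD_ID, BOS_ID, EOS_ID = 0, 2, 3

def build_text_input(token_ids: list[int]) -> list[int]:
    """Wrap token IDs with BOS/EOS, truncate, and right-pad to CONTEXT_LENGTH."""
    seq = [BOS_ID] + token_ids[: CONTEXT_LENGTH - 2] + [EOS_ID]
    return seq + [PAD_ID] * (CONTEXT_LENGTH - len(seq))

row = build_text_input([415, 8202])  # hypothetical IDs for a short phrase
print(len(row), row[:5])
```

Stacking `N` such rows into an int64 array gives the `[N, 77]` tensor the text encoder expects.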
compare_tokenizers.py
ADDED

@@ -0,0 +1,63 @@

```python
#!/usr/bin/env python3
"""
Compare YouTokenToMe (bpe.model) vs tokenizer.json outputs.
Ensures tokenizer.json produces identical IDs for ONNX compatibility.
"""
import sys

TEST_TEXTS = [
    "тест",
    "собака",
    "кошка",
    "фото кота",
    "красная машина",
    "привет мир",
    "а",  # single char
    "очень длинное предложение для проверки токенизации на русском языке",
]


def main():
    try:
        import youtokentome as yttm
    except ImportError:
        print("Install: pip install youtokentome", file=sys.stderr)
        sys.exit(1)
    try:
        from tokenizers import Tokenizer
    except ImportError:
        print("Install: pip install tokenizers", file=sys.stderr)
        sys.exit(1)

    yttm_bpe = yttm.BPE("bpe.model")
    hf_tok = Tokenizer.from_file("tokenizer.json")

    BOS_ID, EOS_ID = 2, 3

    all_ok = True
    for text in TEST_TEXTS:
        text_lower = text.lower()
        yttm_ids = yttm_bpe.encode([text_lower], output_type=yttm.OutputType.ID)[0]
        yttm_seq = [BOS_ID] + list(yttm_ids) + [EOS_ID]

        hf_enc = hf_tok.encode(text_lower)
        hf_ids = [BOS_ID] + hf_enc.ids + [EOS_ID]

        match = yttm_seq == hf_ids
        status = "OK" if match else "MISMATCH"
        if not match:
            all_ok = False
        print(f"  {status}: {text[:40]!r}")
        if not match:
            print(f"    YouTokenToMe:   {yttm_seq[:20]}{'...' if len(yttm_seq) > 20 else ''}")
            print(f"    tokenizer.json: {hf_ids[:20]}{'...' if len(hf_ids) > 20 else ''}")

    if all_ok:
        print("\nPASS: All outputs match. tokenizer.json is compatible with bpe.model.")
    else:
        print("\nFAIL: Some outputs differ. Regenerate tokenizer.json with merges.")
        sys.exit(1)


if __name__ == "__main__":
    main()
```
export_tokenizer_json.py
ADDED

@@ -0,0 +1,124 @@

```python
#!/usr/bin/env python3
"""
Export bpe.model (YouTokenToMe) to tokenizer.json for Rust/ONNX inference.
Extracts vocab and merges via yttm vocab --verbose. Compatible with Hugging Face tokenizers.
"""
import json
import argparse
import subprocess
import sys
import re

# RuCLIP special token IDs (YouTokenToMe defaults)
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3
CONTEXT_LENGTH = 77


def _get_merges_from_yttm_verbose(model_path: str) -> list[tuple[str, str]]:
    """Run yttm vocab --verbose and parse merges. Returns list of (left, right) token pairs in order."""
    result = subprocess.run(
        ["yttm", "vocab", "--model", model_path, "--verbose"],
        capture_output=True,
        text=True,
        timeout=300,
    )
    if result.returncode != 0:
        raise RuntimeError(f"yttm vocab failed: {result.stderr}")
    merges: list[tuple[str, str]] = []
    # Format: id\ttoken or id\ttoken=left+right left_id+right_id
    simple_re = re.compile(r"^(\d+)\t(.+)$")
    for line in result.stdout.strip().split("\n"):
        line = line.rstrip()
        if not line:
            continue
        m = simple_re.match(line)
        if not m:
            continue
        rest = m.group(2)
        if "=" in rest:
            token_z, merge_part = rest.split("=", 1)
            parts = merge_part.split()
            tok_part = parts[0] if parts else ""
            if "+" in tok_part:
                token_x, token_y = tok_part.split("+", 1)
                merges.append((token_x, token_y))
    return merges


def export_tokenizer_json(bpe_model_path: str, output_path: str) -> None:
    try:
        import youtokentome as yttm
    except ImportError:
        print("Install: pip install youtokentome", file=sys.stderr)
        sys.exit(1)

    bpe_yttm = yttm.BPE(bpe_model_path)
    vocab_list = bpe_yttm.vocab()
    vocab = {tok: i for i, tok in enumerate(vocab_list)}

    # Extract merges from yttm vocab --verbose
    print("Extracting merges via yttm vocab --verbose...")
    try:
        merges = _get_merges_from_yttm_verbose(bpe_model_path)
        # Only include merges where both tokens are in vocab (avoid parser artifacts like "=▁")
        valid = [(x, y) for x, y in merges if x in vocab and y in vocab]
        n_skipped = len(merges) - len(valid)
        if n_skipped:
            print(f"  Skipped {n_skipped} merges (token not in vocab)")
        if valid:
            merges_hf = [f"{x} {y}" for x, y in valid]
            ignore_merges = False
        else:
            merges_hf = []
            ignore_merges = True
    except Exception as e:
        print(f"Could not extract merges ({e}), using ignore_merges=True", file=sys.stderr)
        merges_hf = []
        ignore_merges = True

    obj = {
        "version": "1.0",
        "truncation": None,
        "padding": None,
        "added_tokens": [
            {"id": PAD_ID, "special": True, "content": vocab_list[PAD_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": UNK_ID, "special": True, "content": vocab_list[UNK_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": BOS_ID, "special": True, "content": vocab_list[BOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": EOS_ID, "special": True, "content": vocab_list[EOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
        ],
        "normalizer": {"type": "Lowercase"},
        "pre_tokenizer": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
        "post_processor": None,
        "decoder": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
        "model": {
            "type": "BPE",
            "dropout": None,
            "unk_token": vocab_list[UNK_ID],
            "continuing_subword_prefix": "",
            "end_of_word_suffix": "",
            "fuse_unk": False,
            "byte_fallback": False,
            "ignore_merges": ignore_merges,
            "vocab": vocab,
            "merges": merges_hf,
        },
    }

    with open(output_path, "w", newline="\n") as f:
        s = json.dumps(obj, ensure_ascii=True, indent=2)
        f.write(s)

    print(f"Saved {output_path} (vocab_size={len(vocab)}, merges={len(merges_hf)})")
    print("  Special: pad=0 unk=1 bos=2 eos=3")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--model", default="bpe.model", help="Path to bpe.model")
    p.add_argument("--output", default="tokenizer.json", help="Output path")
    args = p.parse_args()
    export_tokenizer_json(args.model, args.output)


if __name__ == "__main__":
    main()
```
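The export above preserves merge *order* because BPE tokenization applies the earliest-ranked merge first; a reordered or dropped merge changes the resulting IDs. A toy, self-contained sketch of that loop (not the YouTokenToMe or Hugging Face implementation, and with a made-up merge table) illustrates the mechanism:

```python
def apply_bpe(word: list[str], merges: list[tuple[str, str]]) -> list[str]:
    """Greedily apply ordered merges: always merge the earliest-ranked adjacent pair."""
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        ranked = [(rank[p], i) for i, p in enumerate(pairs) if p in rank]
        if not ranked:
            return word
        _, i = min(ranked)  # position of the best-ranked pair
        word = word[:i] + [word[i] + word[i + 1]] + word[i + 2:]

# Toy merge table: ("a","b") ranks before ("ab","c"), so "abc" fuses fully.
print(apply_bpe(list("abc"), [("a", "b"), ("ab", "c")]))  # ['abc']
```

This is why `compare_tokenizers.py` checks full ID sequences rather than just vocabulary overlap.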
inference_spec.json
ADDED

@@ -0,0 +1,17 @@

```json
{
  "embed_dim": 768,
  "visual": {
    "model": "visual.onnx",
    "input": { "name": "input", "shape": ["N", 3, 336, 336], "dtype": "float32" },
    "output": { "shape": ["N", 768], "dtype": "float32" }
  },
  "textual": {
    "model": "textual.onnx",
    "input": { "name": "input", "shape": ["N", 77], "dtype": "int64" },
    "output": { "shape": ["N", 768], "dtype": "float32" }
  },
  "image_resolution": 336,
  "context_length": 77,
  "mean": [0.48145466, 0.4578275, 0.40821073],
  "std": [0.26862954, 0.26130258, 0.27577711]
}
```
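A consumer can read this spec instead of hard-coding shapes. A minimal sketch of that idea, with the relevant fields of the spec inlined as a string so the snippet runs standalone (the `input_elems` helper is illustrative, not part of the repo):

```python
import json

# Inlined subset of inference_spec.json; in practice, json.load(open("inference_spec.json")).
SPEC = json.loads("""
{
  "embed_dim": 768,
  "visual":  {"model": "visual.onnx",  "input": {"name": "input", "shape": ["N", 3, 336, 336], "dtype": "float32"}},
  "textual": {"model": "textual.onnx", "input": {"name": "input", "shape": ["N", 77], "dtype": "int64"}},
  "image_resolution": 336,
  "context_length": 77
}
""")

def input_elems(spec: dict, branch: str, batch: int) -> int:
    """Scalar element count of one input tensor, with the symbolic "N" bound to `batch`."""
    n = 1
    for d in spec[branch]["input"]["shape"]:
        n *= batch if d == "N" else d
    return n

print(input_elems(SPEC, "visual", 1))   # 3 * 336 * 336
print(input_elems(SPEC, "textual", 4))  # 4 * 77
```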
preprocessing.md
ADDED

@@ -0,0 +1,24 @@

```markdown
# Preprocessing Specification

## Image (visual.onnx)

- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
- **Input dtype:** float32
- **Layout:** RGB
- **Resolution:** 336×336 (center crop or resize without distortion to fill)
- **Normalization:** per-channel `(pixel / 255 - mean) / std`

| Channel | mean | std |
|---------|------|-----|
| R | 0.48145466 | 0.26862954 |
| G | 0.4578275 | 0.26130258 |
| B | 0.40821073 | 0.27577711 |

## Text (textual.onnx)

- **Input shape:** `[N, 77]`
- **Input dtype:** int64
- **Lowercase:** yes
- **Sequence:** `[BOS] + token_ids + [EOS]`, pad with 0 to length 77
- **Special IDs:** pad=0, unk=1, bos=2, eos=3
- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
```
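The per-channel normalization rule above can be sketched in pure Python on a toy CHW pixel array (a real pipeline would use NumPy or an image library; this is only a sketch of the arithmetic):

```python
# Per-channel normalization from the spec: (pixel / 255 - mean) / std, RGB order.
MEAN = [0.48145466, 0.4578275, 0.40821073]
STD = [0.26862954, 0.26130258, 0.27577711]

def normalize_chw(pixels: list[list[list[int]]]) -> list[list[list[float]]]:
    """pixels: [3][H][W] uint8 RGB values -> normalized floats ready for visual.onnx."""
    return [
        [[(p / 255.0 - MEAN[c]) / STD[c] for p in row] for row in plane]
        for c, plane in enumerate(pixels)
    ]

# Tiny 1x2 "image": one mid-gray and one white pixel in each channel.
img = [[[128, 255]], [[128, 255]], [[128, 255]]]
out = normalize_chw(img)
print(out[0][0][1])  # white R pixel: (1.0 - 0.48145466) / 0.26862954
```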
tokenizer.json
ADDED

(Diff too large to render; see the raw file.)