ttkacheff committed
Commit de7a6b5 · verified · 1 parent: 609ea96

Upload folder using huggingface_hub
README.md CHANGED
@@ -1,3 +1,17 @@
 # ruclip-vit-large-patch14-336

 **RuCLIP** (**Ru**ssian **C**ontrastive **L**anguage–**I**mage **P**retraining) is a multimodal model
@@ -24,14 +38,19 @@ Model was trained by [Sber AI](https://github.com/sberbank-ai) and [SberDevices]

 | File | Purpose |
 |------|---------|
-| `config.json` | Hyperparameters (resolution 336, embed 768, etc.) |
-| `bpe.model` | BPE tokenizer for text |
 | `pytorch_model.bin` | PyTorch weights |
 | `visual.onnx` | Vision encoder (ONNX) — input `[N,3,336,336]` float32, output `[N,768]` |
 | `textual.onnx` | Text encoder (ONNX) — input `[N,77]` int64, output `[N,768]` |
 | `convert_to_onnx.py` | Export PyTorch → ONNX |
 | `verify_onnx.py` | Sanity check ONNX; `--compare` checks against PyTorch (needs `pytorch_model.bin`) |
 | `requirements-convert.txt` | Dependencies for conversion |

 ## Usage
@@ -47,7 +66,9 @@ clip, processor = ruclip.load("ruclip-vit-large-patch14-336", device="cuda")

 ### ONNX

-Image: normalized `[N,3,336,336]` (config mean/std). Text: tokenized via `bpe.model` to `[N,77]`.

 ```python
 import onnxruntime as ort
@@ -66,6 +87,8 @@ txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]
 4. `python convert_to_onnx.py --output-dir .`
 5. `python verify_onnx.py` (optionally `--compare` if `pytorch_model.bin` is present)

 ## Performance
 We have evaluated the performance on the following datasets:
 
+---
+language:
+- ru
+license: mit
+tags:
+- clip
+- onnx
+- vision
+- zero-shot-image-classification
+- image-text-similarity
+- text-ranking
+- image-ranking
+---
+
 # ruclip-vit-large-patch14-336

 **RuCLIP** (**Ru**ssian **C**ontrastive **L**anguage–**I**mage **P**retraining) is a multimodal model
 
 | File | Purpose |
 |------|---------|
+| `config.json` | Hyperparameters |
+| `inference_spec.json` | Model I/O spec: shapes, dtype, mean/std |
+| `preprocessing.md` | Image and text preprocessing specification |
+| `bpe.model` | BPE tokenizer (YouTokenToMe) |
+| `tokenizer.json` | Same vocab in Hugging Face format for Rust (`tokenizers` crate) |
 | `pytorch_model.bin` | PyTorch weights |
 | `visual.onnx` | Vision encoder (ONNX) — input `[N,3,336,336]` float32, output `[N,768]` |
 | `textual.onnx` | Text encoder (ONNX) — input `[N,77]` int64, output `[N,768]` |
 | `convert_to_onnx.py` | Export PyTorch → ONNX |
 | `verify_onnx.py` | Sanity check ONNX; `--compare` checks against PyTorch (needs `pytorch_model.bin`) |
 | `requirements-convert.txt` | Dependencies for conversion |
+| `export_tokenizer_json.py` | Export `bpe.model` → `tokenizer.json` (with merges) |
+| `compare_tokenizers.py` | Verify that YouTokenToMe and `tokenizer.json` outputs match |

 ## Usage
 
 ### ONNX

+Image: normalized `[N,3,336,336]` (config mean/std). Text: `[N,77]` int64 (`BOS + token_ids + EOS`, padded with 0 to length 77).
+
+**Rust**: use `tokenizer.json` with the [tokenizers](https://github.com/huggingface/tokenizers) crate. Generate it with `python export_tokenizer_json.py` (includes merges), and verify parity with `python compare_tokenizers.py`. Special IDs: pad=0, unk=1, bos=2, eos=3.

 ```python
 import onnxruntime as ort
 
 4. `python convert_to_onnx.py --output-dir .`
 5. `python verify_onnx.py` (optionally `--compare` if `pytorch_model.bin` is present)

+**Tokenizer for Rust:** `pip install youtokentome && python export_tokenizer_json.py` (extracts merges via `yttm vocab --verbose`). Verify: `python compare_tokenizers.py`.
+
 ## Performance
 We have evaluated the performance on the following datasets:
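The `[N,77]` text input described in the README above can be sketched in plain Python. Only the BOS/EOS/pad convention (bos=2, eos=3, pad=0, context length 77) comes from this repo; the token IDs below are invented for illustration and a real run would come from `bpe.model` (YouTokenToMe) or `tokenizer.json`.

```python
# Sketch: build one row of the [N,77] int64 input for textual.onnx.
# BOS (2) and EOS (3) wrap the token IDs; pad (0) fills to 77 tokens.
BOS_ID, EOS_ID, PAD_ID = 2, 3, 0
CONTEXT_LENGTH = 77

def build_text_input(token_ids):
    body = token_ids[:CONTEXT_LENGTH - 2]  # leave room for BOS/EOS
    seq = [BOS_ID] + body + [EOS_ID]
    return seq + [PAD_ID] * (CONTEXT_LENGTH - len(seq))

row = build_text_input([4521, 870, 19033])  # hypothetical IDs
```

Stacking N such rows as int64 gives the `[N,77]` tensor `textual.onnx` expects.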
 
compare_tokenizers.py ADDED
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+"""
+Compare YouTokenToMe (bpe.model) vs tokenizer.json outputs.
+Ensures tokenizer.json produces identical IDs for ONNX compatibility.
+"""
+import sys
+
+TEST_TEXTS = [
+    "тест",
+    "собака",
+    "кошка",
+    "фото кота",
+    "красная машина",
+    "привет мир",
+    "а",  # single char
+    "очень длинное предложение для проверки токенизации на русском языке",
+]
+
+
+def main():
+    try:
+        import youtokentome as yttm
+    except ImportError:
+        print("Install: pip install youtokentome", file=sys.stderr)
+        sys.exit(1)
+    try:
+        from tokenizers import Tokenizer
+    except ImportError:
+        print("Install: pip install tokenizers", file=sys.stderr)
+        sys.exit(1)
+
+    yttm_bpe = yttm.BPE("bpe.model")
+    hf_tok = Tokenizer.from_file("tokenizer.json")
+
+    BOS_ID, EOS_ID = 2, 3
+
+    all_ok = True
+    for text in TEST_TEXTS:
+        text_lower = text.lower()
+        yttm_ids = yttm_bpe.encode([text_lower], output_type=yttm.OutputType.ID)[0]
+        yttm_seq = [BOS_ID] + list(yttm_ids) + [EOS_ID]
+
+        hf_enc = hf_tok.encode(text_lower)
+        hf_ids = [BOS_ID] + hf_enc.ids + [EOS_ID]
+
+        match = yttm_seq == hf_ids
+        status = "OK" if match else "MISMATCH"
+        if not match:
+            all_ok = False
+        print(f"  {status}: {text[:40]!r}")
+        if not match:
+            print(f"    YouTokenToMe:   {yttm_seq[:20]}{'...' if len(yttm_seq) > 20 else ''}")
+            print(f"    tokenizer.json: {hf_ids[:20]}{'...' if len(hf_ids) > 20 else ''}")
+
+    if all_ok:
+        print("\nPASS: All outputs match. tokenizer.json is compatible with bpe.model.")
+    else:
+        print("\nFAIL: Some outputs differ. Regenerate tokenizer.json with merges.")
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
export_tokenizer_json.py ADDED
@@ -0,0 +1,124 @@
+#!/usr/bin/env python3
+"""
+Export bpe.model (YouTokenToMe) to tokenizer.json for Rust/ONNX inference.
+Extracts vocab and merges via `yttm vocab --verbose`. Compatible with Hugging Face tokenizers.
+"""
+import json
+import argparse
+import subprocess
+import sys
+import re
+
+# RuCLIP special token IDs (YouTokenToMe defaults)
+PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3
+CONTEXT_LENGTH = 77
+
+
+def _get_merges_from_yttm_verbose(model_path: str) -> list[tuple[str, str]]:
+    """Run `yttm vocab --verbose` and parse merges. Returns (left, right) token pairs in order."""
+    result = subprocess.run(
+        ["yttm", "vocab", "--model", model_path, "--verbose"],
+        capture_output=True,
+        text=True,
+        timeout=300,
+    )
+    if result.returncode != 0:
+        raise RuntimeError(f"yttm vocab failed: {result.stderr}")
+    merges: list[tuple[str, str]] = []
+    # Line format: "id\ttoken" or "id\ttoken=left+right left_id+right_id"
+    simple_re = re.compile(r"^(\d+)\t(.+)$")
+    for line in result.stdout.strip().split("\n"):
+        line = line.rstrip()
+        if not line:
+            continue
+        m = simple_re.match(line)
+        if not m:
+            continue
+        rest = m.group(2)
+        if "=" in rest:
+            token_z, merge_part = rest.split("=", 1)
+            parts = merge_part.split()
+            tok_part = parts[0] if parts else ""
+            if "+" in tok_part:
+                token_x, token_y = tok_part.split("+", 1)
+                merges.append((token_x, token_y))
+    return merges
+
+
+def export_tokenizer_json(bpe_model_path: str, output_path: str) -> None:
+    try:
+        import youtokentome as yttm
+    except ImportError:
+        print("Install: pip install youtokentome", file=sys.stderr)
+        sys.exit(1)
+
+    bpe_yttm = yttm.BPE(bpe_model_path)
+    vocab_list = bpe_yttm.vocab()
+    vocab = {tok: i for i, tok in enumerate(vocab_list)}
+
+    # Extract merges from `yttm vocab --verbose`
+    print("Extracting merges via yttm vocab --verbose...")
+    try:
+        merges = _get_merges_from_yttm_verbose(bpe_model_path)
+        # Only include merges where both tokens are in vocab (avoids parser artifacts like "=▁")
+        valid = [(x, y) for x, y in merges if x in vocab and y in vocab]
+        n_skipped = len(merges) - len(valid)
+        if n_skipped:
+            print(f"  Skipped {n_skipped} merges (token not in vocab)")
+        if valid:
+            merges_hf = [f"{x} {y}" for x, y in valid]
+            ignore_merges = False
+        else:
+            merges_hf = []
+            ignore_merges = True
+    except Exception as e:
+        print(f"Could not extract merges ({e}); falling back to ignore_merges=True", file=sys.stderr)
+        merges_hf = []
+        ignore_merges = True
+
+    obj = {
+        "version": "1.0",
+        "truncation": None,
+        "padding": None,
+        "added_tokens": [
+            {"id": PAD_ID, "special": True, "content": vocab_list[PAD_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
+            {"id": UNK_ID, "special": True, "content": vocab_list[UNK_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
+            {"id": BOS_ID, "special": True, "content": vocab_list[BOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
+            {"id": EOS_ID, "special": True, "content": vocab_list[EOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
+        ],
+        "normalizer": {"type": "Lowercase"},
+        "pre_tokenizer": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
+        "post_processor": None,
+        "decoder": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
+        "model": {
+            "type": "BPE",
+            "dropout": None,
+            "unk_token": vocab_list[UNK_ID],
+            "continuing_subword_prefix": "",
+            "end_of_word_suffix": "",
+            "fuse_unk": False,
+            "byte_fallback": False,
+            "ignore_merges": ignore_merges,
+            "vocab": vocab,
+            "merges": merges_hf,
+        },
+    }
+
+    with open(output_path, "w", newline="\n") as f:
+        s = json.dumps(obj, ensure_ascii=True, indent=2)
+        f.write(s)
+
+    print(f"Saved {output_path} (vocab_size={len(vocab)}, merges={len(merges_hf)})")
+    print("  Special: pad=0 unk=1 bos=2 eos=3")
+
+
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--model", default="bpe.model", help="Path to bpe.model")
+    p.add_argument("--output", default="tokenizer.json", help="Output path")
+    args = p.parse_args()
+    export_tokenizer_json(args.model, args.output)
+
+
+if __name__ == "__main__":
+    main()
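The merge-extraction step above can be exercised in isolation. The sample line below follows the `id\ttoken=left+right left_id+right_id` shape the script's comment describes; the real `yttm vocab --verbose` output may differ, so treat this as a sketch of the parsing logic only (the IDs and tokens are made up):

```python
import re

# Hypothetical verbose-vocab line: "<id>\t<token>=<left>+<right> <left_id>+<right_id>"
sample = "505\t\u2581кот=\u2581ко+т 341+29"

simple_re = re.compile(r"^(\d+)\t(.+)$")
m = simple_re.match(sample)
rest = m.group(2)                        # "▁кот=▁ко+т 341+29"
token_z, merge_part = rest.split("=", 1) # merged token vs. merge description
tok_part = merge_part.split()[0]         # "▁ко+т" (drop the trailing id pair)
left, right = tok_part.split("+", 1)     # the (left, right) merge pair
```

Each such pair becomes one `"left right"` entry in the `merges` list of `tokenizer.json`.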
inference_spec.json ADDED
@@ -0,0 +1,17 @@
+{
+  "embed_dim": 768,
+  "visual": {
+    "model": "visual.onnx",
+    "input": { "name": "input", "shape": ["N", 3, 336, 336], "dtype": "float32" },
+    "output": { "shape": ["N", 768], "dtype": "float32" }
+  },
+  "textual": {
+    "model": "textual.onnx",
+    "input": { "name": "input", "shape": ["N", 77], "dtype": "int64" },
+    "output": { "shape": ["N", 768], "dtype": "float32" }
+  },
+  "image_resolution": 336,
+  "context_length": 77,
+  "mean": [0.48145466, 0.4578275, 0.40821073],
+  "std": [0.26862954, 0.26130258, 0.27577711]
+}
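A consumer can read this spec to validate its buffers before calling ONNX Runtime. A minimal sketch, with the relevant spec fields inlined as a string rather than read from `inference_spec.json` so the snippet is self-contained:

```python
import json

# Subset of inference_spec.json, inlined for illustration;
# in practice: spec = json.load(open("inference_spec.json"))
spec = json.loads("""{
  "visual":  {"input": {"shape": ["N", 3, 336, 336], "dtype": "float32"}},
  "textual": {"input": {"shape": ["N", 77], "dtype": "int64"}},
  "context_length": 77,
  "mean": [0.48145466, 0.4578275, 0.40821073],
  "std": [0.26862954, 0.26130258, 0.27577711]
}""")

def concrete_shape(shape, n):
    # Replace the symbolic batch dimension "N" with a concrete size.
    return tuple(n if d == "N" else d for d in shape)

img_shape = concrete_shape(spec["visual"]["input"]["shape"], 8)
txt_shape = concrete_shape(spec["textual"]["input"]["shape"], 8)
```

Keeping shapes, dtypes, and mean/std in one JSON file lets non-Python runtimes (e.g. the Rust path) share the same source of truth.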
preprocessing.md ADDED
@@ -0,0 +1,24 @@
+# Preprocessing Specification
+
+## Image (visual.onnx)
+
+- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
+- **Input dtype:** float32
+- **Layout:** RGB
+- **Resolution:** 336×336 (center crop, or resize to fill without distortion)
+- **Normalization:** per-channel `(pixel / 255 - mean) / std`
+
+| Channel | mean | std |
+|---------|------|-----|
+| R | 0.48145466 | 0.26862954 |
+| G | 0.4578275 | 0.26130258 |
+| B | 0.40821073 | 0.27577711 |
+
+## Text (textual.onnx)
+
+- **Input shape:** `[N, 77]`
+- **Input dtype:** int64
+- **Lowercase:** yes
+- **Sequence:** `[BOS] + token_ids + [EOS]`, padded with 0 to length 77
+- **Special IDs:** pad=0, unk=1, bos=2, eos=3
+- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
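The image formula above, sketched with NumPy on a dummy image (mean/std taken from the table; the resize/crop to 336×336 is assumed to have already happened):

```python
import numpy as np

MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def normalize(rgb_uint8: np.ndarray) -> np.ndarray:
    """HWC uint8 RGB (already 336x336) -> [1, 3, 336, 336] float32 for visual.onnx."""
    x = rgb_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - MEAN) / STD                      # per-channel normalize
    return x.transpose(2, 0, 1)[None]         # HWC -> CHW, add batch dim

img = np.full((336, 336, 3), 255, dtype=np.uint8)  # dummy all-white image
batch = normalize(img)
```

Concatenating several such arrays along axis 0 yields the `[N, 3, 336, 336]` batch.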
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff