Upload folder using huggingface_hub

- README.md +26 -3
- compare_tokenizers.py +63 -0
- export_tokenizer_json.py +124 -0
- inference_spec.json +17 -0
- preprocessing.md +24 -0
- tokenizer.json +0 -0
README.md
CHANGED

````diff
@@ -1,3 +1,17 @@
+---
+language:
+- ru
+license: mit
+tags:
+- clip
+- onnx
+- vision
+- zero-shot-image-classification
+- image-text-similarity
+- text-ranking
+- image-ranking
+---
+
 # ruclip-vit-large-patch14-336
 
 **RuCLIP** (**Ru**ssian **C**ontrastive **L**anguage–**I**mage **P**retraining) is a multimodal model
@@ -24,14 +38,19 @@ Model was trained by [Sber AI](https://github.com/sberbank-ai) and [SberDevices]
 
 | File | Purpose |
 |------|---------|
-| `config.json` | Hyperparameters
-| `
+| `config.json` | Hyperparameters |
+| `inference_spec.json` | Model I/O spec: shapes, dtype, mean/std |
+| `preprocessing.md` | Image and text preprocessing specification |
+| `bpe.model` | BPE tokenizer (YouTokenToMe) |
+| `tokenizer.json` | Same vocab in HF format for Rust (`tokenizers` crate) |
 | `pytorch_model.bin` | PyTorch weights |
 | `visual.onnx` | Vision encoder (ONNX) — input `[N,3,336,336]` float32, output `[N,768]` |
 | `textual.onnx` | Text encoder (ONNX) — input `[N,77]` int64, output `[N,768]` |
 | `convert_to_onnx.py` | Export PyTorch → ONNX |
 | `verify_onnx.py` | Sanity check ONNX; `--compare` checks against PyTorch (needs `pytorch_model.bin`) |
 | `requirements-convert.txt` | Dependencies for conversion |
+| `export_tokenizer_json.py` | Export `bpe.model` → `tokenizer.json` (with merges) |
+| `compare_tokenizers.py` | Verify YouTokenToMe vs tokenizer.json output match |
 
 ## Usage
 
@@ -47,7 +66,9 @@ clip, processor = ruclip.load("ruclip-vit-large-patch14-336", device="cuda")
 
 ### ONNX
 
-Image: normalized `[N,3,336,336]` (config mean/std). Text:
+Image: normalized `[N,3,336,336]` (config mean/std). Text: `[N,77]` int64 (BOS + token_ids + EOS, pad to 77).
+
+**Rust**: use `tokenizer.json` with the [tokenizers](https://github.com/huggingface/tokenizers) crate. Generate with `python export_tokenizer_json.py` (includes merges). Verify parity: `python compare_tokenizers.py`. Special IDs: pad=0, unk=1, bos=2, eos=3.
 
 ```python
 import onnxruntime as ort
@@ -66,6 +87,8 @@ txt_emb = t.run(None, {t.get_inputs()[0].name: text})[0]
 4. `python convert_to_onnx.py --output-dir .`
 5. `python verify_onnx.py` (optionally `--compare` if `pytorch_model.bin` is present)
 
+**Tokenizer for Rust:** `pip install youtokentome && python export_tokenizer_json.py` (extracts merges via `yttm vocab --verbose`). Verify: `python compare_tokenizers.py`.
+
 ## Performance
 We have evaluated the performance on the following datasets:
 
````
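The text-side contract described above (lowercase, `BOS + token_ids + EOS`, pad with 0 to length 77) can be sketched as a small helper. This is an illustrative sketch, not code from the repo; the token IDs in the example are hypothetical placeholders standing in for real tokenizer output:

```python
# Build the [N,77] int64 text input for textual.onnx.
# Special IDs follow the repo convention: pad=0, bos=2, eos=3.
CONTEXT_LENGTH = 77
PAD_ID, BOS_ID, EOS_ID = 0, 2, 3

def build_text_input(token_ids: list[int]) -> list[int]:
    """Wrap token IDs with BOS/EOS, truncate, and right-pad to CONTEXT_LENGTH."""
    seq = [BOS_ID] + token_ids[: CONTEXT_LENGTH - 2] + [EOS_ID]
    return seq + [PAD_ID] * (CONTEXT_LENGTH - len(seq))

row = build_text_input([415, 8202])  # hypothetical IDs for a short phrase
print(len(row), row[:5])
```

Stacking `N` such rows into an int64 array gives the `[N, 77]` tensor the text encoder expects.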
compare_tokenizers.py
ADDED

@@ -0,0 +1,63 @@

```python
#!/usr/bin/env python3
"""
Compare YouTokenToMe (bpe.model) vs tokenizer.json outputs.
Ensures tokenizer.json produces identical IDs for ONNX compatibility.
"""
import sys

TEST_TEXTS = [
    "тест",
    "собака",
    "кошка",
    "фото кота",
    "красная машина",
    "привет мир",
    "а",  # single char
    "очень длинное предложение для проверки токенизации на русском языке",
]


def main():
    try:
        import youtokentome as yttm
    except ImportError:
        print("Install: pip install youtokentome", file=sys.stderr)
        sys.exit(1)
    try:
        from tokenizers import Tokenizer
    except ImportError:
        print("Install: pip install tokenizers", file=sys.stderr)
        sys.exit(1)

    yttm_bpe = yttm.BPE("bpe.model")
    hf_tok = Tokenizer.from_file("tokenizer.json")

    BOS_ID, EOS_ID = 2, 3

    all_ok = True
    for text in TEST_TEXTS:
        text_lower = text.lower()
        yttm_ids = yttm_bpe.encode([text_lower], output_type=yttm.OutputType.ID)[0]
        yttm_seq = [BOS_ID] + list(yttm_ids) + [EOS_ID]

        hf_enc = hf_tok.encode(text_lower)
        hf_ids = [BOS_ID] + hf_enc.ids + [EOS_ID]

        match = yttm_seq == hf_ids
        status = "OK" if match else "MISMATCH"
        if not match:
            all_ok = False
        print(f"  {status}: {text[:40]!r}")
        if not match:
            print(f"    YouTokenToMe:   {yttm_seq[:20]}{'...' if len(yttm_seq) > 20 else ''}")
            print(f"    tokenizer.json: {hf_ids[:20]}{'...' if len(hf_ids) > 20 else ''}")

    if all_ok:
        print("\nPASS: All outputs match. tokenizer.json is compatible with bpe.model.")
    else:
        print("\nFAIL: Some outputs differ. Regenerate tokenizer.json with merges.")
        sys.exit(1)


if __name__ == "__main__":
    main()
```
export_tokenizer_json.py
ADDED

@@ -0,0 +1,124 @@

```python
#!/usr/bin/env python3
"""
Export bpe.model (YouTokenToMe) to tokenizer.json for Rust/ONNX inference.
Extracts vocab and merges via yttm vocab --verbose. Compatible with Hugging Face tokenizers.
"""
import json
import argparse
import subprocess
import sys
import re

# RuCLIP special token IDs (YouTokenToMe defaults)
PAD_ID, UNK_ID, BOS_ID, EOS_ID = 0, 1, 2, 3
CONTEXT_LENGTH = 77


def _get_merges_from_yttm_verbose(model_path: str) -> list[tuple[str, str]]:
    """Run yttm vocab --verbose and parse merges. Returns list of (left, right) token pairs in order."""
    result = subprocess.run(
        ["yttm", "vocab", "--model", model_path, "--verbose"],
        capture_output=True,
        text=True,
        timeout=300,
    )
    if result.returncode != 0:
        raise RuntimeError(f"yttm vocab failed: {result.stderr}")
    merges: list[tuple[str, str]] = []
    # Format: id\ttoken or id\ttoken=left+right left_id+right_id
    simple_re = re.compile(r"^(\d+)\t(.+)$")
    for line in result.stdout.strip().split("\n"):
        line = line.rstrip()
        if not line:
            continue
        m = simple_re.match(line)
        if not m:
            continue
        rest = m.group(2)
        if "=" in rest:
            token_z, merge_part = rest.split("=", 1)
            parts = merge_part.split()
            tok_part = parts[0] if parts else ""
            if "+" in tok_part:
                token_x, token_y = tok_part.split("+", 1)
                merges.append((token_x, token_y))
    return merges


def export_tokenizer_json(bpe_model_path: str, output_path: str) -> None:
    try:
        import youtokentome as yttm
    except ImportError:
        print("Install: pip install youtokentome", file=sys.stderr)
        sys.exit(1)

    bpe_yttm = yttm.BPE(bpe_model_path)
    vocab_list = bpe_yttm.vocab()
    vocab = {tok: i for i, tok in enumerate(vocab_list)}

    # Extract merges from yttm vocab --verbose
    print("Extracting merges via yttm vocab --verbose...")
    try:
        merges = _get_merges_from_yttm_verbose(bpe_model_path)
        # Only include merges where both tokens are in vocab (avoid parser artifacts like "=▁")
        valid = [(x, y) for x, y in merges if x in vocab and y in vocab]
        n_skipped = len(merges) - len(valid)
        if n_skipped:
            print(f"  Skipped {n_skipped} merges (token not in vocab)")
        if valid:
            merges_hf = [f"{x} {y}" for x, y in valid]
            ignore_merges = False
        else:
            merges_hf = []
            ignore_merges = True
    except Exception as e:
        print(f"Could not extract merges ({e}), using ignore_merges=True", file=sys.stderr)
        merges_hf = []
        ignore_merges = True

    obj = {
        "version": "1.0",
        "truncation": None,
        "padding": None,
        "added_tokens": [
            {"id": PAD_ID, "special": True, "content": vocab_list[PAD_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": UNK_ID, "special": True, "content": vocab_list[UNK_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": BOS_ID, "special": True, "content": vocab_list[BOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
            {"id": EOS_ID, "special": True, "content": vocab_list[EOS_ID], "single_word": False, "lstrip": False, "rstrip": False, "normalized": False},
        ],
        "normalizer": {"type": "Lowercase"},
        "pre_tokenizer": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
        "post_processor": None,
        "decoder": {"type": "Metaspace", "replacement": "\u2581", "prepend_scheme": "first"},
        "model": {
            "type": "BPE",
            "dropout": None,
            "unk_token": vocab_list[UNK_ID],
            "continuing_subword_prefix": "",
            "end_of_word_suffix": "",
            "fuse_unk": False,
            "byte_fallback": False,
            "ignore_merges": ignore_merges,
            "vocab": vocab,
            "merges": merges_hf,
        },
    }

    with open(output_path, "w", newline="\n") as f:
        s = json.dumps(obj, ensure_ascii=True, indent=2)
        f.write(s)

    print(f"Saved {output_path} (vocab_size={len(vocab)}, merges={len(merges_hf)})")
    print("  Special: pad=0 unk=1 bos=2 eos=3")


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--model", default="bpe.model", help="Path to bpe.model")
    p.add_argument("--output", default="tokenizer.json", help="Output path")
    args = p.parse_args()
    export_tokenizer_json(args.model, args.output)


if __name__ == "__main__":
    main()
```
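The export above preserves merge *order* because BPE tokenization applies the earliest-ranked merge first; a reordered or dropped merge changes the resulting IDs. A toy, self-contained sketch of that loop (not the YouTokenToMe or Hugging Face implementation, and with a made-up merge table) illustrates the mechanism:

```python
def apply_bpe(word: list[str], merges: list[tuple[str, str]]) -> list[str]:
    """Greedily apply ordered merges: always merge the earliest-ranked adjacent pair."""
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        pairs = [(word[i], word[i + 1]) for i in range(len(word) - 1)]
        ranked = [(rank[p], i) for i, p in enumerate(pairs) if p in rank]
        if not ranked:
            return word
        _, i = min(ranked)  # position of the best-ranked pair
        word = word[:i] + [word[i] + word[i + 1]] + word[i + 2:]

# Toy merge table: ("a","b") ranks before ("ab","c"), so "abc" fuses fully.
print(apply_bpe(list("abc"), [("a", "b"), ("ab", "c")]))  # ['abc']
```

This is why `compare_tokenizers.py` checks full ID sequences rather than just vocabulary overlap.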
inference_spec.json
ADDED

@@ -0,0 +1,17 @@

```json
{
  "embed_dim": 768,
  "visual": {
    "model": "visual.onnx",
    "input": { "name": "input", "shape": ["N", 3, 336, 336], "dtype": "float32" },
    "output": { "shape": ["N", 768], "dtype": "float32" }
  },
  "textual": {
    "model": "textual.onnx",
    "input": { "name": "input", "shape": ["N", 77], "dtype": "int64" },
    "output": { "shape": ["N", 768], "dtype": "float32" }
  },
  "image_resolution": 336,
  "context_length": 77,
  "mean": [0.48145466, 0.4578275, 0.40821073],
  "std": [0.26862954, 0.26130258, 0.27577711]
}
```
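A consumer can read this spec instead of hard-coding shapes. A minimal sketch of that idea, with the relevant fields of the spec inlined as a string so the snippet runs standalone (the `input_elems` helper is illustrative, not part of the repo):

```python
import json

# Inlined subset of inference_spec.json; in practice, json.load(open("inference_spec.json")).
SPEC = json.loads("""
{
  "embed_dim": 768,
  "visual":  {"model": "visual.onnx",  "input": {"name": "input", "shape": ["N", 3, 336, 336], "dtype": "float32"}},
  "textual": {"model": "textual.onnx", "input": {"name": "input", "shape": ["N", 77], "dtype": "int64"}},
  "image_resolution": 336,
  "context_length": 77
}
""")

def input_elems(spec: dict, branch: str, batch: int) -> int:
    """Scalar element count of one input tensor, with the symbolic "N" bound to `batch`."""
    n = 1
    for d in spec[branch]["input"]["shape"]:
        n *= batch if d == "N" else d
    return n

print(input_elems(SPEC, "visual", 1))   # 3 * 336 * 336
print(input_elems(SPEC, "textual", 4))  # 4 * 77
```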
preprocessing.md
ADDED

@@ -0,0 +1,24 @@

```markdown
# Preprocessing Specification

## Image (visual.onnx)

- **Input shape:** `[N, 3, 336, 336]` (NCHW, batch first)
- **Input dtype:** float32
- **Layout:** RGB
- **Resolution:** 336×336 (center crop or resize without distortion to fill)
- **Normalization:** per-channel `(pixel / 255 - mean) / std`

| Channel | mean | std |
|---------|------|-----|
| R | 0.48145466 | 0.26862954 |
| G | 0.4578275 | 0.26130258 |
| B | 0.40821073 | 0.27577711 |

## Text (textual.onnx)

- **Input shape:** `[N, 77]`
- **Input dtype:** int64
- **Lowercase:** yes
- **Sequence:** `[BOS] + token_ids + [EOS]`, pad with 0 to length 77
- **Special IDs:** pad=0, unk=1, bos=2, eos=3
- **Tokenizer:** `tokenizer.json` or `bpe.model` (YouTokenToMe)
```
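The per-channel normalization rule above can be sketched in pure Python on a toy CHW pixel array (a real pipeline would use NumPy or an image library; this is only a sketch of the arithmetic):

```python
# Per-channel normalization from the spec: (pixel / 255 - mean) / std, RGB order.
MEAN = [0.48145466, 0.4578275, 0.40821073]
STD = [0.26862954, 0.26130258, 0.27577711]

def normalize_chw(pixels: list[list[list[int]]]) -> list[list[list[float]]]:
    """pixels: [3][H][W] uint8 RGB values -> normalized floats ready for visual.onnx."""
    return [
        [[(p / 255.0 - MEAN[c]) / STD[c] for p in row] for row in plane]
        for c, plane in enumerate(pixels)
    ]

# Tiny 1x2 "image": one mid-gray and one white pixel in each channel.
img = [[[128, 255]], [[128, 255]], [[128, 255]]]
out = normalize_chw(img)
print(out[0][0][1])  # white R pixel: (1.0 - 0.48145466) / 0.26862954
```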
tokenizer.json
ADDED

(Diff too large to render; see the raw file.)