VadimHursevich committed on
Commit aa746ae · verified · 1 Parent(s): bb2ebf6

Upload ONNX model and configs

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+     "word_embedding_dimension": 312,
+     "pooling_mode_cls_token": true,
+     "pooling_mode_mean_tokens": false,
+     "pooling_mode_max_tokens": false,
+     "pooling_mode_mean_sqrt_len_tokens": false,
+     "pooling_mode_weightedmean_tokens": false,
+     "pooling_mode_lasttoken": false,
+     "include_prompt": true
+ }
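
This pooling config enables only CLS-token pooling, so the sentence embedding is the 312-dimensional hidden state of the `[CLS]` token. A minimal sketch (not part of the commit, assuming a local copy of the repo's `1_Pooling` directory) of how `sentence_transformers` rebuilds this module from the file above:

```python
# Sketch: sentence_transformers reads 1_Pooling/config.json to rebuild the pooling layer.
from sentence_transformers import models

pooling = models.Pooling.load("1_Pooling")          # path to the directory shown above
print(pooling.get_pooling_mode_str())                # "cls"
print(pooling.get_sentence_embedding_dimension())    # 312
```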
README.md ADDED
@@ -0,0 +1,96 @@
+ ---
+ language:
+ - ru
+ pipeline_tag: sentence-similarity
+ tags:
+ - russian
+ - pretraining
+ - embeddings
+ - tiny
+ - feature-extraction
+ - sentence-similarity
+ - sentence-transformers
+ - transformers
+ license: mit
+ base_model: cointegrated/rubert-tiny2
+ ---
+
+ ## A fast BERT for semantic text similarity (STS) on CPU
+
+ A fast BERT model for computing compact sentence embeddings in Russian. It is based on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and has the same context size (2048), embedding dimension (312), and speed.
+
+ ## Using the model with the `transformers` library:
+
+ ```python
+ # pip install transformers sentencepiece
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-tiny-sts")
+ model = AutoModel.from_pretrained("sergeyzh/rubert-tiny-sts")
+ # model.cuda()  # uncomment if you have a GPU
+
+ def embed_bert_cls(text, model, tokenizer):
+     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
+     with torch.no_grad():
+         model_output = model(**{k: v.to(model.device) for k, v in t.items()})
+     embeddings = model_output.last_hidden_state[:, 0, :]
+     embeddings = torch.nn.functional.normalize(embeddings)
+     return embeddings[0].cpu().numpy()
+
+ print(embed_bert_cls('привет мир', model, tokenizer).shape)
+ # (312,)
+ ```
+
+ ## Usage with `sentence_transformers`:
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer('sergeyzh/rubert-tiny-sts')
+
+ sentences = ["привет мир", "hello world", "здравствуй вселенная"]
+ embeddings = model.encode(sentences)
+ print(util.dot_score(embeddings, embeddings))
+ ```
+
+ ## Metrics
+
+ Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | STS | PI | NLI | SA | TI |
+ |:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
+ | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
+ | **sergeyzh/rubert-tiny-sts** | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |
+
+ **Tasks:**
+
+ - Semantic text similarity (**STS**);
+ - Paraphrase identification (**PI**);
+ - Natural language inference (**NLI**);
+ - Sentiment analysis (**SA**);
+ - Toxicity identification (**TI**).
+
+ ## Speed and size
+
+ On the [encodechka](https://github.com/avidale/encodechka) benchmark:
+
+ | Model | CPU | GPU | size | dim | n_ctx | n_vocab |
+ |:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
+ | [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
+ | [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
+ | [sergeyzh/rubert-mini-sts](https://huggingface.co/sergeyzh/rubert-mini-sts) | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
+ | **sergeyzh/rubert-tiny-sts** | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
+ | [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
+ | [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
+ | [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+     "architectures": [
+         "BertForPreTraining"
+     ],
+     "attention_probs_dropout_prob": 0.1,
+     "classifier_dropout": null,
+     "dtype": "float32",
+     "emb_size": 312,
+     "gradient_checkpointing": false,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.1,
+     "hidden_size": 312,
+     "initializer_range": 0.02,
+     "intermediate_size": 600,
+     "layer_norm_eps": 1e-12,
+     "max_position_embeddings": 2048,
+     "model_type": "bert",
+     "num_attention_heads": 12,
+     "num_hidden_layers": 3,
+     "pad_token_id": 0,
+     "position_embedding_type": "absolute",
+     "transformers_version": "4.57.6",
+     "type_vocab_size": 2,
+     "use_cache": true,
+     "vocab_size": 83828
+ }
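
The config describes a 3-layer BERT with hidden size 312 and a 2048-token positional range. A minimal sketch (not part of the commit, assuming the `sergeyzh/rubert-tiny-sts` repo id from the README) of inspecting these values with `transformers.AutoConfig`:

```python
# Sketch: load the config above and check the key dimensions.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("sergeyzh/rubert-tiny-sts")  # repo id is an assumption
print(config.num_hidden_layers)        # 3
print(config.hidden_size)              # 312
print(config.max_position_embeddings)  # 2048
print(config.vocab_size)               # 83828
```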
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+     "model_type": "SentenceTransformer",
+     "__version__": {
+         "sentence_transformers": "5.2.0",
+         "transformers": "4.57.6",
+         "pytorch": "2.9.1+cu128"
+     },
+     "prompts": {
+         "query": "",
+         "document": ""
+     },
+     "default_prompt_name": null,
+     "similarity_fn_name": "cosine"
+ }
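
`similarity_fn_name` is set to cosine and both prompts are empty, so queries and documents are encoded identically. A hedged sketch (not part of the commit, assuming sentence-transformers >= 3.0 and the repo id from the README) of the built-in similarity helper that reads this setting:

```python
# Sketch: with similarity_fn_name = "cosine", model.similarity() returns cosine scores.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-tiny-sts")  # repo id is an assumption
emb = model.encode(["привет мир", "здравствуй вселенная"])
print(model.similarity(emb, emb))  # 2x2 cosine-similarity matrix
```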
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+     {
+         "idx": 0,
+         "name": "0",
+         "path": "",
+         "type": "sentence_transformers.models.Transformer"
+     },
+     {
+         "idx": 1,
+         "name": "1",
+         "path": "1_Pooling",
+         "type": "sentence_transformers.models.Pooling"
+     },
+     {
+         "idx": 2,
+         "name": "2",
+         "path": "2_Normalize",
+         "type": "sentence_transformers.models.Normalize"
+     }
+ ]
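
`modules.json` defines the encoding pipeline Transformer → CLS Pooling → Normalize, which is why `util.dot_score` in the README behaves like cosine similarity. A minimal sketch (not part of the commit, repo id assumed from the README) of assembling the same three-module pipeline by hand:

```python
# Sketch: rebuild the pipeline described in modules.json.
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("sergeyzh/rubert-tiny-sts", max_seq_length=2048)
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),  # 312
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
)
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])
print(model.encode("привет мир").shape)  # (312,)
```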
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:566efbbd349b2d882a3dd03bce12c4b88a799fce2e7255dba5f0af7f4b4eb302
+ size 116451755
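
The ONNX export is stored as a Git LFS pointer (about 116 MB). A hedged sketch (not part of the commit) of running it with Hugging Face Optimum, assuming the export lives at `onnx/model.onnx` in a repo with the README's id:

```python
# Sketch: encode a sentence with the ONNX export via optimum[onnxruntime].
# Assumes: pip install optimum[onnxruntime]; repo id and file location are assumptions.
import torch
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

repo_id = "sergeyzh/rubert-tiny-sts"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(
    repo_id, subfolder="onnx", file_name="model.onnx"
)

inputs = tokenizer("привет мир", return_tensors="pt")
outputs = model(**inputs)
# Same CLS-pooling + normalization as the README's embed_bert_cls helper.
cls_embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0, :])
print(cls_embedding.shape)  # torch.Size([1, 312])
```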
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+     "max_seq_length": 2048,
+     "do_lower_case": false
+ }
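
`max_seq_length` here caps how many tokens the sentence-transformers wrapper feeds the model. A tiny sketch (not part of the commit, repo id assumed from the README) of checking it at runtime:

```python
# Sketch: the value from sentence_bert_config.json is exposed as a model attribute.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sergeyzh/rubert-tiny-sts")
print(model.max_seq_length)  # 2048
```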
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+     "cls_token": {
+         "content": "[CLS]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "mask_token": {
+         "content": "[MASK]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "pad_token": {
+         "content": "[PAD]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "sep_token": {
+         "content": "[SEP]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "unk_token": {
+         "content": "[UNK]",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+     "added_tokens_decoder": {
+         "0": {
+             "content": "[PAD]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "1": {
+             "content": "[UNK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "2": {
+             "content": "[CLS]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "3": {
+             "content": "[SEP]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         },
+         "4": {
+             "content": "[MASK]",
+             "lstrip": false,
+             "normalized": false,
+             "rstrip": false,
+             "single_word": false,
+             "special": true
+         }
+     },
+     "clean_up_tokenization_spaces": true,
+     "cls_token": "[CLS]",
+     "do_basic_tokenize": true,
+     "do_lower_case": false,
+     "extra_special_tokens": {},
+     "mask_token": "[MASK]",
+     "max_length": 512,
+     "model_max_length": 2048,
+     "never_split": null,
+     "pad_to_multiple_of": null,
+     "pad_token": "[PAD]",
+     "pad_token_type_id": 0,
+     "padding_side": "right",
+     "sep_token": "[SEP]",
+     "stride": 0,
+     "strip_accents": null,
+     "tokenize_chinese_chars": true,
+     "tokenizer_class": "BertTokenizer",
+     "truncation_side": "right",
+     "truncation_strategy": "longest_first",
+     "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff