WpythonW committed on
Commit
22e0853
·
verified ·
1 Parent(s): 35812e7

Update README.md

  1. README.md +84 -27
README.md CHANGED
@@ -3,40 +3,60 @@ language:
3
  - ru
4
  pipeline_tag: sentence-similarity
5
  tags:
 
6
  - russian
7
- - fill-mask
8
- - pretraining
9
  - embeddings
10
- - masked-lm
11
- - tiny
12
- - feature-extraction
13
- - sentence-similarity
14
  - sentence-transformers
15
- - transformers
 
16
  license: mit
17
- widget:
18
- - text: Миниатюрная модель для [MASK] разных задач.
19
  ---
20
- This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.
21
 
22
- The differences from the previous version include:
23
- - a larger vocabulary: 83828 tokens instead of 29564;
24
- - larger supported sequences: 2048 instead of 512;
25
- - sentence embeddings approximate LaBSE closer than before;
26
- - meaningful segment embeddings (tuned on the NLI task)
27
- - the model is focused only on Russian.
28
 
29
- The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.
30
 
31
- Sentence embeddings can be produced as follows:
32
 
33
  ```python
34
- # pip install transformers sentencepiece
35
  import torch
36
  from transformers import AutoTokenizer, AutoModel
37
- tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
38
- model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
39
- # model.cuda() # uncomment it if you have a GPU
40
 
41
  def embed_bert_cls(text, model, tokenizer):
42
  t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
@@ -50,11 +70,48 @@ print(embed_bert_cls('привет мир', model, tokenizer).shape)
50
  # (312,)
51
  ```
52
 
53
- Alternatively, you can use the model with `sentence_transformers`:
54
- ```Python
55
  from sentence_transformers import SentenceTransformer
56
- model = SentenceTransformer('cointegrated/rubert-tiny2')
 
57
  sentences = ["привет мир", "hello world", "здравствуй вселенная"]
58
  embeddings = model.encode(sentences)
59
- print(embeddings)
60
- ```
3
  - ru
4
  pipeline_tag: sentence-similarity
5
  tags:
6
+ - english
7
  - russian
 
 
8
  - embeddings
 
 
 
 
9
  - sentence-transformers
10
+ - vllm
11
+ - inference-optimized
12
  license: mit
13
+ base_model: cointegrated/rubert-tiny2
 
14
  ---
 
15
 
16
+ # rubert-tiny2-vllm
 
 
 
 
 
17
 
18
+ **vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.
19
 
20
+ This model produces embeddings that are **numerically equivalent** to the original's (max absolute difference ~3e-7, within float32 rounding error) while letting vLLM's optimized kernels and batching speed up inference.
21
 
22
+ ## Modifications
23
+
24
+ - **No weight changes** - uses original query/key/value weights directly
25
+ - vLLM automatically converts Q/K/V to fused qkv_proj format during loading
26
+ - Removed pretraining heads (MLM/NSP) - not needed for embeddings
27
+ - Changed architecture to `BertModel` for vLLM compatibility
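
The Q/K/V fusion mentioned in the list above can be pictured as a row-wise concatenation of the three projection matrices. The sketch below is a plain-Python illustration, not vLLM's loading code: `matvec`, `fuse_qkv`, and `split_qkv` are hypothetical helpers, and the weights are assumed to follow the PyTorch `nn.Linear` layout of (out_features, in_features).

```python
def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def fuse_qkv(w_q, w_k, w_v):
    """Stack the three projection matrices along the output dimension (rows)."""
    return w_q + w_k + w_v


def split_qkv(fused_out, q_size, k_size):
    """Slice the fused projection output back into query, key, and value."""
    q = fused_out[:q_size]
    k = fused_out[q_size:q_size + k_size]
    v = fused_out[q_size + k_size:]
    return q, k, v


# Toy 2x2 weights, chosen only for illustration
w_q = [[1.0, 2.0], [3.0, 4.0]]
w_k = [[0.0, 1.0], [1.0, 0.0]]
w_v = [[2.0, 0.0], [0.0, 2.0]]
x = [1.0, -1.0]

# One fused matrix-vector product replaces three separate ones
q, k, v = split_qkv(matvec(fuse_qkv(w_q, w_k, w_v), x), q_size=2, k_size=2)
```

Because each output row depends only on its own weight row, the fused product returns the same values as three separate projections. Biases, omitted here, would concatenate the same way.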
28
+
29
+ ## Usage
30
+
31
+ ### vLLM Server
32
+ ```bash
33
+ # IMPORTANT: use float32 so outputs match the original model to within rounding error
34
+ vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
35
+ ```
36
+
37
+ ### OpenAI-compatible API
38
+ ```python
39
+ from openai import OpenAI
40
+
41
+ client = OpenAI(
42
+ base_url="http://localhost:8000/v1",
43
+ api_key="dummy"
44
+ )
45
+
46
+ response = client.embeddings.create(
47
+ input="Привет мир",
48
+ model="WpythonW/rubert-tiny2-vllm"
49
+ )
50
+ print(response.data[0].embedding[:5])
51
+ ```
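
For several texts at once, the same endpoint accepts a list as `input`. The following is a minimal sketch, assuming a client object shaped like the `OpenAI` client above; `embed_batch` and `cosine_similarity` are hypothetical helpers, not part of any library.

```python
import math


def embed_batch(client, texts, model="WpythonW/rubert-tiny2-vllm"):
    """Request embeddings for a list of texts in one call, in input order."""
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries an `index` that refers to its position in `texts`
    items = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in items]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Example usage against a running server (assumes `client` from the block above):
# vectors = embed_batch(client, ["привет мир", "здравствуй вселенная"])
# print(cosine_similarity(vectors[0], vectors[1]))
```

Sending texts in one request leaves the batching to the server, which is where vLLM is expected to help most.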
52
+
53
+ ### Transformers
54
  ```python
 
55
  import torch
56
  from transformers import AutoTokenizer, AutoModel
57
+
58
+ tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
59
+ model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")
60
 
61
  def embed_bert_cls(text, model, tokenizer):
62
  t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
 
70
  # (312,)
71
  ```
72
 
73
+ ### Sentence Transformers
74
+ ```python
75
  from sentence_transformers import SentenceTransformer
76
+
77
+ model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
78
  sentences = ["привет мир", "hello world", "здравствуй вселенная"]
79
  embeddings = model.encode(sentences)
80
+ print(embeddings.shape)
81
+ ```
82
+
83
+ ## Validation Results
84
+
85
+ Comparison between vLLM and SentenceTransformers on identical inputs:
86
+ ```
87
+ Max embedding difference: 3.375e-7
88
+ Mean embedding difference: 1.136e-7
89
+ Cosine similarity matrices: Identical (np.allclose with default tolerances)
90
+ ```
91
+
92
+ This confirms that the two implementations agree to within float32 rounding error; the embeddings are numerically equivalent rather than bit-identical.
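
The three metrics reported above could be computed along the following lines. This is a dependency-free sketch, not the code from the conversion notebook: the function names are hypothetical, and `allclose` re-implements the element-wise criterion `|a - b| <= atol + rtol * |b|` with NumPy's documented defaults (`rtol=1e-05`, `atol=1e-08`).

```python
import math


def max_and_mean_abs_diff(a, b):
    """Max and mean absolute element-wise difference between two embedding matrices."""
    diffs = [abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return max(diffs), sum(diffs) / len(diffs)


def cosine_matrix(vectors):
    """Pairwise cosine similarities between all rows."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(x * y for x, y in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Element-wise closeness check using the same formula as np.allclose."""
    return all(
        abs(x - y) <= atol + rtol * abs(y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Made-up embeddings standing in for the two backends' outputs
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0 + 3e-7, 0.0], [0.0, 1.0 - 1e-7]]

max_diff, mean_diff = max_and_mean_abs_diff(reference, candidate)
similar = allclose(cosine_matrix(reference), cosine_matrix(candidate))
```

A tolerance-based check like this treats small nonzero differences as a pass, which is why the result is equivalence within rounding error rather than bit-identity.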
93
+
94
+ ## Conversion
95
+
96
+ Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)
97
+
98
+ **Conversion process:**
99
+ 1. Load original cointegrated/rubert-tiny2 weights
100
+ 2. Remove `bert.` prefix from weight names
101
+ 3. Remove unused heads (cls.*, bert.pooler.*)
102
+ 4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
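
The renaming and filtering steps above can be sketched on a plain dictionary of weight names. This is an illustration only, not the notebook's code: `convert_state_dict` is a hypothetical helper, the example keys follow the usual Hugging Face BERT checkpoint layout, and the heads are filtered before the prefix is stripped so that the `cls.*` and `bert.pooler.*` patterns match the original names.

```python
def convert_state_dict(state_dict):
    """Drop unused heads and strip the `bert.` prefix from the remaining names."""
    converted = {}
    for name, tensor in state_dict.items():
        # Step 3: skip the pretraining heads and the pooler
        if name.startswith("cls.") or name.startswith("bert.pooler."):
            continue
        # Step 2: remove the `bert.` prefix
        if name.startswith("bert."):
            name = name[len("bert."):]
        # Step 4: query/key/value weights pass through unchanged
        converted[name] = tensor
    return converted


# Placeholder strings stand in for real tensors
example = {
    "bert.embeddings.word_embeddings.weight": "W_emb",
    "bert.encoder.layer.0.attention.self.query.weight": "W_q",
    "bert.pooler.dense.weight": "W_pool",
    "cls.predictions.bias": "b_mlm",
}
result = convert_state_dict(example)
```

With real checkpoints, the dictionary would come from loading the original weights (step 1) and the result would be saved alongside a config that declares the `BertModel` architecture.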
103
+
104
+ Tested on Google Colab Tesla T4 with:
105
+ - vLLM 0.11.2
106
+ - Transformers 4.57.2
107
+ - PyTorch 2.9.0+cu126
108
+
109
+ ## Original Model
110
+
111
+ For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)
112
+
113
+ This vLLM version is optimized for deployment scenarios requiring:
114
+ - High throughput batch processing
115
+ - Low latency inference
116
+ - OpenAI API compatibility
117
+ - Production-grade serving infrastructure