WpythonW/rubert-tiny2-vllm

@@ -3,40 +3,60 @@ language:
 - ru
 pipeline_tag: sentence-similarity
 tags:
 - russian
-- fill-mask
-- pretraining
 - embeddings
-- masked-lm
-- tiny
-- feature-extraction
-- sentence-similarity
 - sentence-transformers
-- transformers
 license: mit
-widget:
-- text: Миниатюрная модель для [MASK] разных задач.
 ---
-This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.
-The differences from the previous version include:
-- a larger vocabulary: 83828 tokens instead of 29564;
-- larger supported sequences: 2048 instead of 512;
-- sentence embeddings approximate LaBSE closer than before;
-- meaningful segment embeddings (tuned on the NLI task)
-- the model is focused only on Russian.
-The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.
-Sentence embeddings can be produced as follows:
 ```python
-# pip install transformers sentencepiece
 import torch
 from transformers import AutoTokenizer, AutoModel
-tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
-model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
-# model.cuda()  # uncomment it if you have a GPU
 def embed_bert_cls(text, model, tokenizer):
     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
@@ -50,11 +70,48 @@ print(embed_bert_cls('привет мир', model, tokenizer).shape)
 # (312,)
 ```
-Alternatively, you can use the model with `sentence_transformers`:
-```Python
 from sentence_transformers import SentenceTransformer
-model = SentenceTransformer('cointegrated/rubert-tiny2')
 sentences = ["привет мир", "hello world", "здравствуй вселенная"]
 embeddings = model.encode(sentences)
-print(embeddings)
-```

 - ru
 pipeline_tag: sentence-similarity
 tags:
+- english
 - russian
 - embeddings
 - sentence-transformers
+- vllm
+- inference-optimized
 license: mit
+base_model: cointegrated/rubert-tiny2
 ---
+# rubert-tiny2-vllm
+**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.
+This model produces **numerically equivalent embeddings** to the original (max absolute difference ~3e-7, which is float32 rounding error) while letting vLLM's optimized kernels and batching speed up inference.
+## Modifications
+- **No weight changes** - uses original query/key/value weights directly
+- vLLM automatically converts Q/K/V to fused qkv_proj format during loading
+- Removed pretraining heads (MLM/NSP) - not needed for embeddings
+- Changed architecture to `BertModel` for vLLM compatibility
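The fusion mentioned above can be illustrated with a small NumPy sketch. This is not vLLM's loading code; it only shows the general idea that stacking separate query/key/value matrices into one projection yields the same outputs. The random weights, the 4-token input, and all variable names are assumptions made for illustration; only the width of 312 comes from this card.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 312  # embedding width reported in this card

# Hypothetical stand-ins for one attention layer's separate projection weights.
w_q, w_k, w_v = (
    rng.standard_normal((hidden, hidden)).astype(np.float32) for _ in range(3)
)
x = rng.standard_normal((4, hidden)).astype(np.float32)  # 4 token vectors

# Separate projections, as the weights are stored in the checkpoint.
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Fused projection: stack the three matrices row-wise, multiply once, then split.
w_qkv = np.concatenate([w_q, w_k, w_v], axis=0)  # shape (3 * hidden, hidden)
q_f, k_f, v_f = np.split(x @ w_qkv.T, 3, axis=1)

# The two routes agree up to float32 rounding, so a loose tolerance is used.
fused_matches = all(
    np.allclose(a, b, atol=1e-3) for a, b in [(q, q_f), (k, k_f), (v, v_f)]
)
print(fused_matches)
```

Because fusion only concatenates existing tensors, no weight values change, which is consistent with the "no weight changes" point above.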
+## Usage
+### vLLM Server
+```bash
+# IMPORTANT: use float32 so outputs match the original model to within rounding error
+vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
+```
+### OpenAI-compatible API
+```python
+from openai import OpenAI
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="dummy"
+)
+response = client.embeddings.create(
+    input="Привет мир",
+    model="WpythonW/rubert-tiny2-vllm"
+)
+print(response.data[0].embedding[:5])
+```
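For more than one text, the embeddings endpoint also accepts a list as `input`. The helper below is a minimal sketch of a batched call; `embed_batch` is a hypothetical name, and the fake client exists only so the example runs without a server. With a real deployment you would pass the `OpenAI(...)` client from the snippet above instead.

```python
from types import SimpleNamespace


def embed_batch(client, texts, model="WpythonW/rubert-tiny2-vllm"):
    """Embed several texts in one request and return vectors in input order."""
    response = client.embeddings.create(input=list(texts), model=model)
    # Each returned item carries the index of the input it belongs to.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Offline stand-in for the client (hypothetical): it returns toy 2-d vectors
# in reversed order to show that embed_batch restores the input order.
def _fake_create(input, model):
    data = [
        SimpleNamespace(index=i, embedding=[float(len(text)), 0.0])
        for i, text in enumerate(input)
    ]
    return SimpleNamespace(data=list(reversed(data)))


fake_client = SimpleNamespace(embeddings=SimpleNamespace(create=_fake_create))
vectors = embed_batch(fake_client, ["привет мир", "hello world"])
print(len(vectors), vectors[0])
```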
+### Transformers
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
+model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")
 def embed_bert_cls(text, model, tokenizer):
     t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
 # (312,)
 ```
+### Sentence Transformers
+```python
 from sentence_transformers import SentenceTransformer
+model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
 sentences = ["привет мир", "hello world", "здравствуй вселенная"]
 embeddings = model.encode(sentences)
+print(embeddings.shape)
+```
+## Validation Results
+Comparison between vLLM and SentenceTransformers on identical inputs:
+```
+Max embedding difference: 3.375e-7
+Mean embedding difference: 1.136e-7
+Cosine similarity matrices: Identical (np.allclose with default tolerances)
+```
+This confirms that the two backends agree to within float32 precision: the embeddings are numerically equivalent, though not bit-identical.
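A comparison of this kind can be sketched as follows. This is not the notebook's code; `compare_embeddings` is a hypothetical helper, and the arrays are synthetic stand-ins for the vLLM and SentenceTransformers outputs, so the printed numbers will not match the figures reported above.

```python
import numpy as np


def compare_embeddings(a, b):
    """Return max/mean absolute difference and whether cosine matrices agree."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    abs_diff = np.abs(a - b)

    def cosine_matrix(e):
        # Normalize rows, then take all pairwise dot products.
        unit = e / np.linalg.norm(e, axis=1, keepdims=True)
        return unit @ unit.T

    return {
        "max_diff": float(abs_diff.max()),
        "mean_diff": float(abs_diff.mean()),
        # np.allclose with its default tolerances, as in the results above.
        "cosine_close": bool(np.allclose(cosine_matrix(a), cosine_matrix(b))),
    }


# Synthetic data: a reference matrix and a copy with tiny float32-scale noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal((3, 312)).astype(np.float32)
noise = rng.standard_normal((3, 312)).astype(np.float32)
perturbed = reference + np.float32(1e-7) * noise

report = compare_embeddings(reference, perturbed)
print(report)
```

In a real check, `reference` and `perturbed` would be replaced by the embedding matrices returned by the two backends for the same sentences.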
+## Conversion
+Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)
+**Conversion process:**
+1. Load original cointegrated/rubert-tiny2 weights
+2. Remove `bert.` prefix from weight names
+3. Remove the unused pretraining heads and pooler (named `cls.*` and `bert.pooler.*` in the original checkpoint)
+4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
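The renaming and filtering steps above can be sketched on a plain dictionary. This is a simplified illustration, not the notebook's code: `convert_state_dict` is a hypothetical helper, the toy keys follow the usual Hugging Face BERT naming, and the values are placeholders rather than tensors. The sketch filters on the original names before stripping the prefix, so the order of steps 2 and 3 does not matter here.

```python
def convert_state_dict(state_dict):
    """Drop head/pooler entries and strip the `bert.` prefix from the rest."""
    dropped_prefixes = ("cls.", "bert.pooler.")
    converted = {}
    for name, tensor in state_dict.items():
        if name.startswith(dropped_prefixes):
            continue  # pretraining heads and pooler are not needed for embeddings
        new_name = name[len("bert."):] if name.startswith("bert.") else name
        converted[new_name] = tensor  # query/key/value entries pass through unchanged
    return converted


# Toy checkpoint with placeholder values instead of real tensors.
toy = {
    "bert.embeddings.word_embeddings.weight": "E",
    "bert.encoder.layer.0.attention.self.query.weight": "Q",
    "bert.pooler.dense.weight": "P",
    "cls.predictions.bias": "B",
}
converted = convert_state_dict(toy)
print(sorted(converted))
```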
+Tested on a Google Colab Tesla T4 GPU with:
+- vLLM 0.11.2
+- Transformers 4.57.2
+- PyTorch 2.9.0+cu126
+## Original Model
+For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)
+This vLLM version is optimized for deployment scenarios requiring:
+- High throughput batch processing
+- Low latency inference
+- OpenAI API compatibility
+- Production-grade serving infrastructure