---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---
# rubert-tiny2-vllm
**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.
This model produces embeddings that match the original to within float32 rounding error (see Validation Results), while vLLM's optimized kernels and request batching speed up inference.
## Modifications
- **No weight changes** - uses original query/key/value weights directly
- vLLM automatically converts Q/K/V to fused qkv_proj format during loading
- Removed pretraining heads (MLM/NSP) - not needed for embeddings
- Changed architecture to `BertModel` for vLLM compatibility
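To illustrate why fusing Q/K/V does not change the outputs, here is a toy sketch in plain Python. It is not vLLM's actual loading code: the matrices, sizes, and the helper `matvec` are made up for illustration, and biases are omitted. It only shows that stacking the three projection matrices and running one matrix product yields the same numbers as three separate products.

```python
# Toy illustration of a fused QKV projection: one matmul instead of three.
# All shapes and values are made up; real BERT layers also add a bias.

def matvec(weight, x):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

# Separate projections, as stored in the checkpoint (2 outputs, 3 inputs each).
w_query = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_key = [[-0.1, 0.0, 0.1], [0.2, -0.2, 0.2]]
w_value = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

x = [1.0, 2.0, 3.0]  # one hidden-state vector

# Fusion stacks the three matrices along the output dimension.
w_qkv = w_query + w_key + w_value
fused = matvec(w_qkv, x)

# Splitting the fused output recovers the three separate projections.
n = len(w_query)
q, k, v = fused[:n], fused[n:2 * n], fused[2 * n:]

separate = (matvec(w_query, x), matvec(w_key, x), matvec(w_value, x))
print((q, k, v) == separate)
```

Because each output row is computed from the same weights in the same order, the fused and separate results agree exactly in this toy case. On a GPU, a fused kernel may accumulate in a different order, which is one plausible source of differences at the float32 rounding level.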
## Usage
### vLLM Server
```bash
# IMPORTANT: Use float32 to match the original model's outputs as closely as possible
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```
### OpenAI-compatible API
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)
response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
```
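The `input` parameter of the embeddings endpoint also accepts a list of strings, in which case `response.data` holds one item per input. A minimal sketch for comparing the returned vectors follows. The helpers `cosine_similarity` and `similarity_matrix` are illustrative names, and the stand-in vectors replace real server output so the snippet runs without a server.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities for a list of vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Stand-in vectors. With a running server these would come from:
#   vectors = [item.embedding for item in response.data]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

for row in similarity_matrix(vectors):
    print([round(s, 3) for s in row])
```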
### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")
def embed_bert_cls(text, model, tokenizer):
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the [CLS] token's hidden state as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that dot products equal cosine similarities
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()
print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
```
## Validation Results
Comparison between vLLM and SentenceTransformers on identical inputs:
```
Max embedding difference: 3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: Identical (np.allclose with default tolerances)
```
This confirms that the two backends agree **to within float32 rounding error**; the outputs are not bit-for-bit identical.
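For readers who want to reproduce this kind of check, the metrics can be sketched in plain Python as below. This is an assumption about how such numbers are typically computed, not a copy of the notebook's code. The sample vectors are made up, and `allclose` here re-implements the elementwise rule used by `np.allclose` with its default tolerances (`rtol=1e-5`, `atol=1e-8`).

```python
def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def mean_abs_diff(a, b):
    """Average elementwise absolute difference between two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Same rule as np.allclose: |a - b| <= atol + rtol * |b| for every element."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Made-up values standing in for two embeddings of the same sentence.
reference = [0.125, -0.5, 0.25]
candidate = [0.125 + 3e-7, -0.5, 0.25 - 1e-7]

print("max diff: ", max_abs_diff(reference, candidate))
print("mean diff:", mean_abs_diff(reference, candidate))
print("allclose: ", allclose(reference, candidate))
```

Differences around 1e-7 on values of this magnitude pass the default tolerances, which is consistent with the results reported above.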
## Conversion
Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)
**Conversion process:**
1. Load original cointegrated/rubert-tiny2 weights
2. Remove `bert.` prefix from weight names
3. Remove unused heads (cls.*, bert.pooler.*)
4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
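The renaming and filtering steps can be sketched on a dictionary of weight names. This is a minimal illustration, not the notebook's code: `convert_keys` is a hypothetical helper, the string values stand in for real tensors, and the filter is applied to the original names before the `bert.` prefix is stripped.

```python
def convert_keys(state_dict):
    """Drop pretraining heads and the pooler, then strip the 'bert.' prefix."""
    converted = {}
    for name, tensor in state_dict.items():
        # Skip weights that are not needed for embeddings.
        if name.startswith("cls.") or name.startswith("bert.pooler."):
            continue
        # Strip the prefix so names match a bare BertModel layout.
        if name.startswith("bert."):
            name = name[len("bert."):]
        # Query/key/value weights are kept under their own names, unfused.
        converted[name] = tensor
    return converted

# Placeholder state dict; strings stand in for real tensors.
original = {
    "bert.embeddings.word_embeddings.weight": "tensor-0",
    "bert.encoder.layer.0.attention.self.query.weight": "tensor-1",
    "bert.encoder.layer.0.attention.self.key.weight": "tensor-2",
    "bert.encoder.layer.0.attention.self.value.weight": "tensor-3",
    "bert.pooler.dense.weight": "tensor-4",
    "cls.predictions.bias": "tensor-5",
    "cls.seq_relationship.weight": "tensor-6",
}

for name in convert_keys(original):
    print(name)
```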
Tested on a Google Colab Tesla T4 GPU with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126
## Original Model
For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)