---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---

# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.

This model produces **numerically identical embeddings** to the original while enabling faster inference through vLLM's optimized kernels and batching.

## Modifications

- **No weight changes** - the original query/key/value weights are used directly (verified in the sketch below)
- vLLM automatically fuses the separate Q/K/V projections into its `qkv_proj` format at load time
- Removed the pretraining heads (MLM/NSP), which are not needed for embeddings
- Changed the declared architecture to `BertModel` for vLLM compatibility
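
The "no weight changes" claim is easy to verify. A minimal sketch, assuming both repos load as plain `BertModel` weights via `AutoModel` (the converted repo ships no pooler, so transformers will warn about a freshly initialized one; it is skipped here):

```python
import torch
from transformers import AutoModel

orig = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
conv = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

ref = orig.state_dict()
for name, param in conv.named_parameters():
    if name.startswith("pooler."):
        continue  # not shipped in the converted repo
    assert torch.equal(param, ref[name]), f"mismatch at {name}"
print("all shared weights are identical")
```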

## Usage

### vLLM Server
```bash
# IMPORTANT: Use fp32 for exact numerical match with original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```
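
vLLM can also run offline, without a server. A minimal sketch, assuming the offline `LLM.embed` API available in recent vLLM releases (the `BertModel` architecture should be picked up as an embedding model automatically):

```python
from vllm import LLM

# fp32 again for an exact numerical match with the original model
llm = LLM(model="WpythonW/rubert-tiny2-vllm", dtype="float32")
outputs = llm.embed(["Привет мир"])
print(outputs[0].outputs.embedding[:5])
```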

### OpenAI-compatible API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
```

### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # tokenize and run the encoder
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # take the [CLS] token embedding and L2-normalize it
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```

### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)
```

## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:
```
Max embedding difference: 3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: Identical (np.allclose with default tolerances)
```

This confirms **numerical equivalence** within float32 precision limits.
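
A sketch of how these numbers could be reproduced, assuming a server started as in the Usage section above (exact figures will vary with hardware and library versions; the explicit normalization matches vLLM's default of returning L2-normalized embeddings):

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# reference embeddings, normalized to match the server output
st = SentenceTransformer("WpythonW/rubert-tiny2-vllm")
ref = st.encode(sentences, normalize_embeddings=True)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
resp = client.embeddings.create(input=sentences, model="WpythonW/rubert-tiny2-vllm")
out = np.array([d.embedding for d in resp.data])

print("Max embedding difference: ", np.abs(ref - out).max())
print("Mean embedding difference:", np.abs(ref - out).mean())
print("Cosine similarity matrices identical:",
      np.allclose(ref @ ref.T, out @ out.T))
```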

## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process** (sketched in code below):
1. Load original cointegrated/rubert-tiny2 weights
2. Remove `bert.` prefix from weight names
3. Remove unused heads (cls.*, bert.pooler.*)
4. Keep query/key/value weights as-is (vLLM handles fusion automatically)
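
A minimal sketch of these steps, assuming the original repo ships its weights as `pytorch_model.bin` (the linked notebook is the validated version):

```python
import torch
from huggingface_hub import hf_hub_download

# step 1: load the original checkpoint
path = hf_hub_download("cointegrated/rubert-tiny2", "pytorch_model.bin")
state = torch.load(path, map_location="cpu")

converted = {
    name.removeprefix("bert."): tensor  # step 2: strip the `bert.` prefix
    for name, tensor in state.items()
    if not name.startswith(("cls.", "bert.pooler."))  # step 3: drop unused heads
}

# step 4: query/key/value tensors are left untouched; vLLM fuses them on load
torch.save(converted, "pytorch_model.bin")
```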

Tested on a Google Colab Tesla T4 with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)