---
language:
- ru
- en
pipeline_tag: sentence-similarity
tags:
- embeddings
- sentence-transformers
- vllm
- inference-optimized
- inference
license: mit
base_model: cointegrated/rubert-tiny2
---

# rubert-tiny2-vllm

**vLLM-optimized version** of [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) for high-performance embedding inference.

This model produces **numerically equivalent embeddings** to the original (differences on the order of 1e-7; see Validation Results below) while enabling a speedup through vLLM's optimized kernels and batching.

## Modifications

- **No weight changes** - uses the original query/key/value weights directly
- vLLM automatically converts the separate Q/K/V projections to its fused qkv_proj format during loading (illustrated below)
- Removed the pretraining heads (MLM/NSP) - they are not needed for embeddings
- Changed the architecture to `BertModel` for vLLM compatibility
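
The Q/K/V fusion is purely a load-time reshuffling of the checkpoint. A minimal conceptual sketch (tensor names and the random values are illustrative, not vLLM internals; rubert-tiny2 uses a hidden size of 312):

```python
import torch

# Illustrative only: vLLM concatenates the separate query/key/value projection
# weights into a single fused qkv_proj tensor when loading the checkpoint.
# The values themselves are untouched, so the model outputs do not change.
hidden = 312  # hidden size of rubert-tiny2
q_w = torch.randn(hidden, hidden)  # ...attention.self.query.weight
k_w = torch.randn(hidden, hidden)  # ...attention.self.key.weight
v_w = torch.randn(hidden, hidden)  # ...attention.self.value.weight

qkv_w = torch.cat([q_w, k_w, v_w], dim=0)  # fused weight, shape (936, 312)
print(qkv_w.shape)
```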

## Usage

### vLLM Server
```bash
# IMPORTANT: use fp32 for an exact numerical match with the original model
vllm serve WpythonW/rubert-tiny2-vllm --dtype float32
```
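
If you would rather run vLLM in-process than as a server, recent vLLM releases also expose an offline pooling API. A rough sketch, assuming the `task="embed"` / `LLM.embed()` interface of current versions (check your vLLM version's docs for the exact argument names):

```python
from vllm import LLM

# Offline embedding inference; float32 again keeps the outputs aligned
# with the original model.
llm = LLM(model="WpythonW/rubert-tiny2-vllm", task="embed", dtype="float32")

outputs = llm.embed(["привет мир", "hello world"])
for output in outputs:
    print(len(output.outputs.embedding))  # 312
```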

### OpenAI-compatible API
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.embeddings.create(
    input="Привет мир",
    model="WpythonW/rubert-tiny2-vllm"
)
print(response.data[0].embedding[:5])
```

### Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("WpythonW/rubert-tiny2-vllm")
model = AutoModel.from_pretrained("WpythonW/rubert-tiny2-vllm")

def embed_bert_cls(text, model, tokenizer):
    # Tokenize, run the encoder, take the CLS token and L2-normalize it.
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```

### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WpythonW/rubert-tiny2-vllm')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 312)
```
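
Since the model targets sentence similarity, a typical next step is to score the sentences against each other. Continuing from the snippet above with the cosine-similarity helper from sentence-transformers:

```python
from sentence_transformers import util

# Pairwise cosine similarities between the three embeddings computed above (3x3 matrix).
print(util.cos_sim(embeddings, embeddings))
```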

## Validation Results

Comparison between vLLM and SentenceTransformers on identical inputs:
```
Max embedding difference:  3.375e-7
Mean embedding difference: 1.136e-7
Cosine similarity matrices: identical (np.allclose with default tolerances)
```

This confirms equivalence with the original model within float32 precision limits.
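
The full check lives in the notebook linked below; a minimal sketch of the comparison, assuming a vLLM server started as above on `localhost:8000` and CLS-pooled, L2-normalized embeddings on both sides:

```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Reference embeddings from the original model via SentenceTransformers.
reference = SentenceTransformer("cointegrated/rubert-tiny2").encode(
    sentences, normalize_embeddings=True
)

# Embeddings from the vLLM server started with --dtype float32.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
response = client.embeddings.create(input=sentences, model="WpythonW/rubert-tiny2-vllm")
candidate = np.array([item.embedding for item in response.data])

print("Max embedding difference:", np.abs(reference - candidate).max())
print("Cosine matrices match:", np.allclose(reference @ reference.T, candidate @ candidate.T))
```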

## Conversion

Full conversion notebook with validation: [Google Colab](https://colab.research.google.com/drive/1SS9qEayvwZU1r1khxq9tWf7iEZcxw2yW)

**Conversion process** (sketched in code below):
1. Load the original cointegrated/rubert-tiny2 weights
2. Remove the `bert.` prefix from the weight names
3. Remove the unused heads (cls.*, bert.pooler.*)
4. Keep the query/key/value weights as-is (vLLM handles the fusion automatically)
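
A hedged sketch of those steps, assuming the checkpoint is loaded with its pretraining heads attached so that the `bert.` prefix and `cls.*` keys are present (the linked notebook is the authoritative version):

```python
import torch
from transformers import AutoModelForPreTraining

# Load the original weights together with the MLM/NSP pretraining heads,
# so the state dict keys carry the "bert." prefix and the "cls.*" heads.
original = AutoModelForPreTraining.from_pretrained("cointegrated/rubert-tiny2")

converted = {}
for name, tensor in original.state_dict().items():
    # Drop the pretraining heads and the pooler; they are not used for embeddings.
    if name.startswith("cls.") or name.startswith("bert.pooler."):
        continue
    # Strip the "bert." prefix so the keys match a plain BertModel layout.
    # Query/key/value weights are copied unchanged; vLLM fuses them at load time.
    converted[name.removeprefix("bert.")] = tensor

torch.save(converted, "pytorch_model.bin")
```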

Tested on a Google Colab Tesla T4 with:
- vLLM 0.11.2
- Transformers 4.57.2
- PyTorch 2.9.0+cu126

## Original Model

For standard PyTorch/Transformers usage, see the original model: [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2)