A multilingual sentence embedding model, trained from scratch, for semantic similarity and retrieval in English, Tamil, Malayalam, and Hindi. The model is built with an emphasis on transparency, efficiency, and reproducibility.
The following metrics represent the model's performance on standard Semantic Textual Similarity (STS) benchmarks:
| Dataset | Spearman | Pearson | Samples |
|---|---|---|---|
| STSb | 40.54% | 39.64% | 1,379 |
| SICK-R | 51.69% | 51.78% | 9,927 |
| STS12 | 42.59% | 36.88% | 3,108 |
| STS13 | 37.76% | 37.99% | 1,500 |
| STS14 | 36.99% | 36.55% | 3,750 |
| STS15 | 52.29% | 53.14% | 3,000 |
| STS16 | 50.35% | 49.56% | 1,186 |
| Average | 44.60% | 43.65% | — |
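
For readers unfamiliar with how such scores are usually obtained, here is a minimal sketch of the two correlation measures. It assumes the common STS protocol, in which the model's cosine similarity for each sentence pair is correlated with the human gold score and the result is reported as a percentage. The source does not state which evaluation script was used, and the `gold` and `predicted` values below are hypothetical, chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human similarity scores and model cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]

print(f"Spearman: {spearman(gold, predicted) * 100:.2f}%")
print(f"Pearson:  {pearson(gold, predicted) * 100:.2f}%")
```

Spearman depends only on the ordering of the predictions, so it is the more common headline number for embedding models, while Pearson also rewards a linear relationship with the gold scores.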
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_id = "the-entropy-space-ai/klein-embedding-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "Your sentence here"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over the token dimension to get a single 480-dimensional vector
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # torch.Size([1, 480])
```
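
When several sentences of different lengths are encoded in one batch with `padding=True`, a plain mean over the token dimension also averages the padding positions. The sketch below shows one common way around this: mean pooling weighted by the attention mask, followed by a cosine similarity between the pooled vectors. Whether the model was trained with exactly this pooling is not stated here, so treat it as an assumption. The helper name `masked_mean_pool` and the toy tensors are hypothetical; the toy tensors stand in for `outputs.last_hidden_state` and `inputs["attention_mask"]` so the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: (batch, tokens, hidden) float tensor
    attention_mask:    (batch, tokens) tensor of 1s (real) and 0s (padding)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp so an all-padding row cannot cause a division by zero.
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# Toy batch: two "sentences", three token slots, hidden size 2.
# The last token of the first sentence is padding and must be ignored.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
])
mask = torch.tensor([
    [1, 1, 0],
    [1, 1, 1],
])

pooled = masked_mean_pool(hidden, mask)
similarity = F.cosine_similarity(pooled[0:1], pooled[1:2])

print(pooled)
print(similarity.item())
```

With the real model, the same function would be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.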