Klein Embedding v1
A scratch-trained multilingual sentence embedding model (supporting English, Tamil, Malayalam, and Hindi) focused on semantic similarity and retrieval. This model is built with a focus on transparency, efficiency, and reproducibility.
Model Overview
- Model Type: RoBERTa-style encoder
- Parameter Count: 35.05 million
- Hidden Dimension: 480
- Layers: 7
- Max Sequence Length: 128 tokens
- Vocabulary Size: 32,000
- License: Apache-2.0
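These figures can be cross-checked against the published config. The snippet below is a minimal sanity check, assuming the repository ships a standard RoBERTa-style `config.json`; the attribute names are the usual Transformers ones, not anything specific to this model.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("the-entropy-space-ai/klein-embedding-v1")

print(config.hidden_size)        # expected: 480
print(config.num_hidden_layers)  # expected: 7
print(config.vocab_size)         # expected: 32000
# RoBERTa-style configs typically store max_position_embeddings as the
# maximum sequence length plus two special position offsets.
print(config.max_position_embeddings)
```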
Evaluation Results
The following metrics represent the model's performance on standard Semantic Textual Similarity (STS) benchmarks:
| Dataset | Spearman | Pearson | Samples |
|---|---|---|---|
| STSb | 40.54% | 39.64% | 1,379 |
| SICK-R | 51.69% | 51.78% | 9,927 |
| STS12 | 42.59% | 36.88% | 3,108 |
| STS13 | 37.76% | 37.99% | 1,500 |
| STS14 | 36.99% | 36.55% | 3,750 |
| STS15 | 52.29% | 53.14% | 3,000 |
| STS16 | 50.35% | 49.56% | 1,186 |
| Average | 44.60% | 43.65% | — |
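The exact evaluation harness is not reproduced here; the sketch below shows the standard way STS scores of this kind are computed, assuming cosine similarity over mean-pooled embeddings scored against gold labels with Spearman and Pearson correlation (reported as percentages). The sentence pairs and gold scores are placeholders for illustration, not benchmark data.

```python
import torch
from scipy.stats import spearmanr, pearsonr
from transformers import AutoModel, AutoTokenizer

model_id = "the-entropy-space-ai/klein-embedding-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

# Placeholder pairs with gold similarity scores on the usual 0-5 STS scale.
pairs = [
    ("A man is playing a guitar.", "A person plays a guitar.", 4.8),
    ("Two dogs run across a field.", "Dogs are running outdoors.", 4.2),
    ("A dog runs in the park.", "The stock market fell sharply.", 0.2),
]

def embed(sentences):
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Mask-aware mean pooling so padded positions do not dilute the average.
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

emb1 = embed([s1 for s1, _, _ in pairs])
emb2 = embed([s2 for _, s2, _ in pairs])
preds = torch.nn.functional.cosine_similarity(emb1, emb2).tolist()
gold = [score for _, _, score in pairs]

rho, _ = spearmanr(preds, gold)
r, _ = pearsonr(preds, gold)
print(f"Spearman: {rho * 100:.2f}  Pearson: {r * 100:.2f}")
```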
Technical Methodology
- Architecture: Optimized RoBERTa-base with a reduced hidden dimension (480) and depth (7 layers) for high-speed inference without sacrificing semantic depth.
- Tokenizer: Custom BPE tokenizer trained from scratch on balanced monolingual English and Indic corpora.
- Training Objective: Contrastive learning using in-batch negatives and an alignment loss to map similar sentences to a shared vector space (a sketch of the in-batch-negatives objective follows this list).
- Format: Distributed in Safetensors for secure and fast loading.
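The repository's exact loss code is not reproduced here, and the alignment term is omitted; the sketch below illustrates the standard InfoNCE-style objective with in-batch negatives that the training bullet refers to, where every other positive in the batch acts as a negative for a given anchor. The function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """Illustrative InfoNCE-style loss with in-batch negatives.

    anchor_emb, positive_emb: [batch, dim] embeddings of paired sentences.
    Each anchor is pulled toward its own positive and pushed away from
    every other positive in the batch.
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)

    # Cosine similarity matrix: entry (i, j) compares anchor i with positive j.
    logits = anchor @ positive.T / temperature

    # The diagonal holds the true pairs, so the target class for row i is i.
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)
```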
Quick Start (Usage)
Using Transformers
```python
from transformers import AutoModel, AutoTokenizer
import torch

model_id = "the-entropy-space-ai/klein-embedding-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "Your sentence here"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling to get a single 480-dimension vector
embeddings = outputs.last_hidden_state.mean(dim=1)
print(embeddings.shape)  # torch.Size([1, 480])
```
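For retrieval, the same embeddings can rank a corpus against a query. The snippet below continues from the example above (it assumes `model`, `tokenizer`, and `torch` are already loaded) and uses placeholder sentences; the mask-aware mean pooling is a suggested refinement for padded batches, not necessarily the model's canonical pooling recipe.

```python
import torch.nn.functional as F

# Placeholder corpus and query for illustration.
corpus = [
    "The cat sat on the mat.",
    "Stock prices rose after the earnings report.",
    "A kitten is resting on a rug.",
]
query = "Where is the cat sleeping?"

def encode(sentences):
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padded positions
    return F.normalize((hidden * mask).sum(dim=1) / mask.sum(dim=1), dim=-1)

scores = encode([query]) @ encode(corpus).T  # cosine similarity (vectors are normalized)
best = scores.squeeze(0).argmax().item()
print(corpus[best])
```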
Project Philosophy
- Efficiency First: At 133 MB, this model is designed to run on standard CPUs with very low latency (78 ms per sentence).
- Full Control: Every step, from tokenizer normalization to the final contrastive loss, is documented and reproducible.
- Transparent Limitations: This model is optimized for sentence similarity and retrieval, not for text generation or general-purpose LLM tasks.