How to use alphaedge-ai/multilingual-e5-base-pms-32768 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("alphaedge-ai/multilingual-e5-base-pms-32768")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

This model is a 60.0% smaller version of intfloat/multilingual-e5-base, optimized for a single target language (pms, per the model name) by pruning the vocabulary to 32768 tokens.
Total vocabulary size: 32768 tokens (reduced from 250002)
Tokenizer type: Unigram
Training samples per language: 200000 texts
Dataset: Lumberjackk/fineweb-2-trimming
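The figures above can be turned into a rough estimate of how many embedding weights the pruning removes. This is a minimal sketch: the vocabulary sizes come from the card, while the hidden size of 768 is an assumption based on the XLM-RoBERTa-base architecture of the base model, and `embedding_params` is a hypothetical helper name.

```python
# Vocabulary sizes as stated in the model card.
ORIGINAL_VOCAB = 250_002
PRUNED_VOCAB = 32_768

# Assumption: hidden size of the XLM-RoBERTa-base architecture behind
# intfloat/multilingual-e5-base (not stated in the card).
HIDDEN_SIZE = 768


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in a vocab_size x hidden_size token embedding matrix."""
    return vocab_size * hidden_size


removed_tokens = ORIGINAL_VOCAB - PRUNED_VOCAB
removed_params = embedding_params(ORIGINAL_VOCAB) - embedding_params(PRUNED_VOCAB)

print(f"tokens removed: {removed_tokens}")
print(f"embedding weights removed: {removed_params}")
print(f"share of vocabulary removed: {removed_tokens / ORIGINAL_VOCAB:.1%}")
```

Under that assumption the pruning drops about 167 million embedding weights. If the base model has roughly 278 million parameters (a figure not given in the card), that is in line with the stated 60.0% reduction.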
This pruned model should perform similarly to the original model on the target language while using much less memory. However, it may perform poorly on other languages, because tokens rarely used in the target language were removed from the vocabulary.
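One way to gauge how much the pruned vocabulary hurts on a given text is to count how often the tokenizer falls back to the unknown token. The sketch below is illustrative only: `unk_rate` and `unk_rate_for_texts` are hypothetical helper names, and it assumes the pruned tokenizer defines an unknown token (if `unk_token_id` is `None`, the rate is always 0 and says nothing).

```python
from typing import Iterable, List


def unk_rate(token_ids: Iterable[int], unk_id: int) -> float:
    """Fraction of token ids equal to the unknown-token id (0.0 for empty input)."""
    ids = list(token_ids)
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)


def unk_rate_for_texts(texts: List[str], model_name: str) -> float:
    """Tokenize texts with the given model's tokenizer and return the unknown-token rate."""
    # Imported lazily so unk_rate() stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    ids: List[int] = []
    for text in texts:
        # Skip special tokens so only content tokens are counted.
        ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
    return unk_rate(ids, tokenizer.unk_token_id)
```

Calling `unk_rate_for_texts` with the model id from this card on a sample in the target language and on a sample in another language gives a quick comparison. A noticeably higher rate on the second sample suggests the pruned vocabulary does not cover that language well.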
You can use this model with the Transformers library:
from transformers import AutoModel, AutoTokenizer
model_name = "Lumberjackk/multilingual-e5-base-pms-32768"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
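The snippet above loads the model and tokenizer but stops before producing embeddings. The following sketch shows one way to get sentence vectors from the raw Transformers output, assuming the pruned model keeps the mean-pooling convention of the E5 family. The helper names (`average_pool`, `l2_normalize`, `embed`) and the `max_length=512` limit are illustrative choices, not taken from the card.

```python
import numpy as np


def average_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean of the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def embed(texts, model_name):
    """Encode texts with a Hugging Face model and return normalized mean-pooled vectors."""
    # Imported lazily so the pooling helpers work without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    pooled = average_pool(outputs.last_hidden_state.numpy(),
                          batch["attention_mask"].numpy())
    return l2_normalize(pooled)
```

With normalized vectors, `embed(texts, model_name) @ embed(texts, model_name).T` yields a cosine-similarity matrix comparable to the one from the sentence-transformers example. The base E5 models are usually fed inputs prefixed with "query: " or "passage: "; whether that still matters for this pruned variant is not stated in the card.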
Base model: intfloat/multilingual-e5-base