embeddinggemma-pt-br-48k

A Portuguese-only text-embedding model, vocabulary-trimmed from google/embeddinggemma-300m to a 48k-token vocabulary. It has ~144M parameters and an MTEB(por) mean_16 of 0.7098, which is 97.8% of the full model's score at 47% of its size. No training was involved: only the token embedding matrix was sliced, and the transformer encoder and pooling/Dense heads are identical to the base model. Produced with 🛠️ embedding-vocab-trimmer.
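As a rough illustration of what "slicing the token embedding matrix" means, here is a minimal toy sketch in plain Python. It is not the code of embedding-vocab-trimmer; the function name `trim_embedding_matrix`, the 6-token vocabulary, and the numbers are all hypothetical. It only shows the general idea: rows for kept tokens are copied unchanged, and token ids are remapped to the smaller table.

```python
def trim_embedding_matrix(matrix, keep_ids):
    """Keep only the rows listed in keep_ids.

    Returns the smaller matrix and a mapping from old token id to new token id.
    (Hypothetical helper for illustration, not the real tool's API.)
    """
    kept = sorted(set(keep_ids))
    trimmed = [matrix[i] for i in kept]  # rows are reused as-is, not retrained
    old_to_new = {old: new for new, old in enumerate(kept)}
    return trimmed, old_to_new


# Toy vocabulary: 6 tokens, 3-dimensional embeddings (made-up values).
full = [[float(10 * t + d) for d in range(3)] for t in range(6)]

# Pretend only tokens 0, 2 and 5 occur in Portuguese text.
trimmed, old_to_new = trim_embedding_matrix(full, keep_ids=[0, 2, 5])

print(len(full), "rows ->", len(trimmed), "rows")
print("id remapping:", old_to_new)
```

Because every kept row is identical to the original, a kept token produces the same input vector before and after trimming. Any quality loss would come from text that needed the removed tokens and must now be split into other pieces.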

Part of the embeddinggemma-pt-br family. The 64k variant is the recommended sweet spot:

| Model | Params | MTEB(por) mean_16 | % of full |
| --- | --- | --- | --- |
| google/embeddinggemma-300m | ~308M | 0.7257 | 100% |
| embeddinggemma-pt-br-128k | ~207M | 0.7192 | 99.1% |
| embeddinggemma-pt-br | ~157M | 0.7172 | 98.8% |
| embeddinggemma-pt-br-48k | ~144M | 0.7098 | 97.8% |
| embeddinggemma-pt-br-32k | ~131M | 0.6881 | 94.8% |
| embeddinggemma-pt-br-24k | ~125M | 0.6895 | 95.0% |
| embeddinggemma-pt-br-16k | ~119M | 0.6520 | 89.8% |
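The "% of full" column and the "47% of its size" figure can be recomputed from the table with a few lines of Python. This is only a sanity-check sketch: the parameter counts are the rounded values from the table (in millions), so the size ratios are approximate.

```python
# (model, approx. params in millions, MTEB(por) mean_16), copied from the table.
rows = [
    ("google/embeddinggemma-300m", 308, 0.7257),
    ("embeddinggemma-pt-br-128k", 207, 0.7192),
    ("embeddinggemma-pt-br", 157, 0.7172),
    ("embeddinggemma-pt-br-48k", 144, 0.7098),
    ("embeddinggemma-pt-br-32k", 131, 0.6881),
    ("embeddinggemma-pt-br-24k", 125, 0.6895),
    ("embeddinggemma-pt-br-16k", 119, 0.6520),
]

# The first row is the untrimmed base model, used as the reference.
_, full_params, full_score = rows[0]

# For each model: (score as % of full, size as % of full).
summary = {
    name: (round(100 * score / full_score, 1), round(100 * params / full_params))
    for name, params, score in rows
}

for name, (pct_score, pct_size) in summary.items():
    print(f"{name:32s} score {pct_score:5.1f}%  size {pct_size:3d}%")
```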

Usage

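from sentence_transformers import SentenceTransformer

# Load the trimmed model from the Hugging Face Hub.
model = SentenceTransformer("tardellirs/embeddinggemma-pt-br-48k")
# normalize_embeddings=True returns unit-length vectors, so dot product equals cosine similarity.
emb = model.encode(["O Brasil é um país tropical da América do Sul."], normalize_embeddings=True)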

Uses EmbeddingGemma's task prompts. For retrieval, prepend "task: search result | query: " to each query and "title: none | text: " to each document.
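The retrieval prompts can be applied with two small string helpers. This is a minimal sketch: `format_query` and `format_document` are hypothetical names introduced here, and the prefix strings are the ones quoted in this card.

```python
QUERY_PREFIX = "task: search result | query: "


def format_query(text: str) -> str:
    """Prefix a search query with the retrieval task prompt."""
    return QUERY_PREFIX + text


def format_document(text: str, title: str = "none") -> str:
    """Prefix a document with its title prompt ("none" when there is no title)."""
    return f"title: {title} | text: {text}"


query = format_query("Qual é a capital do Brasil?")
doc = format_document("Brasília é a capital federal do Brasil.")

print(query)
print(doc)
```

The formatted strings can then be passed to `model.encode(...)` as in the Usage snippet. If the model's sentence-transformers configuration already applies these prompts, adding them by hand would duplicate them, so it is worth checking the configuration first.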

Scope

A compression of Google's EmbeddingGemma to Portuguese. It is a deployment/efficiency artifact, and its data provenance is that of the base model. Vocabulary trimming compresses; it does not enhance. Derived under the Gemma license.
