embeddinggemma-pt-br

A Portuguese-only text-embedding model, vocabulary-trimmed from google/embeddinggemma-300m. It keeps the 64k most frequent Portuguese tokens and drops the rest of the multilingual vocabulary — shrinking the model from ~308M to ~157M parameters (≈ half) while keeping 98.8% of the full model's MTEB(por) score. No training was involved — only the token embedding matrix was sliced; the transformer encoder and the pooling/Dense heads are identical to the base model.

Produced with the open-source tool 🛠️ embedding-vocab-trimmer.

Results — MTEB(por)

The full embeddinggemma-pt-br family (vocabulary size sweep; this model = the 64k sweet spot):

model	params	MTEB(por) `mean_16`	% of full
google/embeddinggemma-300m	~308M	0.7257	100%
embeddinggemma-pt-br-128k	~207M	0.7192	99.1%
embeddinggemma-pt-br (64k, this)	~157M	0.7172	98.8%
embeddinggemma-pt-br-48k	~144M	0.7098	97.8%
embeddinggemma-pt-br-32k	~131M	0.6881	94.8%
embeddinggemma-pt-br-24k	~125M	0.6895	95.0%
embeddinggemma-pt-br-16k	~119M	0.6520	89.8%

mean_16 = the 16 headline MTEB(por) tasks (classification, pair-classification, STS, clustering, retrieval, reranking). Full curve + charts in the tool's results.

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tardellirs/embeddinggemma-pt-br")
emb = model.encode(
    ["O Brasil é um país tropical da América do Sul.",
     "Operações matemáticas envolvem soma e multiplicação."],
    normalize_embeddings=True,
)
print(emb.shape)  # (2, 768)

This model inherits EmbeddingGemma's task-specific prompts. For retrieval, prepend the prompts from the base model card, e.g. task: search result | query: for queries and title: none | text: for documents. Supports Matryoshka output dims (768/512/256/128) via the base model's Dense heads.

How it was made

Mine Portuguese token frequencies → keep top-64k + functional specials → re-index the vocabulary → filter the BPE merges (keep A B → AB only if A, B and AB all survive) → slice embed_tokens.weight → reattach the original encoder + pooling/Dense. Reproduce:

python trim_vocab.py --model google/embeddinggemma-300m --corpus-config por \
    --vocab-size 64000 --output ./embeddinggemma-pt-br

Scope, transparency & limitations

This is a compression of Google's EmbeddingGemma. Its quality, training data and behaviour come entirely from the base model — this is a deployment / efficiency artifact. Data provenance is exactly that of EmbeddingGemma.
Vocabulary trimming compresses; it does not enhance. Fine-tuning, pruning and distillation from a larger teacher were all tried and reduced MTEB(por) — the base model is at its representational ceiling.
Portuguese only — other languages fall back to byte-level tokenization and will be poor.

License

Derived from EmbeddingGemma and distributed under the Gemma Terms of Use. The trimming tool is Apache-2.0.

Benchmark: MTEB(por) — leaderboard.

Downloads last month: 688

Safetensors

Model size

0.2B params

Tensor type

F32

Model tree for tardellirs/embeddinggemma-pt-br

Base model

google/embeddinggemma-300m

Finetuned

(265)

this model

Collection including tardellirs/embeddinggemma-pt-br

EmbeddingGemma PT-BR (vocabulary-trimmed)

Collection

Portuguese EmbeddingGemma-300M, vocab-trimmed 16k–128k. Training-free; 64k ≈ full quality at half the params. • 6 items • Updated Jun 7