braza-embedding-ptbr-v1

The first PT-BR embedding model fine-tuned on 474,000 real Brazilian B2B companies.

Fine-tuned from ibm-granite/granite-embedding-97m-multilingual-r2 using a Karpathy-style autoresearch loop — 36 autonomous training iterations on an RTX 5090, each proposing and self-validating its own strategy. 35 out of 36 iterations improved the model (97% acceptance rate).

Built at TAMZ β€” a Brazilian B2B sales intelligence platform that identifies, enriches, and delivers company leads ready for outreach. The training data comes directly from TAMZ's enrichment pipeline over 32M Brazilian companies from the Receita Federal.


MTEB Benchmark Results (PT-BR)

| Task | Baseline (granite-97m) | braza-embedding-ptbr-v1 | Δ |
|------|------------------------|-------------------------|---|
| ASSIN2 STS | 0.6655 | 0.8082 | +21.5% 🟢 |
| SICK-BR STS | 0.7062 | 0.8513 | +20.6% 🟢 |
| ASSIN2 RTE | 0.7254 | 0.8408 | +16.0% 🟢 |
| MTEB Primary (weighted avg) | 0.5826 | 0.6596 | +13.2% |

STS (Semantic Textual Similarity) scores represent Spearman correlation with human judgements — the gold standard for measuring semantic embedding quality in Portuguese.
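To make the metric concrete, here is a toy illustration (invented scores, not the benchmark data) of how STS evaluation works: Spearman rank correlation between model cosine similarities and human similarity judgements, assuming `scipy` is available.

```python
# Toy STS scoring sketch: Spearman correlation between human judgements
# and model similarities. The values below are illustrative only.
from scipy.stats import spearmanr

human_scores = [1.0, 2.5, 3.0, 4.2, 5.0]       # annotator similarity (1-5 scale)
model_scores = [0.10, 0.32, 0.41, 0.77, 0.92]  # model cosine similarities

rho, _pvalue = spearmanr(human_scores, model_scores)
print(rho)  # 1.0 here, since the model ranks the pairs exactly like the annotators
```

Because Spearman compares rankings rather than raw values, an embedding model can score well even if its cosine similarities live on a different scale than the 1-5 human annotations.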


Why This Model Exists

There are virtually no PT-BR embedding models trained on real business data. Most multilingual models learn from Wikipedia and web crawls — they don't understand the vocabulary of Brazilian B2B:

  • "distribuidora de insumos industriais no Nordeste"
  • "SaaS de gestão condominial para construtoras"
  • "startup de fintech para MEI e pequenas empresas"
  • "consultorias de RH para empresas de médio porte em SP"

This model was built to fix that.


Model Details

| Property | Value |
|----------|-------|
| Base model | ibm-granite/granite-embedding-97m-multilingual-r2 |
| Architecture | ModernBERT |
| Parameters | 97M |
| Embedding dimension | 384 (Matryoshka: supports 256 / 128 / 64) |
| Max sequence length | 512 tokens |
| Language | Portuguese (PT-BR) |
| Domain | B2B, SaaS, tech, startups, commercial services, Brazilian companies |
| Training hardware | NVIDIA RTX 5090 32GB |
| Training time | ~8 hours (overnight) |

Training: The Autoresearch Loop

Instead of static hyperparameter tuning, this model was trained using an autonomous iterative loop inspired by Karpathy's autoresearch approach:

for each of 36 iterations:
  1. Generate 150 company → query pairs using Qwen3.5-35B
  2. Stage 1: MatryoshkaLoss(MultipleNegativesRankingLoss, scale=30)
              → 180s on RTX 5090
              → trains 384/256/128/64 dims simultaneously
  3. Stage 2: CoSENTLoss on ASSIN2 + SICK-BR real annotated scores
              → 60s calibration on human-labelled PT-BR pairs
  4. Evaluate: ASSIN2-STS + SICK-BR-STS
  5. KEEP if improved → checkpoint saved
     DISCARD if regressed → restore previous checkpoint
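The KEEP/DISCARD controller above is a greedy hill-climb over checkpoints. A minimal self-contained sketch of that control flow (with toy stand-ins for training and evaluation — the real loop runs the two training stages and the MTEB tasks where the lambdas are):

```python
import random

def autoresearch_loop(train_step, evaluate, init_ckpt, iterations=36):
    """Greedy KEEP/DISCARD loop: a candidate checkpoint survives only if
    it improves the benchmark score; otherwise the previous best is kept."""
    best_ckpt = init_ckpt
    best_score = evaluate(best_ckpt)
    for _ in range(iterations):
        candidate = train_step(best_ckpt)   # Stage 1 + Stage 2 in the real run
        score = evaluate(candidate)         # ASSIN2-STS + SICK-BR-STS in the real run
        if score > best_score:              # KEEP -> checkpoint saved
            best_ckpt, best_score = candidate, score
        # else DISCARD -> next iteration restarts from the previous best

    return best_ckpt, best_score

# Toy stand-ins: a "checkpoint" is just its own score; "training" adds noise.
rng = random.Random(0)
final_ckpt, final_score = autoresearch_loop(
    train_step=lambda ckpt: ckpt + rng.uniform(-0.02, 0.02),
    evaluate=lambda ckpt: ckpt,
    init_ckpt=0.7322,
)
print(round(final_score, 4))  # never below the starting 0.7322
```

The design guarantees monotonic benchmark scores: a regression can waste an iteration's compute, but it can never degrade the shipped model.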

Score progression over 36 iterations:

iter  0: 0.7322
iter  5: 0.7589
iter 10: 0.7831
iter 20: 0.8105
iter 30: 0.8270
iter 32: 0.8297  ← best checkpoint (this model)

Training Data: 474K Brazilian B2B Companies

The synthetic training data was generated from a proprietary dataset of 474,000 enriched Brazilian companies sourced from the Receita Federal (Brazilian tax authority) and enriched with web scraping + AI analysis.

Each company record contains:

  • Business description and value proposition (AI-generated from web scraping)
  • CNAE sector classification, company size, location, revenue range
  • Tech stack, AI adoption score, target market (B2B/B2C/B2B2C)
  • LinkedIn data, founding year, company stage

For each company, Qwen3.5-35B APEX generated 6 diverse semantic queries — simulating real B2B buyers and sales reps searching for vendors. Queries vary by:

  • Geographic filters ("empresa de TI em Curitiba", "fornecedor no Nordeste")
  • Company maturity ("startup de 1 ano", "empresa estabelecida há mais de 10 anos")
  • Decision-maker role ("CTO buscando", "gestor de compras procurando")
  • Tech signals ("empresa que usa Salesforce e HubSpot")
  • Revenue/size filters ("faturamento entre R$5M e R$20M")
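One way such diversity can be enforced is to template one LLM prompt per angle. The sketch below is hypothetical (the real pipeline's prompts and field names are proprietary); it only illustrates turning one company record into six angle-specific generation requests.

```python
# Hypothetical query-generation scaffolding; field names and prompt wording
# are illustrative, not the production pipeline's.
QUERY_ANGLES = [
    "geographic filter (city or region)",
    "company maturity (young startup vs. established firm)",
    "decision-maker role (e.g. CTO, procurement manager)",
    "tech-stack signal (tools the company uses)",
    "revenue or size filter",
    "plain semantic description of the offering",
]

def build_prompt(company: dict, angle: str) -> str:
    """Build one LLM prompt asking for a buyer-style PT-BR search query
    that should retrieve this company, emphasising a specific angle."""
    return (
        "Write one short Portuguese (PT-BR) search query a B2B buyer might "
        f"type to find this company, emphasising the {angle}.\n"
        f"Description: {company['description']}\n"
        f"Sector (CNAE): {company['cnae']} | Location: {company['city']}"
    )

company = {
    "description": "Plataforma SaaS de CRM para times de vendas B2B",
    "cnae": "6203-1/00",
    "city": "São Paulo",
}
prompts = [build_prompt(company, a) for a in QUERY_ANGLES]
print(len(prompts))  # 6 -> one query per angle, matching the 6 queries/company
```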

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("calneymgp/braza-embedding-ptbr-v1")

# B2B semantic search
query = "empresa de tecnologia SaaS B2B em São Paulo"
companies = [
    "Desenvolvemos software de gestão para pequenas empresas brasileiras",
    "Restaurante especializado em culinária italiana no centro de SP",
    "Plataforma de CRM para times de vendas corporativas B2B",
    "Distribuidora de equipamentos agrícolas no interior do Paraná",
]
]

query_emb = model.encode(query)
company_embs = model.encode(companies)
scores = model.similarity(query_emb, company_embs)

# Results: [0.81, 0.12, 0.79, 0.18] — retrieves tech SaaS companies correctly

Matryoshka — flexible embedding dimensions

# Full quality (384-dim) — default
embeddings = model.encode(texts)

# 2x faster search, ~1% quality loss (256-dim)
embeddings = model.encode(texts, truncate_dim=256)

# 9x faster search, ~3% quality loss (128-dim) — great for large-scale search
embeddings = model.encode(texts, truncate_dim=128)
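If you have already stored full 384-dim vectors and want to shrink them yourself rather than re-encoding, the standard Matryoshka recipe is slice-then-renormalize, so dot products become cosine similarities again. A minimal NumPy sketch (toy random vectors stand in for model outputs):

```python
import numpy as np

def truncate_and_normalize(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions, then L2-normalize each
    row so dot products are again valid cosine similarities."""
    cut = embs[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Toy 384-dim vectors standing in for stored model embeddings.
rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 384)).astype(np.float32)

small = truncate_and_normalize(embs, 128)
print(small.shape)  # (4, 128)
```

Renormalizing matters: truncation shortens each vector by a different amount, so skipping it silently distorts similarity rankings.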

Best For

✅ Semantic search over Brazilian business data
✅ B2B lead discovery and company matching (e.g. TAMZ)
✅ Company similarity, clustering, deduplication
✅ PT-BR RAG pipelines with business documents
✅ Memory systems for Portuguese AI agents
✅ Sales intelligence and market research (Brazil)


Infrastructure

| Component | Details |
|-----------|---------|
| Data generation | Qwen3.5-35B APEX GGUF on RTX 3090 (llama.cpp) |
| Training | NVIDIA RTX 5090 32GB (PyTorch + sentence-transformers 5.x) |
| Evaluation | MTEB 2.x — official PT-BR benchmark tasks |
| Monitoring | Discord notifications + HTTP dashboard per iteration |
| Loop controller | Custom autoresearch script (KEEP/DISCARD per iteration) |

License

Apache 2.0 — same as the base IBM Granite Embedding model.


Citation

@misc{gerhardt2026braza,
  author    = {Calney Gerhardt},
  title     = {braza-embedding-ptbr-v1: Portuguese B2B Embedding via Autoresearch},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/calneymgp/braza-embedding-ptbr-v1},
  note      = {Fine-tuned from IBM Granite 97M on 474K Brazilian B2B companies
               using 36-iteration autonomous training loop (RTX 5090).
               Built at TAMZ (https://tamz.ai)}
}