NeoBERTugues

A Portuguese ModernBERT encoder model fine-tuned from mmBERT with a custom Portuguese-optimized tokenizer.

Model Description

NeoBERTugues is a Portuguese language model based on the ModernBERT architecture. It was created by:

  1. Training a Portuguese tokenizer with a novel technique that blends higher-quality data (in this case Portuguese Wikipedia) with the actual training data (in this case ClassiCC-PT) in order to reduce fertility and improve learning.
  2. Upcycling mmBERT's vocabulary embeddings: weights for overlapping tokens are preserved, tokens not relevant to Portuguese are discarded, and new tokens are randomly initialized from the observed distribution of the pretrained weights.
  3. Fine-tuning the model on Portuguese text corpora.
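The corpus-blending idea in step 1 can be sketched as follows. This is a minimal illustration, not the released training script: the mixing ratio, the toy sentences, and the `blend` helper are all assumptions, and the resulting stream would then be fed to a subword trainer (e.g. `tokenizers.trainers.BpeTrainer` with `vocab_size=32_000`).

```python
def blend(high_quality, in_domain, ratio=1):
    """Interleave `ratio` higher-quality lines (e.g. Wikipedia) per
    in-domain line (e.g. ClassiCC-PT). Hypothetical helper: the actual
    mixing ratio used for NeoBERTugues is not documented."""
    hq = iter(high_quality)
    for line in in_domain:
        for _ in range(ratio):
            nxt = next(hq, None)
            if nxt is not None:
                yield nxt
        yield line

# Toy stand-ins for the two corpora
wiki = ["O Brasil é um país.", "Lisboa é a capital de Portugal."]
classicc = ["avaliação do produto: ótimo!", "texto da web em português."]

# Blended stream for tokenizer training: vocabulary is learned on the
# cleaner text while still covering the distribution the model will see.
mixed = list(blend(wiki, classicc))
```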

Key Features

  • Architecture: ModernBERT (22 layers, 768 hidden size, 12 attention heads)
  • Max Sequence Length: 8,192 tokens
  • Vocabulary Size: 32,000 tokens
  • Training Data: ClassiCC-PT corpus

Tokenizer Merging Approach

A key innovation in NeoBERTugues is the tokenizer merging strategy:

  1. Overlapping Tokens: For tokens that exist in both the new NeoBERTugues and mmBERT vocabulary, the original mmBERT embedding weights are preserved.

  2. Discarded Tokens: Tokens present in the mmBERT tokenizer but absent from the NeoBERTugues tokenizer were discarded.

  3. New Token Initialization: For NeoBERTugues-specific tokens not present in mmBERT, embeddings are initialized using a statistical distribution matching approach:

    • The statistics (mean, standard deviation, skewness) of mmBERT's embedding weights are computed
    • A skewed normal distribution is fitted to the decoder bias values of overlapping tokens
    • New token embeddings are sampled from distributions that match these statistics
    • This ensures new tokens start in a statistically similar space to existing tokens, enabling faster convergence

This approach allows the model to leverage mmBERT's multilingual knowledge while adding Portuguese-specific vocabulary.
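The upcycling steps above can be sketched with toy vocabularies. This is illustrative only: the token names and dimensions are invented, and the skew-normal fit described in the card is approximated here by a simple moment-matched normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabularies: the real ones have ~32k entries each
old_vocab = {"casa": 0, "de": 1, "##xyz": 2}               # mmBERT side
new_vocab = {"casa": 0, "de": 1, "país": 2, "saudade": 3}  # NeoBERTugues side
old_emb = rng.normal(size=(len(old_vocab), 8))             # pretrained embeddings

new_emb = np.empty((len(new_vocab), old_emb.shape[1]))

# Step 1: copy pretrained rows for overlapping tokens; tokens that exist
# only in mmBERT (like "##xyz") are simply dropped.
for tok in new_vocab:
    if tok in old_vocab:
        new_emb[new_vocab[tok]] = old_emb[old_vocab[tok]]

# Step 3: initialize NeoBERTugues-specific tokens from the observed
# distribution of the pretrained weights, so they start in a
# statistically similar region of embedding space.
mu, sigma = old_emb.mean(), old_emb.std()
for tok in new_vocab:
    if tok not in old_vocab:
        new_emb[new_vocab[tok]] = rng.normal(mu, sigma, size=old_emb.shape[1])
```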

Benchmark Results

Unfortunately, we were not able to run extensive benchmarks due to a limited budget, so we are not claiming state-of-the-art results. Take these limited results with a grain of salt.

Evaluation using logistic regression probing on Portuguese NLP benchmarks (F1 Macro scores):

| Model        | IMDB   | Olist  | BoolQ  | MRPC (pt-BR) | RTE    | Average |
|--------------|--------|--------|--------|--------------|--------|---------|
| NeoBERTugues | 0.8753 | 0.9295 | 0.6177 | 0.5793       | 0.5655 | 0.7135  |
| BERTugues    | 0.8686 | 0.9313 | 0.6063 | 0.5705       | 0.5875 | 0.7128  |
| BERTimbau    | 0.8678 | 0.9322 | 0.6117 | 0.5944       | 0.5545 | 0.7121  |
| ModBERTBr    | 0.8443 | 0.9232 | 0.6162 | 0.5913       | 0.5736 | 0.7097  |

Benchmark Descriptions:

  • IMDB: Portuguese IMDB movie review sentiment classification
  • Olist: Brazilian e-commerce (Olist) review sentiment analysis
  • BoolQ: Boolean question answering (ExtraGLUE pt-BR)
  • MRPC: Paraphrase detection (ExtraGLUE pt-BR)
  • RTE: Textual entailment (ExtraGLUE pt-BR)
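The probing protocol can be sketched as: freeze the encoder, extract sentence embeddings, fit a logistic regression on top, and report macro-F1. The sketch below uses random 768-dimensional features in place of real NeoBERTugues embeddings, and the split sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-ins for frozen-encoder sentence embeddings and binary labels
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

# Linear probe: only this classifier is trained, the encoder stays frozen
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"F1 Macro: {macro_f1:.4f}")
```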

Usage

Masked Language Modeling

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("lorenzocc/NeoBERTugues")
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Example usage
result = fill_mask("O Brasil é um país <mask>.")
for r in result:
    print(f"{r['token_str']}: {r['score']:.1%}")

Feature Extraction

import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lorenzocc/NeoBERTugues")
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues")

text = "NeoBERTugues é um modelo de linguagem para português."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Get sentence embedding (mean pooling)
attention_mask = inputs["attention_mask"]
last_hidden = outputs.last_hidden_state
masked_hidden = last_hidden * attention_mask.unsqueeze(-1)
sentence_embedding = masked_hidden.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)

Model Architecture

| Parameter                       | Value |
|---------------------------------|-------|
| Hidden size                     | 768   |
| Intermediate size               | 1,152 |
| Number of attention heads       | 12    |
| Number of hidden layers         | 22    |
| Max position embeddings         | 8,192 |
| Vocabulary size                 | 32,000 |
| Global attention every N layers | 3     |
| Local attention window          | 128   |
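The alternating attention pattern in the table can be laid out layer by layer. Following the ModernBERT convention, layer 0 is assumed to be global here; the exact offset for this checkpoint is an assumption, not something the card states.

```python
num_layers, global_every, local_window = 22, 3, 128

# "global" layers attend over the full 8,192-token context; the rest use a
# sliding local window of 128 tokens.
layer_attention = [
    "global" if i % global_every == 0 else f"local({local_window})"
    for i in range(num_layers)
]
```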

Training Details

Training Data

  • ClassiCC-PT: A large-scale Portuguese corpus (~97M samples)
  • Portuguese Wikipedia: Used for tokenizer training

Training Procedure

  • Masking Rate: 30%
  • Sequence Length: 1,024 tokens
  • Optimizer: AdamW (beta1=0.90, beta2=0.98, epsilon=1e-6)
  • Learning Rate: 5e-5
  • Weight Decay: 8e-5
  • Warmup Ratio: 6%
  • LR Schedule: Warmup-Stable-Decay (1-sqrt decay)
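The Warmup-Stable-Decay schedule with 1-sqrt decay can be sketched as below. The length of the decay phase is not documented, so `decay_ratio=0.1` is an assumption; the warmup ratio and peak learning rate come from the table above.

```python
import math

def wsd_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.06, decay_ratio=0.1):
    """Warmup-Stable-Decay schedule with 1-sqrt decay.
    `decay_ratio` is a hypothetical value; it is not given in the card."""
    warmup_steps = int(total_steps * warmup_ratio)
    decay_start = int(total_steps * (1 - decay_ratio))
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)   # linear warmup
    if step < decay_start:
        return peak_lr                                 # stable plateau
    # 1-sqrt decay: lr = peak * (1 - sqrt(progress through the decay phase))
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return peak_lr * (1 - math.sqrt(progress))
```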

Citation

If you use this model, please cite:

@misc{cesconetto2026neobertugues,
  author = {Cesconetto, Lorenzo},
  title = {NeoBERTugues: A Portuguese ModernBERT Model},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/lorenzocc/NeoBERTugues}
}

Acknowledgments

Special thanks to CloudWalk for making this possible through their AI Residency Program.

  • Developed by: Lorenzo Cesconetto
  • Funded by: CloudWalk, Inc.
  • Base Model: mmBERT by JHU-CLSP

License

Apache 2.0
