---
library_name: transformers
license: apache-2.0
language:
- pt
tags:
- modernbert
- portuguese
- bert
- encoder
- mlm
- masked-language-model
base_model: jhu-clsp/mmBERT-base
datasets:
- ClassiCC-Corpus/ClassiCC-PT
- wikimedia/wikipedia
pipeline_tag: fill-mask
---
# NeoBERTugues
A Portuguese ModernBERT encoder model fine-tuned from [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) with a custom Portuguese-optimized tokenizer.
## Model Description
NeoBERTugues is a Portuguese language model based on the ModernBERT architecture. It was created by:
1. Training a Portuguese tokenizer with a novel technique that blends higher-quality data (in this case Portuguese Wikipedia) with the actual training data (in this case ClassiCC-PT) in order to reduce fertility and improve learning.
2. Upcycling mmBERT's vocabulary embeddings for overlapping tokens, discarding tokens that were not relevant to Portuguese, and randomly initializing the new tokens from the observed distribution of weights.
3. Fine-tuning the model on Portuguese text corpora.
### Key Features
- **Architecture**: ModernBERT (22 layers, 768 hidden size, 12 attention heads)
- **Max Sequence Length**: 8,192 tokens
- **Vocabulary Size**: 32,000 tokens
- **Training Data**: ClassiCC-PT corpus
## Tokenizer Merging Approach
A key innovation in NeoBERTugues is the tokenizer merging strategy:
1. **Overlapping Tokens**: For tokens that exist in both the NeoBERTugues and mmBERT vocabularies, the original mmBERT embedding weights are preserved.
2. **Old Tokens**: Tokens that exist in the mmBERT tokenizer but not in the NeoBERTugues tokenizer are discarded.
3. **New Token Initialization**: For NeoBERTugues-specific tokens not present in mmBERT, embeddings are initialized using a statistical distribution matching approach:
- The embedding statistics (mean, standard deviation, skewness) of mmBERT's embeddings are computed
- A skewed normal distribution is fitted to the decoder bias values of overlapping tokens
- New token embeddings are sampled from distributions that match these statistics
- This ensures new tokens start in a statistically similar space to existing tokens, enabling faster convergence
This approach allows the model to leverage mmBERT's multilingual knowledge while adding Portuguese-specific vocabulary.
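The merging procedure above can be sketched as follows. This is a minimal illustration, not the released conversion script; in particular, fitting one skew-normal distribution per embedding dimension (rather than only to the decoder bias, as the card describes) is an assumption made here so the example is self-contained:

```python
import numpy as np
from scipy import stats

def upcycle_embeddings(old_emb, old_vocab, new_vocab, rng=None):
    """Build a new embedding matrix from an existing one.

    Overlapping tokens keep their original vectors; new tokens are
    sampled from a skew-normal distribution fitted (per dimension,
    an assumption for this sketch) to the overlapping rows.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    dim = old_emb.shape[1]
    new_emb = np.empty((len(new_vocab), dim), dtype=old_emb.dtype)

    # Tokens present in both vocabularies, with their new-vocab indices.
    overlap = [(tok, idx) for tok, idx in new_vocab.items() if tok in old_vocab]
    overlap_rows = old_emb[[old_vocab[tok] for tok, _ in overlap]]

    # Sample every row from a fitted skew-normal, matching the
    # mean / std / skewness of the observed embeddings.
    for d in range(dim):
        a, loc, scale = stats.skewnorm.fit(overlap_rows[:, d])
        new_emb[:, d] = stats.skewnorm.rvs(
            a, loc=loc, scale=scale, size=len(new_vocab), random_state=rng
        )

    # Overwrite overlapping tokens with their preserved mmBERT vectors.
    for tok, new_idx in overlap:
        new_emb[new_idx] = old_emb[old_vocab[tok]]
    return new_emb
```

Tokens dropped from the old vocabulary simply never get copied, which covers the "old tokens are discarded" step.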
## Benchmark Results
Unfortunately, we were not able to run extensive benchmarks due to a limited budget, so we do not claim state-of-the-art results. Take these limited results with a grain of salt.
Evaluation using logistic regression probing on Portuguese NLP benchmarks (F1 Macro scores):
| Model | IMDB | Olist | BoolQ | MRPC (pt-BR) | RTE | Average |
| ---------------- | ---------- | ---------- | ---------- | ------------ | ---------- | ---------- |
| **NeoBERTugues** | **0.8753** | 0.9295 | **0.6177** | 0.5793 | 0.5655 | **0.7135** |
| BERTugues | 0.8686 | 0.9313 | 0.6063 | 0.5705 | **0.5875** | 0.7128 |
| BERTimbau | 0.8678 | **0.9322** | 0.6117 | **0.5944** | 0.5545 | 0.7121 |
| ModBERTBr | 0.8443 | 0.9232 | 0.6162 | 0.5913 | 0.5736 | 0.7097 |
**Benchmark Descriptions:**
- **IMDB**: Portuguese IMDB movie review sentiment classification
- **Olist**: Brazilian e-commerce (Olist) review sentiment analysis
- **BoolQ**: Boolean question answering (ExtraGLUE pt-BR)
- **MRPC**: Paraphrase detection (ExtraGLUE pt-BR)
- **RTE**: Textual entailment (ExtraGLUE pt-BR)
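"Logistic regression probing" here means the encoder stays frozen and only a linear classifier is trained on its sentence embeddings. A minimal sketch of that evaluation loop, assuming precomputed mean-pooled embeddings and scikit-learn defaults (the exact probe hyperparameters used for the table are not stated in the card):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def probe_f1(train_emb, train_y, test_emb, test_y):
    """Fit a linear probe on frozen sentence embeddings; return F1 Macro."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_y)
    preds = clf.predict(test_emb)
    return f1_score(test_y, preds, average="macro")
```

Because the encoder is never fine-tuned, scores reflect the quality of the raw representations rather than task-specific adaptation.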
## Usage
### Masked Language Modeling
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("lorenzocc/NeoBERTugues")
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues")

# Create fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Example usage
result = fill_mask("O Brasil é um país <mask>.")
for r in result:
    print(f"{r['token_str']}: {r['score']:.1%}")
```
### Feature Extraction
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lorenzocc/NeoBERTugues")
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues")

text = "NeoBERTugues é um modelo de linguagem para português."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# Get sentence embedding (mean pooling over non-padding tokens)
attention_mask = inputs["attention_mask"]
last_hidden = outputs.last_hidden_state
masked_hidden = last_hidden * attention_mask.unsqueeze(-1)
sentence_embedding = masked_hidden.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
```
## Model Architecture
| Parameter | Value |
| ------------------------------- | ------ |
| Hidden size | 768 |
| Intermediate size | 1,152 |
| Number of attention heads | 12 |
| Number of hidden layers | 22 |
| Max position embeddings | 8,192 |
| Vocabulary size | 32,000 |
| Global attention every N layers | 3 |
| Local attention window | 128 |
## Training Details
### Training Data
- **ClassiCC-PT**: A large-scale Portuguese corpus (~97M samples)
- **Portuguese Wikipedia**: Used for tokenizer training
### Training Procedure
- **Masking Rate**: 30%
- **Sequence Length**: 1,024 tokens
- **Optimizer**: AdamW (beta1=0.90, beta2=0.98, epsilon=1e-6)
- **Learning Rate**: 5e-5
- **Weight Decay**: 8e-5
- **Warmup Ratio**: 6%
- **LR Schedule**: Warmup-Stable-Decay (1-sqrt decay)
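A Warmup-Stable-Decay schedule with a 1-sqrt decay tail holds the learning rate at its peak for most of training, then drops it as `1 - sqrt(t)` over a final decay phase. A minimal sketch under stated assumptions: the 10% decay fraction is a placeholder (the card specifies the 6% warmup but not the decay length):

```python
import math

def wsd_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.06, decay_ratio=0.1):
    """Warmup-Stable-Decay schedule with 1-sqrt decay.

    Linear warmup for warmup_ratio of training, constant at peak_lr,
    then peak_lr * (1 - sqrt(t)) where t runs from 0 to 1 across the
    final decay_ratio of steps. decay_ratio is an assumed value.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    decay_steps = int(total_steps * decay_ratio)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        return peak_lr
    t = (step - decay_start) / max(1, decay_steps)
    return peak_lr * (1.0 - math.sqrt(t))
```

Unlike cosine decay, the stable phase lets training be extended (or stopped early and decayed) without committing to a total step count up front.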
## Citation
If you use this model, please cite:
```bibtex
@misc{cesconetto2026neobertugues,
author = {Cesconetto, Lorenzo},
title = {NeoBERTugues: A Portuguese ModernBERT Model},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/lorenzocc/NeoBERTugues}
}
```
## Acknowledgments
Special thanks to [CloudWalk](https://www.cloudwalk.io) for making this possible through their AI Residency Program.
- **Developed by**: Lorenzo Cesconetto
- **Funded by**: CloudWalk, Inc.
- **Base Model**: [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) by JHU-CLSP
## License
Apache 2.0