|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- pt |
|
|
tags: |
|
|
- modernbert |
|
|
- portuguese |
|
|
- bert |
|
|
- encoder |
|
|
- mlm |
|
|
- masked-language-model |
|
|
base_model: jhu-clsp/mmBERT-base |
|
|
datasets: |
|
|
- ClassiCC-Corpus/ClassiCC-PT |
|
|
- wikimedia/wikipedia |
|
|
pipeline_tag: fill-mask |
|
|
--- |
|
|
|
|
|
# NeoBERTugues |
|
|
|
|
|
A Portuguese ModernBERT encoder model fine-tuned from [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) with a custom Portuguese-optimized tokenizer. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
NeoBERTugues is a Portuguese language model based on the ModernBERT architecture. It was created by: |
|
|
|
|
|
1. Training a Portuguese tokenizer with a novel technique that blends higher-quality data (in this case, Portuguese Wikipedia) with the actual training data (in this case, ClassiCC-PT) to reduce fertility and improve learning.
|
|
2. Upcycling mmBERT's vocabulary embeddings for overlapping tokens, discarding tokens not relevant to Portuguese, and randomly initializing the new tokens from the observed distribution of the pretrained weights.
|
|
3. Fine-tuning the model on Portuguese text corpora. |
|
|
|
|
|
### Key Features |
|
|
|
|
|
- **Architecture**: ModernBERT (22 layers, 768 hidden size, 12 attention heads) |
|
|
- **Max Sequence Length**: 8,192 tokens |
|
|
- **Vocabulary Size**: 32,000 tokens |
|
|
- **Training Data**: ClassiCC-PT corpus |
|
|
|
|
|
## Tokenizer Merging Approach |
|
|
|
|
|
A key innovation in NeoBERTugues is the tokenizer merging strategy: |
|
|
|
|
|
1. **Overlapping Tokens**: For tokens that exist in both the NeoBERTugues and mmBERT vocabularies, the original mmBERT embedding weights are preserved.
|
|
|
|
|
2. **Discarded Tokens**: Tokens that exist in the mmBERT vocabulary but not in the NeoBERTugues vocabulary are discarded.
|
|
|
|
|
3. **New Token Initialization**: For NeoBERTugues-specific tokens not present in mmBERT, embeddings are initialized using a statistical distribution matching approach: |
|
|
- The statistics (mean, standard deviation, skewness) of mmBERT's embeddings are computed
|
|
- A skewed normal distribution is fitted to the decoder bias values of overlapping tokens |
|
|
- New token embeddings are sampled from distributions that match these statistics |
|
|
- This ensures new tokens start in a statistically similar space to existing tokens, enabling faster convergence |
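The merging strategy above can be sketched in a few lines of NumPy. This is a minimal illustration with toy vocabularies and sizes, using a plain normal distribution in place of the fitted skew-normal; all names and dimensions here are hypothetical, not the model's actual vocabularies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real vocabularies and embedding matrix
# (hypothetical sizes; the real vocabularies are much larger).
old_vocab = {"the": 0, "casa": 1, "de": 2, "##ing": 3}   # mmBERT side
new_vocab = {"casa": 0, "de": 1, "saudade": 2}            # NeoBERTugues side
hidden = 8
old_emb = rng.normal(0.0, 0.02, size=(len(old_vocab), hidden))

# Statistics of the pretrained embeddings guide initialization of unseen tokens.
mu, sigma = old_emb.mean(), old_emb.std()

new_emb = np.empty((len(new_vocab), hidden))
for tok, i in new_vocab.items():
    if tok in old_vocab:
        # Overlapping token: copy the pretrained embedding row.
        new_emb[i] = old_emb[old_vocab[tok]]
    else:
        # New token: sample from a distribution matching the old statistics
        # (a plain normal here; the actual method fits a skew-normal).
        new_emb[i] = rng.normal(mu, sigma, size=hidden)
```

Tokens present only in mmBERT (like `##ing` above) simply never get copied, which implements the discarding step.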
|
|
|
|
|
This approach allows the model to leverage mmBERT's multilingual knowledge while adding Portuguese-specific vocabulary. |
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
Due to a limited budget, we were unable to run extensive benchmarks, so we do not claim state-of-the-art results. Take these limited results with a grain of salt.
|
|
|
|
|
Evaluation using logistic regression probing on Portuguese NLP benchmarks (F1 Macro scores): |
|
|
|
|
|
| Model | IMDB | Olist | BoolQ | MRPC (pt-BR) | RTE | Average | |
|
|
| ---------------- | ---------- | ---------- | ---------- | ------------ | ---------- | ---------- | |
|
|
| **NeoBERTugues** | **0.8753** | 0.9295 | **0.6177** | 0.5793 | 0.5655 | **0.7135** | |
|
|
| BERTugues | 0.8686 | 0.9313 | 0.6063 | 0.5705 | **0.5875** | 0.7128 | |
|
|
| BERTimbau | 0.8678 | **0.9322** | 0.6117 | **0.5944** | 0.5545 | 0.7121 | |
|
|
| ModBERTBr | 0.8443 | 0.9232 | 0.6162 | 0.5913 | 0.5736 | 0.7097 | |
|
|
|
|
|
**Benchmark Descriptions:** |
|
|
|
|
|
- **IMDB**: Portuguese IMDB movie review sentiment classification |
|
|
- **Olist**: Brazilian e-commerce (Olist) review sentiment analysis |
|
|
- **BoolQ**: Boolean question answering (ExtraGLUE pt-BR) |
|
|
- **MRPC**: Paraphrase detection (ExtraGLUE pt-BR) |
|
|
- **RTE**: Textual entailment (ExtraGLUE pt-BR) |
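The probing setup can be sketched as follows, assuming scikit-learn and precomputed frozen-encoder sentence embeddings. Random data stands in for real task embeddings and labels, so the score itself is meaningless; only the procedure is illustrated:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder embeddings (768-d) and binary task labels.
X_train = rng.normal(size=(200, 768))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(50, 768))
y_test = rng.integers(0, 2, size=50)

# The encoder stays frozen; only this linear classifier is trained.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"F1 Macro: {macro_f1:.4f}")
```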
|
|
|
|
|
## Usage |
|
|
|
|
|
### Masked Language Modeling |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline |
|
|
|
|
|
# Load model and tokenizer |
|
|
model = AutoModelForMaskedLM.from_pretrained("lorenzocc/NeoBERTugues") |
|
|
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues") |
|
|
|
|
|
# Create fill-mask pipeline |
|
|
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer) |
|
|
|
|
|
# Example usage |
|
|
result = fill_mask("O Brasil é um país <mask>.") |
|
|
for r in result: |
|
|
print(f"{r['token_str']}: {r['score']:.1%}") |
|
|
``` |
|
|
|
|
|
### Feature Extraction |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
model = AutoModel.from_pretrained("lorenzocc/NeoBERTugues") |
|
|
tokenizer = AutoTokenizer.from_pretrained("lorenzocc/NeoBERTugues") |
|
|
|
|
|
text = "NeoBERTugues é um modelo de linguagem para português."
|
|
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True) |
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
|
|
|
# Get sentence embedding (mean pooling) |
|
|
attention_mask = inputs["attention_mask"] |
|
|
last_hidden = outputs.last_hidden_state |
|
|
masked_hidden = last_hidden * attention_mask.unsqueeze(-1) |
|
|
sentence_embedding = masked_hidden.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True) |
|
|
``` |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
| Parameter | Value | |
|
|
| ------------------------------- | ------ | |
|
|
| Hidden size | 768 | |
|
|
| Intermediate size | 1,152 | |
|
|
| Number of attention heads | 12 | |
|
|
| Number of hidden layers | 22 | |
|
|
| Max position embeddings | 8,192 | |
|
|
| Vocabulary size | 32,000 | |
|
|
| Global attention every N layers | 3 | |
|
|
| Local attention window | 128 | |
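The table above maps onto a ModernBERT configuration roughly as follows. This is a sketch, and the argument names assume the `transformers` `ModernBertConfig` API; in practice the pretrained config is loaded from the hub rather than built by hand:

```python
from transformers import ModernBertConfig

# The architecture table expressed as a ModernBERT config (sketch).
config = ModernBertConfig(
    hidden_size=768,
    intermediate_size=1152,
    num_attention_heads=12,
    num_hidden_layers=22,
    max_position_embeddings=8192,
    vocab_size=32000,
    global_attn_every_n_layers=3,
    local_attention=128,
)
```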
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **ClassiCC-PT**: A large-scale Portuguese corpus (~97M samples) |
|
|
- **Portuguese Wikipedia**: Used for tokenizer training |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
- **Masking Rate**: 30% |
|
|
- **Sequence Length**: 1,024 tokens |
|
|
- **Optimizer**: AdamW (beta1=0.90, beta2=0.98, epsilon=1e-6) |
|
|
- **Learning Rate**: 5e-5 |
|
|
- **Weight Decay**: 8e-5 |
|
|
- **Warmup Ratio**: 6% |
|
|
- **LR Schedule**: Warmup-Stable-Decay (1-sqrt decay) |
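The schedule can be sketched as a function of the step count. The warmup ratio and peak learning rate come from the settings above; the fraction of steps spent in the decay phase is a hypothetical choice for illustration:

```python
import math

def wsd_lr(step, total_steps, warmup_ratio=0.06, decay_ratio=0.1, peak_lr=5e-5):
    """Warmup-Stable-Decay schedule with a 1-sqrt decay tail.

    warmup_ratio and peak_lr match the training settings above;
    decay_ratio is a hypothetical value chosen for illustration.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    decay_steps = int(total_steps * decay_ratio)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(warmup_steps, 1)
    if step < decay_start:                        # stable plateau
        return peak_lr
    frac = (step - decay_start) / max(decay_steps, 1)
    return peak_lr * (1.0 - math.sqrt(frac))      # 1-sqrt decay to zero

# Example: LR at a few points of a 100k-step run
for s in (0, 3_000, 50_000, 100_000):
    print(s, wsd_lr(s, 100_000))
```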
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{cesconetto2026neobertugues, |
|
|
author = {Cesconetto, Lorenzo}, |
|
|
title = {NeoBERTugues: A Portuguese ModernBERT Model}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face}, |
|
|
url = {https://huggingface.co/lorenzocc/NeoBERTugues} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Special thanks to [CloudWalk](https://www.cloudwalk.io) for making this possible through their AI Residency Program. |
|
|
|
|
|
- **Developed by**: Lorenzo Cesconetto |
|
|
- **Funded by**: CloudWalk, Inc. |
|
|
- **Base Model**: [mmBERT](https://huggingface.co/jhu-clsp/mmBERT-base) by JHU-CLSP |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|