---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5609
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5147
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.5010
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.5777
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8726
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6980
---
# moBERTo
> **Paper name:** This model is referred to as **`moBERTo-SWM-8k (PT tok.)`** in the
> moBERTo paper. It is the **best-performing variant** of the moBERTo family, achieving
> the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks
> and the best PLUE-PT score.
`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base),
obtained through continued pretraining on a curated 12-billion-token Portuguese corpus
(60B training tokens, 5 epochs) followed by a long-context post-training phase at
8,192-token context.
It combines four adaptation strategies:
1. **Continued pretraining** from the original ModernBERT-base checkpoint, preserving
the long-context capabilities learned during the original 2T-token English pretraining.
2. **Portuguese tokenizer** with vocabulary optimized for Portuguese text.
3. **Subword Matching (SWM) embedding transfer**, which initializes each new Portuguese
token's embedding as a combination of the original ModernBERT subword embeddings,
keeping the model close to its pretrained representation space.
4. **Long-context post-training** at 8,192 tokens for an additional 10B tokens.
The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.
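Strategy 3 can be illustrated with a toy sketch. The card only says that each new embedding is "a combination" of original subword embeddings, so the greedy longest-match segmentation, the plain mean, and the zero-vector fallback used here are assumptions, and the function names are hypothetical rather than the authors' implementation.

```python
from typing import Dict, List

def segment(token: str, old_vocab: Dict[str, int]) -> List[str]:
    """Greedy longest-match split of `token` into pieces found in `old_vocab`."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end > start and token[start:end] not in old_vocab:
            end -= 1
        if end == start:  # no known piece covers this character
            return []
        pieces.append(token[start:end])
        start = end
    return pieces

def init_embedding(token: str, old_vocab: Dict[str, int],
                   old_emb: List[List[float]], fallback: List[float]) -> List[float]:
    """Mean of the old subword embeddings; `fallback` if the token cannot be segmented."""
    pieces = segment(token, old_vocab)
    if not pieces:
        return list(fallback)
    rows = [old_emb[old_vocab[p]] for p in pieces]
    dim = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

# Toy 2-dimensional example: the new token "palavras" is covered by two old pieces.
old_vocab = {"pala": 0, "vras": 1, "p": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
new_vec = init_embedding("palavras", old_vocab, old_emb, fallback=[0.0, 0.0])
print(new_vec)  # [0.5, 0.5]
```

Because the new vector is built only from existing rows, it stays inside the span of the pretrained embedding space, which is the property the card attributes to SWM.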
---
## Model Details
| Attribute | Value |
|------------------------|------------------------------------------------------|
| Architecture | ModernBERT (encoder-only) |
| Base checkpoint | `answerdotai/ModernBERT-base` |
| Parameters | ~150M |
| Max context length | 8,192 tokens |
| Tokenizer | Portuguese (custom vocabulary) |
| Embedding init | Subword Matching (SWM) transfer |
| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
| Long-context post-tr. | 10B tokens at 8,192-token context |
| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework | Composer |
| Precision | bfloat16 |
| License | Apache 2.0 |
---
## Quick Start
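```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo")

long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```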
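Since the card's pipeline tag is `fill-mask`, masked-token prediction is the most direct way to sanity-check the checkpoint. The sketch below is illustrative: the helper names and the example sentence are hypothetical, and the mask token is read from the tokenizer because the card does not state which string the Portuguese tokenizer uses.

```python
from typing import Dict, List

def format_predictions(preds: List[Dict]) -> List[str]:
    """Turn fill-mask pipeline output dicts into 'token (score)' strings."""
    return [f"{p['token_str'].strip()} ({p['score']:.3f})" for p in preds]

def predict_mask(template: str, top_k: int = 5) -> List[str]:
    """Fill the `{mask}` placeholder in `template` using Tropic-AI/moBERTo.

    transformers is imported lazily so that `format_predictions` can be
    used (and tested) without the library or the model weights.
    """
    from transformers import pipeline

    pipe = pipeline("fill-mask", model="Tropic-AI/moBERTo")
    # Read the mask token from the tokenizer instead of hard-coding it.
    text = template.format(mask=pipe.tokenizer.mask_token)
    return format_predictions(pipe(text, top_k=top_k))

# Example call (downloads the model on first use):
# print(predict_mask("A capital do Brasil é {mask}."))
```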
### Recommended for downstream tasks
This model is best used as a backbone for fine-tuning on:
- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
- **Long-document retrieval** (up to 8,192 tokens)
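For the reranking use case, a cross-encoder scores each (query, passage) pair jointly and sorts passages by that score. The sketch below keeps the ranking logic separate from the scorer so it can be run without a model. `make_cross_encoder_scorer` assumes a checkpoint that has already been fine-tuned from `Tropic-AI/moBERTo` with a single-logit relevance head; that checkpoint path is hypothetical, since this repository ships only the masked-LM weights.

```python
from typing import Callable, List, Tuple

def rerank(query: str, passages: List[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort passages by descending score."""
    scored = [(p, score_fn(query, p)) for p in passages]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def make_cross_encoder_scorer(checkpoint: str) -> Callable[[str, str], float]:
    """Build a scorer from a fine-tuned sequence-classification checkpoint.

    `checkpoint` is a hypothetical path to a model fine-tuned from
    Tropic-AI/moBERTo with one output logit (relevance).
    """
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
    model.eval()

    def score(query: str, passage: str) -> float:
        inputs = tokenizer(query, passage, truncation=True, max_length=8192,
                           return_tensors="pt")
        with torch.no_grad():
            return model(**inputs).logits.squeeze().item()

    return score

# Toy stand-in scorer (word overlap) so the ranking logic runs without a model.
def overlap(query: str, passage: str) -> float:
    return float(len(set(query.lower().split()) & set(passage.lower().split())))

ranked = rerank("capital do Brasil",
                ["O Brasil tem muitas praias.", "Brasília é a capital do Brasil."],
                overlap)
print(ranked[0][0])  # Brasília é a capital do Brasil.
```

Swapping `overlap` for `make_cross_encoder_scorer("path/to/finetuned-reranker")` would give the model-backed version under the assumptions above.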
---
## Evaluation Results
All metrics are reported on Portuguese benchmarks. Best results are in **bold**;
second-best are <ins>underlined</ins>.
### Information Retrieval (Reranking, nDCG@10)
Cross-encoder reranking, fine-tuned on mMARCO-PT triples.
| Model | QUATI | mMARCO | Robust04 | **Avg.** |
|-----------------------------|------------------|------------------|------------------|------------------|
| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
| moBERTo-orig-tokenizer | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
| moBERTo-1k | <ins>0.5410</ins>| **0.5169** | <ins>0.4782</ins>| <ins>0.5120</ins>|
| **moBERTo (this model)** | **0.5609** | <ins>0.5147</ins>| **0.5010** | **0.5255** |
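For readers less familiar with the metric, nDCG@10 divides the discounted cumulative gain of the top 10 results by that of an ideal ordering. The sketch below uses the linear-gain form, rel / log2(rank + 1). The card does not say which gain variant or evaluation tool produced the numbers above, so treat this as an illustration of the metric rather than the exact evaluation code. As a simplification, the ideal ranking is built only from the documents in the ranked list.

```python
import math
from typing import List

def dcg(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # enumerate() is 0-based, so rank + 2 gives log2(position + 1) for 1-based positions.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances: List[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([1, 0, 1]))  # relevant documents at ranks 1 and 3: below 1.0
print(ndcg([1, 1, 0]))  # relevant documents at ranks 1 and 2: 1.0
```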
### Long-Context Retrieval (MLDR, nDCG@10)
| Model | 512 | 2,048 | 4,096 | 8,192 |
|-----------------------------|------------------|------------------|------------------|------------------|
| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
| moBERTo-orig-tokenizer-1k | **0.5834** | <ins>0.5909</ins>| **0.6286** | **0.6166** |
| moBERTo-orig-tokenizer | 0.5674 | **0.6025** | 0.5876 | <ins>0.6140</ins>|
| moBERTo-1k | 0.5466 | 0.4791 | 0.5714 | 0.5857 |
| **moBERTo (this model)** | <ins>0.5827</ins>| 0.5606 | <ins>0.5905</ins>| 0.5777 |
### Classification (F1)
- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection
| Model | Docs | Educ. | **Avg.** |
|-----------------------------|------------------|------------------|------------------|
| BERT-base | 0.8700 | 0.5690 | 0.7195 |
| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
| Qwen3-0.6B-base | **0.9120** | 0.6289 | 0.7705 |
| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
| moBERTo-orig-tokenizer | 0.8962 | 0.6035 | 0.7499 |
| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
| **moBERTo (this model)** | 0.9039 | <ins>0.6394</ins>| <ins>0.7717</ins>|
| NeoBERT-PT | 0.9030 | **0.6428** | **0.7729** |
| Qwen3-0.6B-PT | <ins>0.9070</ins>| 0.6311 | 0.7691 |
### NLU and NER (F1)
| Model | PLUE-PT | LeNER-Br | GLUE (English) |
|-----------------------------|------------------|------------------|------------------|
| BERT-base | 0.6423 | 0.8500 | <ins>0.7815</ins>|
| BERTimbau-base | 0.6800 | **0.9040** | 0.6772 |
| ModernBERT-base | 0.6420 | 0.8240 | **0.8301** |
| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
| moBERTo-orig-tokenizer | 0.6910 | 0.8587 | 0.7724 |
| moBERTo-1k | <ins>0.6959</ins>| 0.8710 | 0.7128 |
| **moBERTo (this model)** | **0.6980** | 0.8726 | 0.7354 |
| NeoBERT-PT | 0.6842 | <ins>0.8840</ins>| 0.6620 |
| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |
> **Note on GLUE:** As expected from continued pretraining on Portuguese, English
> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301),
> while this model scores 0.7354.
---
## Key Findings (from the paper's ablations)
1. **Continued pretraining > training from scratch.** The gap is largest for long context:
`moBERTo` achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline
trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining
provides representations that transfer effectively even when continued pretraining
itself uses only 1,024-token sequences.
2. **Tokenizer adaptation helps token-level tasks but disrupts long context.** Moving
to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192, which
drops by ~11 nDCG@10 points when no embedding transfer is applied.
3. **SWM embedding transfer mitigates the long-context degradation.** By initializing
new Portuguese embeddings as combinations of the original subword embeddings, SWM
recovers most of the long-context performance lost by tokenizer adaptation alone.
4. **Long-context post-training yields the strongest reranker.** `moBERTo`
(this model) achieves the highest average reranking nDCG@10 (0.5255) and the best
PLUE-PT score (0.6980).
---
## Training Data
The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.
The training data has been publicly released alongside the model.
---
## Training Procedure
### Phase 1 — Continued pretraining (60B tokens at 1,024 context)
| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 60B (5 epochs over 12B) |
| Max sequence length | 1,024 |
| Batch size | 4,608 |
| Masking rate | 30% |
| Optimizer | StableAdamW |
| Learning rate | 5e-4 |
| Weight decay | 1e-5 |
| Dropout (attn output) | 0.1 |
| Dropout (other) | 0.0 |
| Precision | bfloat16 |
| RoPE base (global attn) | 160,000 |
| RoPE base (local attn) | 10,000 |
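The two RoPE bases in the table can be compared by the wavelengths they produce. In the standard RoPE formulation, frequency pair `i` rotates with angle `base^(-2i/d)` per position, so its wavelength is `2π · base^(2i/d)` positions. The head dimension of 64 below is an assumption (768 hidden size over 12 heads, as in ModernBERT-base); the card itself does not list it.

```python
import math

def rope_wavelengths(base: float, head_dim: int) -> list:
    """Wavelength (in token positions) of each rotary frequency pair."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

HEAD_DIM = 64  # assumption: 768 hidden size / 12 heads, as in ModernBERT-base

for name, base in [("global", 160_000), ("local", 10_000)]:
    longest = rope_wavelengths(base, HEAD_DIM)[-1]
    print(f"{name}: longest wavelength ~ {longest:,.0f} positions")
```

Under these assumptions the larger base stretches the slowest rotations well beyond the 8,192-token window, which is the usual motivation for using a higher base in the global-attention layers.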
### Phase 2 — Long-context post-training (10B tokens at 8,192 context)
Same hyperparameters as Phase 1, except:
| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 10B |
| Max sequence length | 8,192 |
| Batch size | 576 |
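The two batch sizes look very different, but a quick calculation suggests they were chosen to keep the token budget per step constant. This assumes the batch size counts sequences and that every sequence is at full length (an upper bound when unpadding is used); the step counts are derived here, not reported in the card.

```python
PHASES = {
    "phase 1": {"batch_size": 4_608, "seq_len": 1_024, "tokens": 60e9},
    "phase 2": {"batch_size": 576, "seq_len": 8_192, "tokens": 10e9},
}

def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at full length)."""
    return batch_size * seq_len

for name, p in PHASES.items():
    per_step = tokens_per_step(p["batch_size"], p["seq_len"])
    steps = p["tokens"] / per_step
    print(f"{name}: {per_step:,} tokens/step, ~{steps:,.0f} steps")
```

Both phases come out to 4,718,592 tokens per step, which would correspond to roughly 12.7k steps for Phase 1 and 2.1k steps for Phase 2 under the stated assumptions.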
---
## Related Models in the moBERTo Family
| Hugging Face Repo | Paper Name | Tokenizer | Long-ctx post-tr. |
|------------------------------------------------|-----------------------------|-----------|-------------------|
| `Tropic-AI/moBERTo-orig-tokenizer` | moBERTo-8k (orig. tok.) | Original | Yes |
| **`Tropic-AI/moBERTo`** *(this)*               | **moBERTo-SWM-8k (PT tok.)**| PT (SWM)  | **Yes**           |
---
## Citation
```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
year={2026},
eprint={2606.22722},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.22722},
}
```