---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5609
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5147
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.5010
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.5777
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8726
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6980
---
# moBERTo
> **Paper name:** This model is referred to as **`moBERTo-SWM-8k (PT tok.)`** in the
> moBERTo paper. It is the **best-performing variant** of the moBERTo family, achieving
> the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks
> and the best PLUE-PT score.

`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base),
obtained through continued pretraining on a curated 12-billion-token Portuguese corpus
(60B training tokens, 5 epochs) followed by a long-context post-training phase at
8,192-token context.

It combines four adaptation strategies:

1. **Continued pretraining** from the original ModernBERT-base checkpoint, preserving
   the long-context capabilities learned during the original 2T-token English pretraining.
2. **Portuguese tokenizer** with vocabulary optimized for Portuguese text.
3. **Subword Matching (SWM) embedding transfer**, which initializes each new Portuguese
   token's embedding as a combination of the original ModernBERT subword embeddings,
   keeping the model close to its pretrained representation space.
4. **Long-context post-training** at 8,192 tokens for an additional 10B tokens.

The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.

---
## Model Details
| Attribute | Value |
|------------------------|------------------------------------------------------|
| Architecture | ModernBERT (encoder-only) |
| Base checkpoint | `answerdotai/ModernBERT-base` |
| Parameters | ~150M |
| Max context length | 8,192 tokens |
| Tokenizer | Portuguese (custom vocabulary) |
| Embedding init | Subword Matching (SWM) transfer |
| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
| Long-context post-tr. | 10B tokens at 8,192-token context |
| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework | Composer |
| Precision | bfloat16 |
| License | Apache 2.0 |
---
## Quick Start
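```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo")
long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```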
### Recommended for downstream tasks
This model is best used as a backbone for fine-tuning on:
- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
- **Long-document retrieval** (up to 8,192 tokens)
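
As a rough illustration of the cross-encoder reranking use case listed above, the sketch below orders candidate passages by a relevance score computed jointly over each (query, passage) pair. The `rerank` helper and the `overlap_score` scorer are hypothetical names introduced here. `overlap_score` is only a toy stand-in so the example runs offline; in practice the scoring function would wrap a moBERTo checkpoint fine-tuned for sequence classification, which this sketch does not include.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_k pairs by score."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    # list.sort is stable, so tied passages keep their original candidate order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in scorer (hypothetical): fraction of query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / max(len(query_words), 1)


candidates = [
    "O Brasil é o maior país da América do Sul.",
    "A capital do Brasil é Brasília.",
    "Lisboa é a capital de Portugal.",
]
ranked = rerank("capital do Brasil", candidates, overlap_score, top_k=2)
```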
---
## Evaluation Results
All metrics are reported on Portuguese benchmarks. Best results are in **bold**;
second-best are underlined.
### Information Retrieval (Reranking, nDCG@10)
Cross-encoder reranking, fine-tuned on mMARCO-PT triples.
| Model | QUATI | mMARCO | Robust04 | **Avg.** |
|-----------------------------|------------------|------------------|------------------|------------------|
| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
| moBERTo-orig-tokenizer | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
| moBERTo-1k                  | <u>0.5410</u>    | **0.5169**       | <u>0.4782</u>    | <u>0.5120</u>    |
| **moBERTo (this model)**    | **0.5609**       | <u>0.5147</u>    | **0.5010**       | **0.5255**       |
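
For readers unfamiliar with the metric, nDCG@10 can be sketched as follows. This is a minimal version that assumes linear gain (the relevance label itself) and a log2 rank discount, and it builds the ideal ordering from the labels of the ranked list only. The evaluation tooling behind the table may use a different gain variant or a different set of judged documents, so treat this as an illustration of the definition, not a reproduction of the reported numbers.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the produced ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance labels of the passages, in the order the reranker returned them.
perfect = ndcg_at_k([3, 2, 1, 0])  # already in ideal order
swapped = ndcg_at_k([0, 2, 1, 3])  # most relevant passage ranked last
```

A ranking that already places passages in decreasing relevance scores 1.0; pushing the most relevant passage down the list lowers the score.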
### Long-Context Retrieval (MLDR, nDCG@10)
| Model | 512 | 2,048 | 4,096 | 8,192 |
|-----------------------------|------------------|------------------|------------------|------------------|
| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
| moBERTo-orig-tokenizer-1k   | **0.5834**       | <u>0.5909</u>    | **0.6286**       | **0.6166**       |
| moBERTo-orig-tokenizer      | 0.5674           | **0.6025**       | 0.5876           | <u>0.6140</u>    |
| moBERTo-1k                  | 0.5466           | 0.4791           | 0.5714           | 0.5857           |
| **moBERTo (this model)**    | <u>0.5827</u>    | 0.5606           | <u>0.5905</u>    | 0.5777           |
### Classification (F1)
- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                       | Docs             | Educ.            | **Avg.**         |
|-----------------------------|------------------|------------------|------------------|
| BERT-base | 0.8700 | 0.5690 | 0.7195 |
| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
| Qwen3-0.6B-base | **0.9120** | 0.6289 | 0.7705 |
| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
| moBERTo-orig-tokenizer | 0.8962 | 0.6035 | 0.7499 |
| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
| **moBERTo (this model)**    | 0.9039           | <u>0.6394</u>    | <u>0.7717</u>    |
| NeoBERT-PT                  | 0.9030           | **0.6428**       | **0.7729**       |
| Qwen3-0.6B-PT               | <u>0.9070</u>    | 0.6311           | 0.7691           |
### NLU and NER (F1)
| Model | PLUE-PT | LeNER-Br | GLUE (English) |
|-----------------------------|------------------|------------------|------------------|
| BERT-base                   | 0.6423           | 0.8500           | <u>0.7815</u>    |
| BERTimbau-base | 0.6800 | **0.9040** | 0.6772 |
| ModernBERT-base | 0.6420 | 0.8240 | **0.8301** |
| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
| moBERTo-orig-tokenizer | 0.6910 | 0.8587 | 0.7724 |
| moBERTo-1k                  | <u>0.6959</u>    | 0.8710           | 0.7128           |
| **moBERTo (this model)**    | **0.6980**       | 0.8726           | 0.7354           |
| NeoBERT-PT                  | 0.6842           | <u>0.8840</u>    | 0.6620           |
| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |
> **Note on GLUE:** As expected from continued pretraining on Portuguese, English
> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301),
> while this model scores 0.7354.
---
## Key Findings (from the paper's ablations)
1. **Continued pretraining > training from scratch.** The gap is largest for long
context: `moBERTo` achieves 0.5777 on MLDR at 8,192 tokens vs. 0.1405 for a
from-scratch baseline trained on the same Portuguese budget. The original 2T-token
ModernBERT pretraining provides representations that transfer effectively even when
continued pretraining itself uses only 1,024-token sequences.
2. **Tokenizer adaptation helps token-level tasks but disrupts long context.** Moving
to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops
by ~11 points without embedding transfer).
3. **SWM embedding transfer mitigates the long-context degradation.** By initializing
new Portuguese embeddings as combinations of the original subword embeddings, SWM
recovers most of the long-context performance lost by tokenizer adaptation alone.
4. **Long-context post-training yields the strongest reranker.** `moBERTo`
(this model) achieves the highest average reranking nDCG@10 (0.5255) and the best
PLUE-PT score (0.6980).
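
The SWM initialization described in finding 3 can be sketched with a toy example. The version below assumes the "combination" is an unweighted mean and uses a greedy longest-prefix split over a tiny made-up vocabulary with 2-dimensional embeddings. The helper names `segment` and `init_embedding`, the vocabulary, and the averaging rule are all assumptions for illustration; the paper's exact matching and weighting scheme is not specified in this card.

```python
from typing import Dict, List

# Toy "original" vocabulary with 2-d embeddings (hypothetical values).
old_vocab: Dict[str, List[float]] = {
    "inform": [1.0, 0.0],
    "ação": [0.0, 1.0],
}


def segment(token: str, vocab: Dict[str, List[float]]) -> List[str]:
    """Greedy longest-prefix split of a new token into original-vocabulary pieces."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            raise ValueError(f"no original subword covers {token[start:]!r}")
    return pieces


def init_embedding(token: str, vocab: Dict[str, List[float]]) -> List[float]:
    """Initialize a new token's embedding as the mean of its pieces' embeddings."""
    pieces = segment(token, vocab)
    dim = len(next(iter(vocab.values())))
    return [sum(vocab[p][i] for p in pieces) / len(pieces) for i in range(dim)]


# A new Portuguese token that the original vocabulary splits into two pieces.
new_embedding = init_embedding("informação", old_vocab)
```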
---
## Training Data
The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

---
## Training Procedure
### Phase 1 — Continued pretraining (60B tokens at 1,024 context)
| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 60B (5 epochs over 12B) |
| Max sequence length | 1,024 |
| Batch size | 4,608 |
| Masking rate | 30% |
| Optimizer | StableAdamW |
| Learning rate | 5e-4 |
| Weight decay | 1e-5 |
| Dropout (attn output) | 0.1 |
| Dropout (other) | 0.0 |
| Precision | bfloat16 |
| RoPE base (global attn) | 160,000 |
| RoPE base (local attn) | 10,000 |
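
To make the 30% masking rate concrete, here is a minimal masked-language-modeling sketch. It assumes that exactly 30% of positions are sampled and that every selected position is replaced by the mask token; the actual training recipe may sample per token or mix in random and unchanged tokens. The `mask_tokens` helper, the token IDs, and the mask ID are hypothetical.

```python
import random
from typing import List, Optional, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_tokens(
    token_ids: List[int],
    mask_id: int,
    rate: float = 0.30,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Replace a `rate` fraction of positions with mask_id; label only those positions."""
    if not token_ids:
        return [], []
    rng = random.Random(seed)
    n_mask = min(len(token_ids), max(1, round(len(token_ids) * rate)))
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    inputs = [mask_id if i in positions else tok for i, tok in enumerate(token_ids)]
    labels = [tok if i in positions else IGNORE_INDEX for i, tok in enumerate(token_ids)]
    return inputs, labels


# Hypothetical token IDs and mask ID, chosen only for the demonstration.
token_ids = list(range(100, 200))
inputs, labels = mask_tokens(token_ids, mask_id=4, rate=0.30, seed=0)
```

With 100 input tokens, 30 positions are masked, and the loss would be computed only at those positions because every other label is set to `IGNORE_INDEX`.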
### Phase 2 — Long-context post-training (10B tokens at 8,192 context)
Same hyperparameters as Phase 1, except:
| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 10B |
| Max sequence length | 8,192 |
| Batch size | 576 |
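
The two batch sizes appear to be chosen so that the number of tokens per optimizer step stays the same across phases. The arithmetic below checks this under two assumptions that the tables do not state explicitly: batch size is counted in sequences, and every sequence is filled to the maximum length. The helper names are hypothetical, and the step counts are estimates derived from these assumptions, not reported values.

```python
def tokens_per_batch(batch_size: int, seq_len: int) -> int:
    """Tokens in one batch, assuming every sequence is filled to seq_len."""
    return batch_size * seq_len


def approx_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Approximate number of optimizer steps needed to consume total_tokens."""
    return round(total_tokens / tokens_per_batch(batch_size, seq_len))


phase1_tokens = tokens_per_batch(4_608, 1_024)  # 4,718,592 tokens per batch
phase2_tokens = tokens_per_batch(576, 8_192)    # 4,718,592 tokens per batch

phase1_steps = approx_steps(60_000_000_000, 4_608, 1_024)  # about 12,716 steps
phase2_steps = approx_steps(10_000_000_000, 576, 8_192)    # about 2,119 steps
```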
---
## Related Models in the moBERTo Family
| Hugging Face Repo | Paper Name | Tokenizer | Long-ctx post-tr. |
|------------------------------------------------|-----------------------------|-----------|-------------------|
| `Tropic-AI/moBERTo-orig-tokenizer` | moBERTo-8k (orig. tok.) | Original | Yes |
| **`Tropic-AI/moBERTo`** *(this)*               | **moBERTo-SWM-8k (PT tok.)**| PT (SWM)  | **Yes**           |
---
## Citation
```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
year={2026},
eprint={2606.22722},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.22722},
}
```