---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5609
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5147
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.5010
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.5777
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8726
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6980
---

# moBERTo

> **Paper name:** This model is referred to as **`moBERTo-SWM-8k (PT tok.)`** in the
> moBERTo paper. It is the **best-performing variant** of the moBERTo family on
> reranking and NLU, achieving the highest average reranking nDCG@10 across three
> Portuguese retrieval benchmarks and the best PLUE-PT score.

`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base),
obtained through continued pretraining on a curated 12-billion-token Portuguese corpus
(60B training tokens, 5 epochs) followed by a long-context post-training phase at
8,192-token context.

It combines four adaptation strategies:

1. **Continued pretraining** from the original ModernBERT-base checkpoint, preserving
   the long-context capabilities learned during the original 2T-token English pretraining.
2. **Portuguese tokenizer** with vocabulary optimized for Portuguese text.
3. **Subword Matching (SWM) embedding transfer**, which initializes each new Portuguese
   token's embedding as a combination of the original ModernBERT subword embeddings,
   keeping the model close to its pretrained representation space.
4. **Long-context post-training** at 8,192 tokens for an additional 10B tokens.
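
The Subword Matching idea in item 3 can be illustrated with a minimal sketch. It assumes the "combination" is a plain mean of the original subword embeddings and uses a toy greedy longest-match splitter in place of the original ModernBERT tokenizer; the paper's exact weighting and fallback rule may differ, and the names `greedy_pieces` and `swm_init` are hypothetical.

```python
import numpy as np

def greedy_pieces(token, old_vocab):
    """Split a token into old-vocabulary pieces by greedy longest match.

    Toy stand-in for running the original tokenizer on the new token's text.
    """
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in old_vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            i += 1  # character not covered by the old vocabulary: skip it
    return pieces

def swm_init(new_vocab, old_vocab, old_emb, rng=None):
    """Build the new embedding matrix from old subword embeddings.

    Assumption: each new token gets the mean of its old pieces' embeddings;
    tokens with no matching piece fall back to small random noise.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    dim = old_emb.shape[1]
    new_emb = np.empty((len(new_vocab), dim), dtype=old_emb.dtype)
    for token, new_id in new_vocab.items():
        ids = [old_vocab[p] for p in greedy_pieces(token, old_vocab)]
        if ids:
            new_emb[new_id] = old_emb[ids].mean(axis=0)
        else:
            new_emb[new_id] = rng.normal(0.0, 0.02, size=dim)
    return new_emb

# Toy vocabularies and embeddings, for illustration only
old_vocab = {"ca": 0, "sa": 1, "s": 2}
old_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
new_vocab = {"casa": 0, "casas": 1, "xyz": 2}
new_emb = swm_init(new_vocab, old_vocab, old_emb)
```

In this toy setup `casa` is initialized from the pieces `ca` and `sa`, so its vector sits between two embeddings the pretrained model already knows, which is the sense in which the transfer keeps the model close to its original representation space.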

The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.

---

## Model Details

| Attribute              | Value                                                |
|------------------------|------------------------------------------------------|
| Architecture           | ModernBERT (encoder-only)                            |
| Base checkpoint        | `answerdotai/ModernBERT-base`                        |
| Parameters             | ~150M                                                |
| Max context length     | 8,192 tokens                                         |
| Tokenizer              | Portuguese (custom vocabulary)                       |
| Embedding init         | Subword Matching (SWM) transfer                      |
| Pretraining tokens     | 60B (5 epochs over 12B-token corpus)                 |
| Long-context post-tr.  | 10B tokens at 8,192-token context                    |
| Training corpus        | FineWeb-2 (PT subset) filtered with ClassiCC-PT      |
| Framework              | Composer                                             |
| Precision              | bfloat16                                             |
| License                | Apache 2.0                                           |

---

## Quick Start

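```python
import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModel.from_pretrained("Tropic-AI/moBERTo").eval()
long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state: (batch, seq_len, hidden)
```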

### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
- **Long-document retrieval** (up to 8,192 tokens)

---

## Evaluation Results

All metrics are reported on Portuguese benchmarks. Best results are in **bold**;
second-best are <ins>underlined</ins>.

### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model                       | QUATI            | mMARCO           | Robust04         | **Avg.**         |
|-----------------------------|------------------|------------------|------------------|------------------|
| BERT-base                   | 0.2846           | 0.4050           | 0.2389           | 0.3095           |
| BERTimbau-base              | 0.4870           | 0.5005           | 0.4138           | 0.4671           |
| ModernBERT-base             | 0.3779           | 0.4799           | 0.2988           | 0.3855           |
| NeoBERT-base                | 0.4000           | 0.4698           | 0.3117           | 0.3938           |
| Qwen3-0.6B-base             | 0.4248           | 0.5065           | 0.2994           | 0.4102           |
| moBERTo-orig-tokenizer-1k   | 0.5383           | 0.5109           | 0.4510           | 0.5001           |
| moBERTo-orig-tokenizer      | 0.5231           | 0.5089           | 0.4516           | 0.4945           |
| moBERTo-1k                  | <ins>0.5410</ins>| **0.5169**       | <ins>0.4782</ins>| <ins>0.5120</ins>|
| **moBERTo (this model)**    | **0.5609**       | <ins>0.5147</ins>| **0.5010**       | **0.5255**       |
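
For readers unfamiliar with the metric, nDCG@10 can be sketched as follows. This is a minimal version that assumes linear gain and computes the ideal ranking from the same candidate list; the official evaluators for each benchmark may use a different gain function or derive the ideal ranking from all judged documents, so it is not meant to reproduce the numbers above.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of candidates, in the order the reranker returned them
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])
```

A ranking that already places documents in decreasing relevance scores 1.0, while pushing the only relevant document from rank 1 to rank 2 lowers the score because of the logarithmic position discount.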

### Long-Context Retrieval (MLDR, nDCG@10)

| Model                       | 512              | 2,048            | 4,096            | 8,192            |
|-----------------------------|------------------|------------------|------------------|------------------|
| ModernBERT-base             | 0.4054           | 0.4206           | 0.3015           | 0.2867           |
| NeoBERT-base                | 0.4746           | 0.5149           | 0.4676           | --               |
| Qwen3-0.6B-base             | 0.3560           | 0.4023           | 0.4241           | 0.5351           |
| moBERTo-orig-tokenizer-1k   | **0.5834**       | <ins>0.5909</ins>| **0.6286**       | **0.6166**       |
| moBERTo-orig-tokenizer      | 0.5674           | **0.6025**       | 0.5876           | <ins>0.6140</ins>|
| moBERTo-1k                  | 0.5466           | 0.4791           | 0.5714           | 0.5857           |
| **moBERTo (this model)**    | <ins>0.5827</ins>| 0.5606           | <ins>0.5905</ins>| 0.5777           |

### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                       | Docs             | Educ.            | **Avg.**         |
|-----------------------------|------------------|------------------|------------------|
| BERT-base                   | 0.8700           | 0.5690           | 0.7195           |
| BERTimbau-base              | 0.8978           | 0.6382           | 0.7680           |
| ModernBERT-base             | 0.8416           | 0.5730           | 0.7073           |
| NeoBERT-base                | 0.8970           | 0.6266           | 0.7618           |
| Qwen3-0.6B-base             | **0.9120**       | 0.6289           | 0.7705           |
| moBERTo-orig-tokenizer-1k   | 0.8942           | 0.6070           | 0.7506           |
| moBERTo-orig-tokenizer      | 0.8962           | 0.6035           | 0.7499           |
| moBERTo-1k                  | 0.9024           | 0.6281           | 0.7653           |
| **moBERTo (this model)**    | 0.9039           | <ins>0.6394</ins>| <ins>0.7717</ins>|
| NeoBERT-PT                  | 0.9030           | **0.6428**       | **0.7729**       |
| Qwen3-0.6B-PT               | <ins>0.9070</ins>| 0.6311           | 0.7691           |

### NLU and NER (F1)

| Model                       | PLUE-PT          | LeNER-Br         | GLUE (English)   |
|-----------------------------|------------------|------------------|------------------|
| BERT-base                   | 0.6423           | 0.8500           | <ins>0.7815</ins>|
| BERTimbau-base              | 0.6800           | **0.9040**       | 0.6772           |
| ModernBERT-base             | 0.6420           | 0.8240           | **0.8301**       |
| NeoBERT-base                | 0.6654           | 0.8590           | 0.7430           |
| Qwen3-0.6B-base             | 0.6343           | 0.7020           | 0.7260           |
| moBERTo-orig-tokenizer-1k   | 0.6849           | 0.8371           | 0.7705           |
| moBERTo-orig-tokenizer      | 0.6910           | 0.8587           | 0.7724           |
| moBERTo-1k                  | <ins>0.6959</ins>| 0.8710           | 0.7128           |
| **moBERTo (this model)**    | **0.6980**       | 0.8726           | 0.7354           |
| NeoBERT-PT                  | 0.6842           | <ins>0.8840</ins>| 0.6620           |
| Qwen3-0.6B-PT               | 0.6632           | 0.7100           | 0.7050           |

> **Note on GLUE:** As expected from continued pretraining on Portuguese, English
> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301),
> while this model scores 0.7354.

---

## Key Findings (from the paper's ablations)

1. **Continued pretraining > training from scratch.** The gap is largest for long context:
   `moBERTo` achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline
   trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining
   provides representations that transfer effectively even when continued pretraining
   itself uses only 1,024-token sequences.
2. **Tokenizer adaptation helps token-level tasks but disrupts long context.** Moving
   to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops
   by ~11 points without embedding transfer).
3. **SWM embedding transfer mitigates the long-context degradation.** By initializing
   new Portuguese embeddings as combinations of the original subword embeddings, SWM
   recovers most of the long-context performance lost by tokenizer adaptation alone.
4. **Long-context post-training yields the strongest reranker.** `moBERTo`
   (this model) achieves the highest average reranking nDCG@10 (0.5255) and the best
   PLUE-PT score (0.6980).

---

## Training Data

The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

---

## Training Procedure

### Phase 1 — Continued pretraining (60B tokens at 1,024 context)

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 60B (5 epochs over 12B)     |
| Max sequence length      | 1,024                       |
| Batch size               | 4,608                       |
| Masking rate             | 30%                         |
| Optimizer                | StableAdamW                 |
| Learning rate            | 5e-4                        |
| Weight decay             | 1e-5                        |
| Dropout (attn output)    | 0.1                         |
| Dropout (other)          | 0.0                         |
| Precision                | bfloat16                    |
| RoPE base (global attn)  | 160,000                     |
| RoPE base (local attn)   | 10,000                      |
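
To give a feel for what the two RoPE bases in the table imply, the sketch below computes the rotation wavelength of each dimension pair using the standard RoPE frequency formula, theta_i = base^(-2i/d). The head dimension of 64 is an assumption (768 hidden units over 12 heads, as in ModernBERT-base) and is not stated in this card.

```python
import math

def rope_wavelengths(base, head_dim):
    """Wavelength, in positions, of each rotary dimension pair.

    Standard RoPE uses theta_i = base ** (-2i / head_dim), so the wavelength
    of pair i is 2 * pi / theta_i.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

HEAD_DIM = 64  # assumption: 768 hidden size / 12 attention heads

global_wl = rope_wavelengths(160_000, HEAD_DIM)  # global-attention layers
local_wl = rope_wavelengths(10_000, HEAD_DIM)    # local-attention layers

for name, wl in [("global", global_wl), ("local", local_wl)]:
    print(f"{name}: shortest {wl[0]:.2f}, longest {wl[-1]:.0f} positions")
```

The larger base used for global attention stretches the slowest-rotating dimensions over many more positions, which is the usual motivation for raising the base when attention must span thousands of tokens.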

### Phase 2 — Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 10B                         |
| Max sequence length      | 8,192                       |
| Batch size               | 576                         |
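
The batch sizes of the two phases can be related with a short calculation. It assumes that batch size is counted in sequences and that every sequence is packed to the maximum length, neither of which is stated explicitly above; the resulting step counts are therefore rough estimates, not figures from the paper.

```python
def tokens_per_step(batch_size, seq_len):
    """Tokens processed per optimizer step, assuming fully packed sequences."""
    return batch_size * seq_len

def steps_for_budget(total_tokens, batch_size, seq_len):
    """Optimizer steps needed to consume a token budget (ceiling division)."""
    per_step = tokens_per_step(batch_size, seq_len)
    return -(-total_tokens // per_step)

# (token budget, batch size, max sequence length) taken from the tables above
phase1 = (60_000_000_000, 4_608, 1_024)
phase2 = (10_000_000_000, 576, 8_192)

phase1_tokens = tokens_per_step(*phase1[1:])
phase2_tokens = tokens_per_step(*phase2[1:])
phase1_steps = steps_for_budget(*phase1)
phase2_steps = steps_for_budget(*phase2)
```

Under these assumptions both phases process the same number of tokens per step (4,608 × 1,024 = 576 × 8,192 = 4,718,592), so the 8× longer sequences in Phase 2 are offset by an 8× smaller batch.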

---

## Related Models in the moBERTo Family

| Hugging Face Repo                              | Paper Name                  | Tokenizer | Long-ctx post-tr. |
|------------------------------------------------|-----------------------------|-----------|-------------------|
| `Tropic-AI/moBERTo-orig-tokenizer`             | moBERTo-8k (orig. tok.)     | Original  | Yes               |
| **`Tropic-AI/moBERTo`** *(this)*               | **moBERTo-SWM-8k (PT tok.)**| PT (SWM)  | **Yes**           |

---

## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT}, 
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722}, 
}
```