---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo-orig-tokenizer
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5231
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5089
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.4516
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.6140
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8587
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6910
---

# moBERTo-orig-tokenizer

> **Paper name:** This model is referred to as **`moBERTo-8k (orig. tok.)`** in the
> moBERTo paper. It is the variant that retains the **original ModernBERT tokenizer**
> and adds long-context post-training.

`moBERTo-orig-tokenizer` is a Portuguese adaptation of
[ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base), obtained through
continued pretraining on a curated 12-billion-token Portuguese corpus (5 epochs, 60B
training tokens in total), followed by a 10B-token long-context post-training phase
at a context length of 8,192 tokens.

Unlike the flagship [`moBERTo`](https://huggingface.co/Tropic-AI/moBERTo) variant,
this model **keeps the original ModernBERT tokenizer** rather than adopting a
Portuguese-optimized one. The original-tokenizer variants are the strongest in the
moBERTo family on long-context retrieval: on MLDR at 8,192 tokens this model reaches
0.6140 nDCG@10, second only to `moBERTo-orig-tokenizer-1k` (0.6166), and it has the
best score at 2,048 tokens (0.6025). The trade-off is slightly weaker NER and NLU
performance than the Portuguese-tokenizer variants.

The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.
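The alternating local-global attention layout can be illustrated with a few lines of Python. This is a sketch under assumptions: the layer count (22), the "global every third layer" rule, and the 128-token local window are the defaults published for ModernBERT-base, not values stated in this card, so check the checkpoint's `config.json` before relying on them.

```python
# Assumed ModernBERT-base defaults (not stated in this card; verify in config.json).
NUM_LAYERS = 22
GLOBAL_EVERY_N = 3
LOCAL_WINDOW = 128
MAX_CONTEXT = 8192  # stated in this card


def attention_kind(layer_idx: int, every_n: int = GLOBAL_EVERY_N) -> str:
    """Return 'global' for every n-th layer (starting at layer 0), else 'local'."""
    return "global" if layer_idx % every_n == 0 else "local"


pattern = [attention_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(pattern) if kind == "global"]

print("global layers:", global_layers)
print(
    f"{len(global_layers)} global layers attend over up to {MAX_CONTEXT} tokens; "
    f"{pattern.count('local')} local layers use a {LOCAL_WINDOW}-token window"
)
```

Under these assumptions only about a third of the layers pay the full cost of attending across the whole 8,192-token window.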

---

## Model Details

| Attribute              | Value                                                |
|------------------------|------------------------------------------------------|
| Architecture           | ModernBERT (encoder-only)                            |
| Base checkpoint        | `answerdotai/ModernBERT-base`                        |
| Parameters             | ~150M                                                |
| Max context length     | 8,192 tokens                                         |
| Tokenizer              | **Original ModernBERT tokenizer (unchanged)**        |
| Embedding init         | Inherited from ModernBERT-base                       |
| Pretraining tokens     | 60B (5 epochs over 12B-token corpus)                 |
| Long-context post-tr.  | 10B tokens at 8,192-token context                    |
| Training corpus        | FineWeb-2 (PT subset) filtered with ClassiCC-PT      |
| Framework              | Composer                                             |
| Precision              | bfloat16                                             |
| License                | Apache 2.0                                           |

---

## Quick Start

```python
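import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Tropic-AI/moBERTo-orig-tokenizer"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, hidden_size)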
```

### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Long-document retrieval** (its strongest use case)
- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
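As a rough illustration of the cross-encoder reranking use case, the sketch below scores query-document pairs with a single-logit sequence-classification head and sorts the documents by score. `score_pairs` and `rerank` are hypothetical helper names introduced for this example. The classification head is randomly initialized when loaded from this checkpoint, so the scores only become meaningful after fine-tuning on relevance data.

```python
from typing import List, Sequence, Tuple

MODEL_ID = "Tropic-AI/moBERTo-orig-tokenizer"


def score_pairs(query: str, docs: Sequence[str], model_id: str = MODEL_ID) -> List[float]:
    """Score (query, doc) pairs with a 1-logit head (hypothetical helper).

    The head is freshly initialized here, so fine-tune before trusting the scores.
    """
    # Imported lazily so the pure-Python helper below works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
    model.eval()
    inputs = tokenizer(
        [query] * len(docs),
        list(docs),
        padding=True,
        truncation=True,
        max_length=8192,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (len(docs), 1)
    return logits.squeeze(-1).tolist()


def rerank(docs: Sequence[str], scores: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each document with its score and sort by descending score."""
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
```

After fine-tuning, a call such as `rerank(docs, score_pairs(query, docs))` would return the candidates ordered from most to least relevant.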

---

## Evaluation Results

All metrics are reported on Portuguese benchmarks. Best results are in **bold**;
second-best are <ins>underlined</ins>.

### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model                                  | QUATI            | mMARCO           | Robust04         | **Avg.**         |
|----------------------------------------|------------------|------------------|------------------|------------------|
| BERT-base                              | 0.2846           | 0.4050           | 0.2389           | 0.3095           |
| BERTimbau-base                         | 0.4870           | 0.5005           | 0.4138           | 0.4671           |
| ModernBERT-base                        | 0.3779           | 0.4799           | 0.2988           | 0.3855           |
| NeoBERT-base                           | 0.4000           | 0.4698           | 0.3117           | 0.3938           |
| Qwen3-0.6B-base                        | 0.4248           | 0.5065           | 0.2994           | 0.4102           |
| moBERTo-orig-tokenizer-1k              | 0.5383           | 0.5109           | 0.4510           | 0.5001           |
| **moBERTo-orig-tokenizer (this model)**| 0.5231           | 0.5089           | 0.4516           | 0.4945           |
| moBERTo-1k                             | <ins>0.5410</ins>| **0.5169**       | <ins>0.4782</ins>| <ins>0.5120</ins>|
| moBERTo                                | **0.5609**       | <ins>0.5147</ins>| **0.5010**       | **0.5255**       |
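For readers less familiar with the metric, nDCG@10 can be sketched as follows. This is a minimal version with linear gain and a log2 discount, and it assumes `relevances` lists the graded relevance of every judged candidate in the order the reranker returned them. The evaluation tooling and gain function behind the numbers above are not specified in this card, so treat it as an illustration of the metric rather than the exact scoring code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# A relevant document at rank 1 is perfect; pushing it to rank 2 is penalized.
print(ndcg_at_k([1, 0, 0]))
print(round(ndcg_at_k([0, 1, 0]), 4))
```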

### Long-Context Retrieval (MLDR, nDCG@10)

| Model                                  | 512              | 2,048            | 4,096            | 8,192            |
|----------------------------------------|------------------|------------------|------------------|------------------|
| ModernBERT-base                        | 0.4054           | 0.4206           | 0.3015           | 0.2867           |
| NeoBERT-base                           | 0.4746           | 0.5149           | 0.4676           | --               |
| Qwen3-0.6B-base                        | 0.3560           | 0.4023           | 0.4241           | 0.5351           |
| moBERTo-orig-tokenizer-1k              | **0.5834**       | <ins>0.5909</ins>| **0.6286**       | **0.6166**       |
| **moBERTo-orig-tokenizer (this model)**| 0.5674           | **0.6025**       | 0.5876           | <ins>0.6140</ins>|
| moBERTo-1k                             | 0.5466           | 0.4791           | 0.5714           | 0.5857           |
| moBERTo                                | <ins>0.5827</ins>| 0.5606           | <ins>0.5905</ins>| 0.5777           |

### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                                  | Docs             | Educ.            | **Avg.**         |
|----------------------------------------|------------------|------------------|------------------|
| BERT-base                              | 0.8700           | 0.5690           | 0.7195           |
| BERTimbau-base                         | 0.8978           | 0.6382           | 0.7680           |
| ModernBERT-base                        | 0.8416           | 0.5730           | 0.7073           |
| NeoBERT-base                           | 0.8970           | 0.6266           | 0.7618           |
| Qwen3-0.6B-base                        | **0.9120**       | 0.6289           | 0.7705           |
| moBERTo-orig-tokenizer-1k              | 0.8942           | 0.6070           | 0.7506           |
| **moBERTo-orig-tokenizer (this model)**| 0.8962           | 0.6035           | 0.7499           |
| moBERTo-1k                             | 0.9024           | 0.6281           | 0.7653           |
| moBERTo                                | 0.9039           | <ins>0.6394</ins>| <ins>0.7717</ins>|
| NeoBERT-PT                             | 0.9030           | **0.6428**       | **0.7729**       |
| Qwen3-0.6B-PT                          | <ins>0.9070</ins>| 0.6311           | 0.7691           |

### NLU and NER (F1)

| Model                                  | PLUE-PT          | LeNER-Br         | GLUE (English)   |
|----------------------------------------|------------------|------------------|------------------|
| BERT-base                              | 0.6423           | 0.8500           | <ins>0.7815</ins>|
| BERTimbau-base                         | 0.6800           | **0.9040**       | 0.6772           |
| ModernBERT-base                        | 0.6420           | 0.8240           | **0.8301**       |
| NeoBERT-base                           | 0.6654           | 0.8590           | 0.7430           |
| Qwen3-0.6B-base                        | 0.6343           | 0.7020           | 0.7260           |
| moBERTo-orig-tokenizer-1k              | 0.6849           | 0.8371           | 0.7705           |
| **moBERTo-orig-tokenizer (this model)**| 0.6910           | 0.8587           | 0.7724           |
| moBERTo-1k                             | <ins>0.6959</ins>| 0.8710           | 0.7128           |
| moBERTo                                | **0.6980**       | 0.8726           | 0.7354           |
| NeoBERT-PT                             | 0.6842           | <ins>0.8840</ins>| 0.6620           |
| Qwen3-0.6B-PT                          | 0.6632           | 0.7100           | 0.7050           |


---

## Training Data

The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

---

## Training Procedure

### Phase 1 — Continued pretraining (60B tokens at 1,024 context)

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 60B (5 epochs over 12B)     |
| Max sequence length      | 1,024                       |
| Batch size               | 4,608                       |
| Masking rate             | 30%                         |
| Optimizer                | StableAdamW                 |
| Learning rate            | 5e-4                        |
| Weight decay             | 1e-5                        |
| Dropout (attn output)    | 0.1                         |
| Dropout (other)          | 0.0                         |
| Precision                | bfloat16                    |
| RoPE base (global attn)  | 160,000                     |
| RoPE base (local attn)   | 10,000                      |
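The two RoPE bases control how slowly the rotary frequencies cycle. The sketch below uses the standard RoPE formula, `base ** (-2i / d)`, and assumes a head dimension of 64 (768 hidden size over 12 heads, as in ModernBERT-base). The head dimension is not stated in this card, so it is an assumption.

```python
import math
from typing import List

HEAD_DIM = 64  # assumption: 768 hidden size / 12 heads in ModernBERT-base


def rope_inv_freq(base: float, head_dim: int = HEAD_DIM) -> List[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base: float, head_dim: int = HEAD_DIM) -> float:
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


for name, base in [("global", 160_000), ("local", 10_000)]:
    print(f"{name}: longest wavelength ~ {longest_wavelength(base):,.0f} positions")
```

Under these assumptions the larger global base stretches the slowest rotation over far more positions than the local base, which is the usual motivation for raising the base on layers that attend across the full 8,192-token window.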

### Phase 2 — Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 10B                         |
| Max sequence length      | 8,192                       |
| Batch size               | 576                         |
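The two batch sizes appear chosen so that the number of tokens per optimizer step stays the same across phases. The check below assumes that "batch size" counts sequences and that each sequence is filled to the maximum length. With unpadding the real token count per step can be lower, so the step counts are rough estimates only.

```python
import math


def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences at max length)."""
    return batch_size * seq_len


def approx_steps(total_tokens: int, per_step: int) -> int:
    """Rough number of optimizer steps needed to consume total_tokens."""
    return math.ceil(total_tokens / per_step)


phase1 = tokens_per_step(4_608, 1_024)  # Phase 1: continued pretraining
phase2 = tokens_per_step(576, 8_192)    # Phase 2: long-context post-training

print(phase1, phase2, phase1 == phase2)
print("phase 1 steps ~", approx_steps(60_000_000_000, phase1))
print("phase 2 steps ~", approx_steps(10_000_000_000, phase2))
```

Keeping the token budget per step fixed means the shared hyperparameters (learning rate, weight decay) see a comparable amount of data per update in both phases.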


---

## Related Models in the moBERTo Family

| Hugging Face Repo                                | Paper Name                  | Tokenizer | Long-ctx post-tr. |
|--------------------------------------------------|-----------------------------|-----------|-------------------|
| **`Tropic-AI/moBERTo-orig-tokenizer` *(this)***  | **moBERTo-8k (orig. tok.)** | Original  | **Yes**           |
| `Tropic-AI/moBERTo`                              | moBERTo-SWM-8k (PT tok.)    | PT (SWM)  | Yes               |

---

## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT}, 
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722}, 
}
```