---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5609
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5147
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.5010
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.5777
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8726
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6980
---
# moBERTo

> **Paper name:** This model is referred to as **`moBERTo-SWM-8k (PT tok.)`** in the
> moBERTo paper. It is the **best-performing variant** of the moBERTo family, achieving
> the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks
> and the best PLUE-PT score.

`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base),
obtained through continued pretraining on a curated 12-billion-token Portuguese corpus
(60B training tokens, 5 epochs) followed by a long-context post-training phase at
8,192-token context.

It combines four adaptation strategies:

1. **Continued pretraining** from the original ModernBERT-base checkpoint, preserving
   the long-context capabilities learned during the original 2T-token English pretraining.
2. **Portuguese tokenizer** with vocabulary optimized for Portuguese text.
3. **Subword Matching (SWM) embedding transfer**, which initializes each new Portuguese
   token's embedding as a combination of the original ModernBERT subword embeddings,
   keeping the model close to its pretrained representation space.
4. **Long-context post-training** at 8,192 tokens for an additional 10B tokens.

The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.
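
As a rough illustration of strategy 3, the sketch below initializes each token of a new vocabulary from the embeddings of the pieces that the original vocabulary would split it into. It assumes the "combination" is a plain average and uses a toy greedy longest-match splitter. The vocabularies, vectors, and function names (`split_with_old_vocab`, `swm_init`) are hypothetical and are not taken from the moBERTo code or paper, which may weight or match subwords differently.

```python
from statistics import fmean

# Toy "original" vocabulary with 2-dimensional embeddings (hypothetical values).
old_embeddings = {
    "ca": [1.0, 0.0],
    "sa": [0.0, 1.0],
    "s": [2.0, 2.0],
    "c": [0.5, 0.5],
    "a": [0.1, 0.3],
}


def split_with_old_vocab(token, vocab):
    """Greedy longest-match split of `token` into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(token):
        for end in range(len(token), start, -1):
            if token[start:end] in vocab:
                pieces.append(token[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot split {token!r} with the old vocabulary")
    return pieces


def swm_init(new_vocab, old_emb):
    """Initialize each new token as the mean of its old-subword embeddings."""
    new_emb = {}
    for token in new_vocab:
        if token in old_emb:  # token already existed: copy its vector
            new_emb[token] = list(old_emb[token])
            continue
        pieces = split_with_old_vocab(token, old_emb)
        vectors = [old_emb[p] for p in pieces]
        # Average dimension by dimension (assumption: unweighted mean).
        new_emb[token] = [fmean(dim) for dim in zip(*vectors)]
    return new_emb


new_embeddings = swm_init(["casa", "casas", "ca"], old_embeddings)
print(new_embeddings)
```

Because every new vector is built from vectors the pretrained model already knows, the new embedding matrix starts inside the original representation space instead of at a random point.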

---

## Model Details

| Attribute | Value |
|------------------------|------------------------------------------------------|
| Architecture | ModernBERT (encoder-only) |
| Base checkpoint | `answerdotai/ModernBERT-base` |
| Parameters | ~150M |
| Max context length | 8,192 tokens |
| Tokenizer | Portuguese (custom vocabulary) |
| Embedding init | Subword Matching (SWM) transfer |
| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
| Long-context post-tr. | 10B tokens at 8,192-token context |
| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework | Composer |
| Precision | bfloat16 |
| License | Apache 2.0 |

---
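## Quick Start

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo")

long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```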
### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
- **Long-document retrieval** (up to 8,192 tokens)

---
## Evaluation Results

Unless noted otherwise (the GLUE column is English), all metrics are reported on
Portuguese benchmarks. Best results are in **bold**; second-best are <ins>underlined</ins>.

### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model | QUATI | mMARCO | Robust04 | **Avg.** |
|-----------------------------|------------------|------------------|------------------|------------------|
| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
| moBERTo-orig-tokenizer | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
| moBERTo-1k | <ins>0.5410</ins> | **0.5169** | <ins>0.4782</ins> | <ins>0.5120</ins> |
| **moBERTo (this model)** | **0.5609** | <ins>0.5147</ins> | **0.5010** | **0.5255** |
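
For reference, nDCG@10 compares the discounted gain of the top ten reranked documents against that of an ideal ordering. The sketch below uses the common linear-gain, log2-discount formulation and derives the ideal ordering from the candidate list itself. The exact variant used by the paper's evaluation tooling is not stated in this card, and the relevance labels are made up for illustration.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded relevance labels, in the order a reranker returned the documents.
reranked = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(f"nDCG@10 = {ndcg_at_k(reranked):.4f}")

# A perfect ordering scores 1.0; a ranking with no relevant documents scores 0.0.
print(ndcg_at_k([3, 2, 2, 1, 0]), ndcg_at_k([0, 0, 0]))
```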
### Long-Context Retrieval (MLDR, nDCG@10)

| Model | 512 | 2,048 | 4,096 | 8,192 |
|-----------------------------|------------------|------------------|------------------|------------------|
| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
| moBERTo-orig-tokenizer-1k | **0.5834** | <ins>0.5909</ins> | **0.6286** | **0.6166** |
| moBERTo-orig-tokenizer | 0.5674 | **0.6025** | 0.5876 | <ins>0.6140</ins> |
| moBERTo-1k | 0.5466 | 0.4791 | 0.5714 | 0.5857 |
| **moBERTo (this model)** | <ins>0.5827</ins> | 0.5606 | <ins>0.5905</ins> | 0.5777 |
### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model | Docs | Educ. | **Avg.** |
|-----------------------------|------------------|------------------|------------------|
| BERT-base | 0.8700 | 0.5690 | 0.7195 |
| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
| Qwen3-0.6B-base | **0.9120** | 0.6289 | 0.7705 |
| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
| moBERTo-orig-tokenizer | 0.8962 | 0.6035 | 0.7499 |
| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
| **moBERTo (this model)** | 0.9039 | <ins>0.6394</ins> | <ins>0.7717</ins> |
| NeoBERT-PT | 0.9030 | **0.6428** | **0.7729** |
| Qwen3-0.6B-PT | <ins>0.9070</ins> | 0.6311 | 0.7691 |
### NLU and NER (F1)

| Model | PLUE-PT | LeNER-Br | GLUE (English) |
|-----------------------------|------------------|------------------|------------------|
| BERT-base | 0.6423 | 0.8500 | <ins>0.7815</ins> |
| BERTimbau-base | 0.6800 | **0.9040** | 0.6772 |
| ModernBERT-base | 0.6420 | 0.8240 | **0.8301** |
| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
| moBERTo-orig-tokenizer | 0.6910 | 0.8587 | 0.7724 |
| moBERTo-1k | <ins>0.6959</ins> | 0.8710 | 0.7128 |
| **moBERTo (this model)** | **0.6980** | 0.8726 | 0.7354 |
| NeoBERT-PT | 0.6842 | <ins>0.8840</ins> | 0.6620 |
| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |

> **Note on GLUE:** As expected from continued pretraining on Portuguese, English
> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301).

---
## Key Findings (from the paper's ablations)

1. **Continued pretraining > training from scratch.** Especially for long-context:
   `moBERTo` achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline
   trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining
   provides representations that transfer effectively even when continued pretraining
   itself uses only 1,024-token sequences.
2. **Tokenizer adaptation helps token-level tasks but disrupts long context.** Moving
   to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops
   by ~11 points without embedding transfer).
3. **SWM embedding transfer mitigates the long-context degradation.** By initializing
   new Portuguese embeddings as combinations of the original subword embeddings, SWM
   recovers most of the long-context performance lost by tokenizer adaptation alone.
4. **Long-context post-training yields the strongest reranker.** `moBERTo`
   (this model) achieves the highest average reranking nDCG@10 (0.5255) and the best
   PLUE-PT score (0.6980).

---
## Training Data

The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

---
## Training Procedure

### Phase 1: Continued pretraining (60B tokens at 1,024 context)

| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 60B (5 epochs over 12B) |
| Max sequence length | 1,024 |
| Batch size | 4,608 |
| Masking rate | 30% |
| Optimizer | StableAdamW |
| Learning rate | 5e-4 |
| Weight decay | 1e-5 |
| Dropout (attn output) | 0.1 |
| Dropout (other) | 0.0 |
| Precision | bfloat16 |
| RoPE base (global attn) | 160,000 |
| RoPE base (local attn) | 10,000 |
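
The two RoPE bases in the table set how slowly the lowest rotary frequency turns. The sketch below computes the standard RoPE inverse frequencies, `base ** (-2i / d)`, and the longest resulting wavelength for each base. The head dimension of 64 is an assumption (a 768-dimensional hidden state split over 12 heads, as in ModernBERT-base) and is not stated in this card; the helper names are illustrative.

```python
import math


def rope_inv_freq(base, head_dim):
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for each pair i."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base, head_dim):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


HEAD_DIM = 64  # assumption: 768 hidden size / 12 attention heads

for name, base in [("global", 160_000), ("local", 10_000)]:
    wavelength = longest_wavelength(base, HEAD_DIM)
    print(f"{name} attention: base={base:,}, longest wavelength ~ {wavelength:,.0f} positions")
```

Under this assumption both wavelengths exceed the 8,192-token window, and the larger global-attention base gives the slower rotation.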
### Phase 2: Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter | Value |
|--------------------------|-----------------------------|
| Training tokens | 10B |
| Max sequence length | 8,192 |
| Batch size | 576 |
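
A quick arithmetic check on the two phases, assuming the batch size counts sequences and every sequence is filled to the maximum length (neither is stated in the tables). Under that reading, both phases process the same number of tokens per optimizer step, and approximate step counts follow from the token budgets.

```python
# Values copied from the Phase 1 and Phase 2 tables.
phases = {
    "phase 1": {"tokens": 60_000_000_000, "seq_len": 1_024, "batch_size": 4_608},
    "phase 2": {"tokens": 10_000_000_000, "seq_len": 8_192, "batch_size": 576},
}

for name, cfg in phases.items():
    # Assumption: batch size is in sequences, each filled to seq_len tokens.
    cfg["tokens_per_step"] = cfg["seq_len"] * cfg["batch_size"]
    cfg["steps"] = cfg["tokens"] // cfg["tokens_per_step"]
    print(f"{name}: {cfg['tokens_per_step']:,} tokens/step, ~{cfg['steps']:,} steps")

# 5 epochs over the ~12B-token corpus gives the 60B-token Phase 1 budget.
epochs = phases["phase 1"]["tokens"] / 12_000_000_000
print(f"phase 1 epochs: {epochs:g}")
```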

---

## Related Models in the moBERTo Family

| Hugging Face Repo | Paper Name | Tokenizer | Long-ctx post-tr. |
|------------------------------------------------|-----------------------------|-----------|-------------------|
| `Tropic-AI/moBERTo-orig-tokenizer` | moBERTo-8k (orig. tok.) | Original | Yes |
| **`Tropic-AI/moBERTo`** *(this)* | **moBERTo-SWM-8k (PT tok.)** | PT (SWM) | **Yes** |

---
## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722},
}
```