How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="Tropic-AI/moBERTo")
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo")

moBERTo

Paper name: This model is referred to as moBERTo-SWM-8k (PT tok.) in the moBERTo paper. It is the best-performing variant of the moBERTo family, achieving the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best PLUE-PT score.

moBERTo is a Portuguese adaptation of ModernBERT, obtained through continued pretraining on a curated 12-billion-token Portuguese corpus (60B training tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.

It combines four adaptation strategies:

  1. Continued pretraining from the original ModernBERT-base checkpoint, preserving the long-context capabilities learned during the original 2T-token English pretraining.
  2. Portuguese tokenizer with vocabulary optimized for Portuguese text.
  3. Subword Matching (SWM) embedding transfer, which initializes each new Portuguese token's embedding as a combination of the original ModernBERT subword embeddings, keeping the model close to its pretrained representation space.
  4. Long-context post-training at 8,192 tokens for an additional 10B tokens.
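Strategy 3 above can be illustrated with a minimal sketch. The card only says each new embedding is "a combination" of the original subword embeddings; the uniform mean used here is an assumption, and `greedy_subwords` is a hypothetical longest-match stand-in for the original ModernBERT tokenizer. The toy vocabularies and vectors are made up for illustration.

```python
def greedy_subwords(token, old_vocab):
    """Hypothetical longest-match segmentation of `token` into old-vocabulary pieces."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            if token[i:j] in old_vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {token!r} at position {i}")
    return pieces


def swm_init(new_vocab, old_vocab, old_emb):
    """Initialize each new token as the mean of its old subword embeddings (assumed weighting)."""
    dim = len(old_emb[0])
    new_emb = [[0.0] * dim for _ in new_vocab]
    for tok, new_id in new_vocab.items():
        rows = [old_emb[old_vocab[p]] for p in greedy_subwords(tok, old_vocab)]
        new_emb[new_id] = [sum(col) / len(rows) for col in zip(*rows)]
    return new_emb


# Toy data: 5 old tokens with 2-dimensional embeddings, 2 new Portuguese tokens.
old_vocab = {"c": 0, "a": 1, "s": 2, "ca": 3, "sa": 4}
old_emb = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
new_vocab = {"casa": 0, "sa": 1}
new_emb = swm_init(new_vocab, old_vocab, old_emb)
print(new_emb)
```

In this toy case "casa" is split into "ca" + "sa", so its new vector sits between two vectors the pretrained model already knows, which is the sense in which the transfer keeps the model close to its original representation space.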

The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of 8,192 tokens.
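To make the RoPE component more concrete, the sketch below computes the rotation frequencies for the two RoPE bases listed in the training procedure (160,000 for global attention, 10,000 for local attention). The head dimension of 64 is an assumption (768 hidden units over 12 heads, as in ModernBERT-base); check the model config before relying on it.

```python
import math


def rope_inv_freq(head_dim, base):
    """Inverse frequencies base^(-2i/head_dim) for i = 0 .. head_dim/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Wavelength, in token positions, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 64  # assumption: not stated in this card

for name, base in [("global", 160_000), ("local", 10_000)]:
    print(name, round(longest_wavelength(HEAD_DIM, base)))
```

A larger base stretches the slowest rotations over more positions, which is the usual motivation for giving the global-attention layers a higher base than the local ones.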


Model Details

| Attribute | Value |
| --- | --- |
| Architecture | ModernBERT (encoder-only) |
| Base checkpoint | answerdotai/ModernBERT-base |
| Parameters | ~150M |
| Max context length | 8,192 tokens |
| Tokenizer | Portuguese (custom vocabulary) |
| Embedding init | Subword Matching (SWM) transfer |
| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
| Long-context post-training | 10B tokens at 8,192-token context |
| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework | Composer |
| Precision | bfloat16 |
| License | Apache 2.0 |

Quick Start

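import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo")
long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)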

Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

  • Cross-encoder reranking (information retrieval)
  • Document classification
  • Named entity recognition
  • Natural language inference / semantic textual similarity
  • Long-document retrieval (up to 8,192 tokens)
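As a rough illustration of the first use case, the sketch below shows how a cross-encoder reranker could be wired up with the Transformers library. It is not the authors' training or evaluation code. The helper names (`build_reranker`, `score_pairs`, `rerank`) and the `max_length=512` default are hypothetical, and the single-logit classification head is newly initialized, so the scores are meaningless until the model has been fine-tuned on relevance data.

```python
def build_reranker(model_id="Tropic-AI/moBERTo"):
    """Load the encoder with a single-logit head (head weights are newly initialized)."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
    return tokenizer, model


def score_pairs(tokenizer, model, query, passages, max_length=512):
    """Encode each (query, passage) pair jointly and return one relevance logit per pair."""
    import torch

    batch = tokenizer([query] * len(passages), passages, padding=True,
                      truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).logits.squeeze(-1).tolist()


def rerank(passages, scores):
    """Sort passages by descending score."""
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked]
```

After fine-tuning, a typical call would be `rerank(passages, score_pairs(tokenizer, model, query, passages))`, with `max_length` raised toward 8,192 for long-document retrieval.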

Evaluation Results

All metrics are reported on Portuguese benchmarks. Best results are in bold; second-best are underlined.

Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model | QUATI | mMARCO | Robust04 | Avg. |
| --- | --- | --- | --- | --- |
| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
| moBERTo-orig-tokenizer | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
| moBERTo-1k | 0.5410 | 0.5169 | 0.4782 | 0.5120 |
| moBERTo (this model) | 0.5609 | 0.5147 | 0.5010 | 0.5255 |
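For readers unfamiliar with the metric in the table above, here is a minimal sketch of nDCG@10 for a single query. It assumes linear gain and computes the ideal ranking from the same candidate list, which fits a reranking setup where the candidate pool is fixed. The evaluation tooling used in the paper is not described in this card and may differ in these details.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the candidates, in the order the reranker returned them.
print(ndcg_at_k([0, 1, 0, 0]))
```

The reported figures are averages of this per-query value over all queries in each benchmark.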

Long-Context Retrieval (MLDR, nDCG@10)

| Model | 512 | 2,048 | 4,096 | 8,192 |
| --- | --- | --- | --- | --- |
| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
| moBERTo-orig-tokenizer-1k | 0.5834 | 0.5909 | 0.6286 | 0.6166 |
| moBERTo-orig-tokenizer | 0.5674 | 0.6025 | 0.5876 | 0.6140 |
| moBERTo-1k | 0.5466 | 0.4791 | 0.5714 | 0.5857 |
| moBERTo (this model) | 0.5827 | 0.5606 | 0.5905 | 0.5777 |

Classification (F1)

  • Docs: document type classification (news, legal, academic, etc.)
  • Educ.: educational content detection

| Model | Docs | Educ. | Avg. |
| --- | --- | --- | --- |
| BERT-base | 0.8700 | 0.5690 | 0.7195 |
| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
| Qwen3-0.6B-base | 0.9120 | 0.6289 | 0.7705 |
| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
| moBERTo-orig-tokenizer | 0.8962 | 0.6035 | 0.7499 |
| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
| moBERTo (this model) | 0.9039 | 0.6394 | 0.7717 |
| NeoBERT-PT | 0.9030 | 0.6428 | 0.7729 |
| Qwen3-0.6B-PT | 0.9070 | 0.6311 | 0.7691 |

NLU and NER (F1)

| Model | PLUE-PT | LeNER-Br | GLUE (English) |
| --- | --- | --- | --- |
| BERT-base | 0.6423 | 0.8500 | 0.7815 |
| BERTimbau-base | 0.6800 | 0.9040 | 0.6772 |
| ModernBERT-base | 0.6420 | 0.8240 | 0.8301 |
| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
| moBERTo-orig-tokenizer | 0.6910 | 0.8587 | 0.7724 |
| moBERTo-1k | 0.6959 | 0.8710 | 0.7128 |
| moBERTo (this model) | 0.6980 | 0.8726 | 0.7354 |
| NeoBERT-PT | 0.6842 | 0.8840 | 0.6620 |
| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |

Note on GLUE: As expected from continued pretraining on Portuguese, English performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301), while moBERTo (this model) scores 0.7354.


Key Findings (from the paper's ablations)

  1. Continued pretraining > training from scratch. Especially for long-context: moBERTo achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining provides representations that transfer effectively even when continued pretraining itself uses only 1,024-token sequences.
  2. Tokenizer adaptation helps token-level tasks but disrupts long context. Moving to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops by ~11 points without embedding transfer).
  3. SWM embedding transfer mitigates the long-context degradation. By initializing new Portuguese embeddings as combinations of the original subword embeddings, SWM recovers most of the long-context performance lost by tokenizer adaptation alone.
  4. Long-context post-training yields the strongest reranker. moBERTo (this model) achieves the highest average reranking nDCG@10 (0.5255) and the best PLUE-PT score (0.6980).

Training Data

The pretraining corpus was curated from the Portuguese subset of FineWeb-2 and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises ~12 billion tokens, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.


Training Procedure

Phase 1 — Continued pretraining (60B tokens at 1,024 context)

| Parameter | Value |
| --- | --- |
| Training tokens | 60B (5 epochs over 12B) |
| Max sequence length | 1,024 |
| Batch size | 4,608 |
| Masking rate | 30% |
| Optimizer | StableAdamW |
| Learning rate | 5e-4 |
| Weight decay | 1e-5 |
| Dropout (attn output) | 0.1 |
| Dropout (other) | 0.0 |
| Precision | bfloat16 |
| RoPE base (global attn) | 160,000 |
| RoPE base (local attn) | 10,000 |
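The 30% masking rate in the table above can be illustrated with a small masked-language-modeling sketch. It assumes every selected token is replaced by the mask token and that unselected positions are excluded from the loss with the label -100, the ignore index conventionally used with PyTorch cross-entropy. Whether the actual recipe also mixes in random or unchanged tokens is not stated in this card, and `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(token_ids, mask_id, mask_rate=0.30, special_ids=(), seed=None):
    """Return (masked_inputs, labels); labels are -100 wherever no prediction is required."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if tok in special_ids:
            continue  # never mask special tokens such as [CLS] or [SEP]
        if rng.random() < mask_rate:
            labels[i] = tok      # the model must recover the original token here
            inputs[i] = mask_id  # assumption: always substitute the mask token
    return inputs, labels


masked, labels = mask_tokens([101, 7, 8, 9, 10, 102], mask_id=4, special_ids={101, 102}, seed=0)
print(masked, labels)
```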

Phase 2 — Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter | Value |
| --- | --- |
| Training tokens | 10B |
| Max sequence length | 8,192 |
| Batch size | 576 |
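A short worked calculation shows how the two phases relate. It assumes the batch size counts sequences, that every sequence is filled to the maximum length, and that "B" means exactly 10^9 tokens. None of these is stated explicitly in this card, so the step counts are estimates.

```python
def tokens_per_batch(batch_size, seq_len):
    """Tokens consumed per optimizer step, assuming fully packed sequences."""
    return batch_size * seq_len


def optimizer_steps(total_tokens, batch_size, seq_len):
    """Number of steps needed to consume `total_tokens` (ceiling division)."""
    per_batch = tokens_per_batch(batch_size, seq_len)
    return (total_tokens + per_batch - 1) // per_batch


phase1_steps = optimizer_steps(60_000_000_000, 4_608, 1_024)
phase2_steps = optimizer_steps(10_000_000_000, 576, 8_192)
print(tokens_per_batch(4_608, 1_024), tokens_per_batch(576, 8_192))
print(phase1_steps, phase2_steps)
```

Under these assumptions both phases process the same number of tokens per step: the sequence length grows by 8x while the batch size shrinks by 8x.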

Related Models in the moBERTo Family

| Hugging Face Repo | Paper Name | Tokenizer | Long-context post-training |
| --- | --- | --- | --- |
| Tropic-AI/moBERTo-orig-tokenizer | moBERTo-8k (orig. tok.) | Original | Yes |
| Tropic-AI/moBERTo (this model) | moBERTo-SWM-8k (PT tok.) | PT (SWM) | Yes |

Citation

@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT}, 
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722}, 
}