moBERTo

Paper name: This model is referred to as moBERTo-SWM-8k (PT tok.) in the moBERTo paper. It is the best-performing variant of the moBERTo family, achieving the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best PLUE-PT score.

moBERTo is a Portuguese adaptation of ModernBERT, obtained through continued pretraining on a curated 12-billion-token Portuguese corpus (60B training tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.

It combines four adaptation strategies:

Continued pretraining from the original ModernBERT-base checkpoint, preserving the long-context capabilities learned during the original 2T-token English pretraining.
Portuguese tokenizer with vocabulary optimized for Portuguese text.
Subword Matching (SWM) embedding transfer, which initializes each new Portuguese token's embedding as a combination of the original ModernBERT subword embeddings, keeping the model close to its pretrained representation space.
Long-context post-training at 8,192 tokens for an additional 10B tokens.
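
The SWM initialization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the "combination" is a plain mean of the old subword embeddings, uses a hypothetical greedy longest-match splitter (`greedy_subwords`) in place of the original ModernBERT tokenizer, and falls back to the mean of all old embeddings when a token cannot be split. The toy vocabulary and vectors are made up for the example.

```python
from typing import Dict, List

Vector = List[float]


def greedy_subwords(token: str, old_vocab: Dict[str, int]) -> List[int]:
    """Split `token` into old-vocabulary pieces by greedy longest match.

    Hypothetical stand-in for the original tokenizer. Returns [] when
    some character cannot be covered by any old-vocabulary piece.
    """
    ids: List[int] = []
    start = 0
    while start < len(token):
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if piece in old_vocab:
                ids.append(old_vocab[piece])
                start = end
                break
        else:
            return []  # no piece matched at this position
    return ids


def swm_init(new_tokens: List[str],
             old_vocab: Dict[str, int],
             old_emb: List[Vector]) -> List[Vector]:
    """Initialize one embedding per new token from old subword embeddings.

    Assumption: the combination is an unweighted mean.
    """
    dim = len(old_emb[0])
    mean_all = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    new_emb: List[Vector] = []
    for tok in new_tokens:
        ids = greedy_subwords(tok, old_vocab)
        if not ids:
            new_emb.append(list(mean_all))  # fallback for unmatched tokens
        else:
            new_emb.append(
                [sum(old_emb[i][d] for i in ids) / len(ids) for d in range(dim)]
            )
    return new_emb


# Toy example: "preço" is split into the old pieces "pre" + "ço".
old_vocab = {"pre": 0, "ço": 1, "a": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_emb = swm_init(["preço", "a", "xyz"], old_vocab, old_emb)
```

Because every new vector is built from vectors the pretrained model already knows, the new embedding table starts inside the original representation space rather than at a random point.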

The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of 8,192 tokens.
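
To make the alternating local–global pattern concrete, here is a small sketch of which key positions a query may attend to in each layer. The specific numbers are assumptions taken from the published ModernBERT-base configuration rather than from this card: every third layer is global, and the remaining layers use a 128-token sliding window (64 tokens on each side). The function names are illustrative.

```python
def is_global_layer(layer_idx: int, every_n: int = 3) -> bool:
    """Assumed pattern: layers 0, 3, 6, ... use global attention."""
    return layer_idx % every_n == 0


def can_attend(query_pos: int, key_pos: int, layer_idx: int,
               window: int = 128) -> bool:
    """Return True if the query position may attend to the key position.

    Global layers see the whole sequence; local layers only see keys
    within window // 2 positions on either side of the query.
    """
    if is_global_layer(layer_idx):
        return True
    return abs(query_pos - key_pos) <= window // 2
```

Under these assumptions, a token at position 0 can reach position 5,000 directly only in a global layer; in local layers, information travels at most 64 positions per layer.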

Model Details

Attribute	Value
Architecture	ModernBERT (encoder-only)
Base checkpoint	`answerdotai/ModernBERT-base`
Parameters	~150M
Max context length	8,192 tokens
Tokenizer	Portuguese (custom vocabulary)
Embedding init	Subword Matching (SWM) transfer
Pretraining tokens	60B (5 epochs over 12B-token corpus)
Long-context post-training	10B tokens at 8,192-token context
Training corpus	FineWeb-2 (PT subset) filtered with ClassiCC-PT
Framework	Composer
Precision	bfloat16
License	Apache 2.0

Quick Start

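import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModel.from_pretrained("Tropic-AI/moBERTo")

long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)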

Recommended Downstream Tasks

This model is best used as a backbone for fine-tuning on:

Cross-encoder reranking (information retrieval)
Document classification
Named entity recognition
Natural language inference / semantic textual similarity
Long-document retrieval (up to 8,192 tokens)

Evaluation Results

All metrics are reported on Portuguese benchmarks, with the exception of the GLUE (English) column in the NLU and NER table.

Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

Model	QUATI	mMARCO	Robust04	Avg.
BERT-base	0.2846	0.4050	0.2389	0.3095
BERTimbau-base	0.4870	0.5005	0.4138	0.4671
ModernBERT-base	0.3779	0.4799	0.2988	0.3855
NeoBERT-base	0.4000	0.4698	0.3117	0.3938
Qwen3-0.6B-base	0.4248	0.5065	0.2994	0.4102
moBERTo-orig-tokenizer-1k	0.5383	0.5109	0.4510	0.5001
moBERTo-orig-tokenizer	0.5231	0.5089	0.4516	0.4945
moBERTo-1k	0.5410	0.5169	0.4782	0.5120
moBERTo (this model)	0.5609	0.5147	0.5010	0.5255
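
For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It assumes linear gain (which coincides with the exponential-gain variant for binary relevance labels) and that the ranked list contains every judged document for the query. The evaluation toolkit used in the paper is not stated here and may differ in such details.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# A reranker that places the only relevant document second
# instead of first scores 1 / log2(3), about 0.63.
score = ndcg_at_k([0, 1, 0])
```

The per-benchmark numbers in the table are averages of this per-query score, and the Avg. column is the mean of the three benchmark scores.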

Long-Context Retrieval (MLDR, nDCG@10)

Model	512	2,048	4,096	8,192
ModernBERT-base	0.4054	0.4206	0.3015	0.2867
NeoBERT-base	0.4746	0.5149	0.4676	--
Qwen3-0.6B-base	0.3560	0.4023	0.4241	0.5351
moBERTo-orig-tokenizer-1k	0.5834	0.5909	0.6286	0.6166
moBERTo-orig-tokenizer	0.5674	0.6025	0.5876	0.6140
moBERTo-1k	0.5466	0.4791	0.5714	0.5857
moBERTo (this model)	0.5827	0.5606	0.5905	0.5777

Classification (F1)

Docs: document type classification (news, legal, academic, etc.)
Educ.: educational content detection

Model	Docs	Educ.	Avg.
BERT-base	0.8700	0.5690	0.7195
BERTimbau-base	0.8978	0.6382	0.7680
ModernBERT-base	0.8416	0.5730	0.7073
NeoBERT-base	0.8970	0.6266	0.7618
Qwen3-0.6B-base	0.9120	0.6289	0.7705
moBERTo-orig-tokenizer-1k	0.8942	0.6070	0.7506
moBERTo-orig-tokenizer	0.8962	0.6035	0.7499
moBERTo-1k	0.9024	0.6281	0.7653
moBERTo (this model)	0.9039	0.6394	0.7717
NeoBERT-PT	0.9030	0.6428	0.7729
Qwen3-0.6B-PT	0.9070	0.6311	0.7691

NLU and NER (F1)

Model	PLUE-PT	LeNER-Br	GLUE (English)
BERT-base	0.6423	0.8500	0.7815
BERTimbau-base	0.6800	0.9040	0.6772
ModernBERT-base	0.6420	0.8240	0.8301
NeoBERT-base	0.6654	0.8590	0.7430
Qwen3-0.6B-base	0.6343	0.7020	0.7260
moBERTo-orig-tokenizer-1k	0.6849	0.8371	0.7705
moBERTo-orig-tokenizer	0.6910	0.8587	0.7724
moBERTo-1k	0.6959	0.8710	0.7128
moBERTo (this model)	0.6980	0.8726	0.7354
NeoBERT-PT	0.6842	0.8840	0.6620
Qwen3-0.6B-PT	0.6632	0.7100	0.7050

Note on GLUE: As expected from continued pretraining on Portuguese, English performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301), while this model scores 0.7354.

Key Findings (from the paper's ablations)

Continued pretraining beats training from scratch, especially for long context. moBERTo achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline trained on the same Portuguese training budget. The original 2T-token ModernBERT pretraining provides representations that transfer effectively even though the continued pretraining itself uses only 1,024-token sequences.
Tokenizer adaptation helps token-level tasks but disrupts long context. Moving to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops by ~11 points without embedding transfer).
SWM embedding transfer mitigates the long-context degradation. By initializing new Portuguese embeddings as combinations of the original subword embeddings, SWM recovers most of the long-context performance lost by tokenizer adaptation alone.
Long-context post-training yields the strongest reranker. moBERTo (this model) achieves the highest average reranking nDCG@10 (0.5255) and the best PLUE-PT score (0.6980).

Training Data

The pretraining corpus was curated from the Portuguese subset of FineWeb-2 and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises ~12 billion tokens, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

Training Procedure

Phase 1 — Continued pretraining (60B tokens at 1,024 context)

Parameter	Value
Training tokens	60B (5 epochs over 12B)
Max sequence length	1,024
Batch size	4,608
Masking rate	30%
Optimizer	StableAdamW
Learning rate	5e-4
Weight decay	1e-5
Dropout (attn output)	0.1
Dropout (other)	0.0
Precision	bfloat16
RoPE base (global attn)	160,000
RoPE base (local attn)	10,000
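
The two RoPE bases in the table set how slowly the rotary angles change with position. The sketch below uses the standard RoPE inverse-frequency formula, base^(-2i/d). The head dimension of 64 is an assumption (768 hidden size over 12 heads, as in ModernBERT-base) and is not stated in this card.

```python
import math
from typing import List


def rope_inv_freq(base: float, head_dim: int = 64) -> List[float]:
    """Inverse frequencies for each rotated pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base: float, head_dim: int = 64) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


local_wavelength = longest_wavelength(10_000)    # local-attention layers
global_wavelength = longest_wavelength(160_000)  # global-attention layers
```

A larger base stretches the wavelengths, so under these assumptions the global layers have rotary components that vary slowly across the full 8,192-token window, while the local layers only need to resolve positions inside their short window.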

Phase 2 — Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

Parameter	Value
Training tokens	10B
Max sequence length	8,192
Batch size	576
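
A quick arithmetic check shows that the two phases keep the number of tokens per optimizer step constant. It assumes that batch size counts sequences and that every sequence is packed to the maximum length; neither is stated explicitly above. The step counts are derived from the table values and are not reported figures.

```python
def tokens_per_batch(batch_size: int, seq_len: int) -> int:
    """Tokens seen per optimizer step, assuming fully packed sequences."""
    return batch_size * seq_len


def approx_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Whole optimizer steps needed to consume `total_tokens`."""
    return total_tokens // tokens_per_batch(batch_size, seq_len)


phase1_tokens = tokens_per_batch(4_608, 1_024)  # 4,718,592
phase2_tokens = tokens_per_batch(576, 8_192)    # 4,718,592

phase1_steps = approx_steps(60_000_000_000, 4_608, 1_024)  # 12,715
phase2_steps = approx_steps(10_000_000_000, 576, 8_192)    # 2,119

epochs = 60_000_000_000 // 12_000_000_000  # 5 passes over the 12B-token corpus
```

Sequences are 8 times longer in Phase 2 and the batch holds 8 times fewer of them, so each step still covers about 4.7M tokens.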

Related Models in the moBERTo Family

Hugging Face Repo	Paper Name	Tokenizer	Long-context post-training
`Tropic-AI/moBERTo-orig-tokenizer`	moBERTo-8k (orig. tok.)	Original	Yes
`Tropic-AI/moBERTo` (this model)	moBERTo-SWM-8k (PT tok.)	PT (SWM)	Yes

Citation

@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT}, 
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722}, 
}