moBERTo-orig-tokenizer

Paper name: This model is referred to as moBERTo-8k (orig. tok.) in the moBERTo paper. It is the variant that retains the original ModernBERT tokenizer and includes the long-context post-training phase.

moBERTo-orig-tokenizer is a Portuguese adaptation of ModernBERT, obtained through continued pretraining on a curated 12-billion-token Portuguese corpus (60B training tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.

Unlike the flagship moBERTo variant, this model keeps the original ModernBERT tokenizer rather than adopting a Portuguese-optimized one. Its main strength is long-context retrieval: on MLDR at 8,192 tokens it scores 0.6140 nDCG@10, ahead of the flagship moBERTo (0.5777) and the best result among the long-context post-trained models in the family. The trade-off is slightly weaker NLU and NER performance (PLUE-PT and LeNER-Br) than the flagship.

The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of 8,192 tokens.
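To make the alternating local–global attention pattern concrete, here is a minimal sketch of which key positions a query may attend to in each layer type. The window size (128 tokens) and the "every third layer is global" schedule are assumptions taken from the ModernBERT description, not values stated in this card, and `attention_mask` is a hypothetical helper, not part of any library.

```python
def attention_mask(seq_len, layer_idx, window=128, global_every=3):
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j.

    Assumed schedule: layers whose index is a multiple of `global_every` use
    full (global) attention, all other layers use a symmetric sliding window.
    """
    if layer_idx % global_every == 0:
        # Global layer: every token sees every other token.
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    # Local layer: each token sees at most `half` neighbours on each side.
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


local = attention_mask(seq_len=8, layer_idx=1, window=4)
global_ = attention_mask(seq_len=8, layer_idx=0, window=4)
print(sum(local[0]), sum(global_[0]))  # keys visible to the first token in each layer type
```

Because the encoder is bidirectional, the window extends to both sides of each token. Local layers keep cost roughly linear in sequence length, while the periodic global layers let information travel across the full 8,192-token context.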

Model Details

Attribute	Value
Architecture	ModernBERT (encoder-only)
Base checkpoint	`answerdotai/ModernBERT-base`
Parameters	~150M
Max context length	8,192 tokens
Tokenizer	Original ModernBERT tokenizer (unchanged)
Embedding init	Inherited from ModernBERT-base
Pretraining tokens	60B (5 epochs over 12B-token corpus)
Long-context post-training	10B tokens at 8,192-token context
Training corpus	FineWeb-2 (PT subset) filtered with ClassiCC-PT
Framework	Composer
Precision	bfloat16
License	Apache 2.0

Quick Start

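import torch
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
model = AutoModel.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state: (1, seq_len, hidden_size)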
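The call above truncates anything beyond 8,192 tokens. For longer documents, one common option is to split the token ids into overlapping windows and encode each window separately. The sketch below shows the windowing logic only, on plain lists of ids; `sliding_windows` and the default `stride` value are hypothetical choices for illustration, not something this card prescribes.

```python
def sliding_windows(token_ids, max_length=8192, stride=512):
    """Split token_ids into windows of at most max_length ids.

    Consecutive windows overlap by `stride` ids so that text cut at a
    window boundary still appears with context in the next window.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return windows


ids = list(range(20))  # stand-in for real token ids
for window in sliding_windows(ids, max_length=8, stride=2):
    print(window[0], window[-1], len(window))
```

In practice the window size should leave room for the special tokens the tokenizer adds. Hugging Face tokenizers can also produce such windows directly via `return_overflowing_tokens=True` together with `stride`.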

Recommended Downstream Tasks

This model is best used as a backbone for fine-tuning on:

Long-document retrieval (its strongest use case)
Cross-encoder reranking (information retrieval)
Document classification
Named entity recognition
Natural language inference / semantic textual similarity

Evaluation Results

All metrics are reported on Portuguese benchmarks, except the GLUE (English) column in the NLU and NER table. Best results are in bold; second-best are underlined.

Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

Model	QUATI	mMARCO	Robust04	Avg.
BERT-base	0.2846	0.4050	0.2389	0.3095
BERTimbau-base	0.4870	0.5005	0.4138	0.4671
ModernBERT-base	0.3779	0.4799	0.2988	0.3855
NeoBERT-base	0.4000	0.4698	0.3117	0.3938
Qwen3-0.6B-base	0.4248	0.5065	0.2994	0.4102
moBERTo-orig-tokenizer-1k	0.5383	0.5109	0.4510	0.5001
moBERTo-orig-tokenizer (this model)	0.5231	0.5089	0.4516	0.4945
moBERTo-1k	0.5410	0.5169	0.4782	0.5120
moBERTo	0.5609	0.5147	0.5010	0.5255
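For readers unfamiliar with the metric used in the retrieval tables, the following is a minimal sketch of nDCG@10 for a single query. It assumes linear gain (the relevance label itself) and computes the ideal ranking from the judged candidates passed in; the card does not say which gain function or evaluation toolkit was used, so treat this as an illustration of the metric rather than the paper's exact evaluation code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of the candidates, in the order the reranker returned them.
print(ndcg_at_k([1, 0, 0]))  # relevant passage ranked first
print(ndcg_at_k([0, 1, 0]))  # relevant passage ranked second
```

The reported score is typically the mean of this value over all queries in the benchmark.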

Long-Context Retrieval (MLDR, nDCG@10)

Model	512 tokens	2,048 tokens	4,096 tokens	8,192 tokens
ModernBERT-base	0.4054	0.4206	0.3015	0.2867
NeoBERT-base	0.4746	0.5149	0.4676	--
Qwen3-0.6B-base	0.3560	0.4023	0.4241	0.5351
moBERTo-orig-tokenizer-1k	0.5834	0.5909	0.6286	0.6166
moBERTo-orig-tokenizer (this model)	0.5674	0.6025	0.5876	0.6140
moBERTo-1k	0.5466	0.4791	0.5714	0.5857
moBERTo	0.5827	0.5606	0.5905	0.5777

Classification (F1)

Docs: document type classification (news, legal, academic, etc.)
Educ.: educational content detection

Model	Docs	Educ.	Avg.
BERT-base	0.8700	0.5690	0.7195
BERTimbau-base	0.8978	0.6382	0.7680
ModernBERT-base	0.8416	0.5730	0.7073
NeoBERT-base	0.8970	0.6266	0.7618
Qwen3-0.6B-base	0.9120	0.6289	0.7705
moBERTo-orig-tokenizer-1k	0.8942	0.6070	0.7506
moBERTo-orig-tokenizer (this model)	0.8962	0.6035	0.7499
moBERTo-1k	0.9024	0.6281	0.7653
moBERTo	0.9039	0.6394	0.7717
NeoBERT-PT	0.9030	0.6428	0.7729
Qwen3-0.6B-PT	0.9070	0.6311	0.7691

NLU and NER (F1)

Model	PLUE-PT	LeNER-Br	GLUE (English)
BERT-base	0.6423	0.8500	0.7815
BERTimbau-base	0.6800	0.9040	0.6772
ModernBERT-base	0.6420	0.8240	0.8301
NeoBERT-base	0.6654	0.8590	0.7430
Qwen3-0.6B-base	0.6343	0.7020	0.7260
moBERTo-orig-tokenizer-1k	0.6849	0.8371	0.7705
moBERTo-orig-tokenizer (this model)	0.6910	0.8587	0.7724
moBERTo-1k	0.6959	0.8710	0.7128
moBERTo	0.6980	0.8726	0.7354
NeoBERT-PT	0.6842	0.8840	0.6620
Qwen3-0.6B-PT	0.6632	0.7100	0.7050

Training Data

The pretraining corpus was curated from the Portuguese subset of FineWeb-2 and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises ~12 billion tokens, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

Training Procedure

Phase 1 — Continued pretraining (60B tokens at 1,024 context)

Parameter	Value
Training tokens	60B (5 epochs over 12B)
Max sequence length	1,024
Batch size	4,608
Masking rate	30%
Optimizer	StableAdamW
Learning rate	5e-4
Weight decay	1e-5
Dropout (attn output)	0.1
Dropout (other)	0.0
Precision	bfloat16
RoPE base (global attn)	160,000
RoPE base (local attn)	10,000
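The two RoPE bases in the table control how slowly the lowest-frequency rotary pairs turn. As a rough illustration, the sketch below computes the wavelength of each frequency pair for both bases using the standard RoPE formula. The head dimension of 64 is an assumption (it is not listed in this card), and `rope_wavelengths` is a hypothetical helper.

```python
import math


def rope_wavelengths(base, head_dim=64):
    """Wavelength, in token positions, of each rotary frequency pair.

    Pair i rotates by base ** (-2 * i / head_dim) radians per position,
    so one full turn takes 2 * pi * base ** (2 * i / head_dim) positions.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


global_wl = rope_wavelengths(160_000)  # base used by global-attention layers
local_wl = rope_wavelengths(10_000)    # base used by local-attention layers
print(f"longest global wavelength: {global_wl[-1]:,.0f} positions")
print(f"longest local wavelength:  {local_wl[-1]:,.0f} positions")
```

A larger base stretches the low-frequency wavelengths, which gives global layers positional signals that vary more gradually over long distances. Local layers only see nearby tokens, so the smaller base is sufficient for them.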

Phase 2 — Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

Parameter	Value
Training tokens	10B
Max sequence length	8,192
Batch size	576
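The batch sizes of the two phases look very different, but a quick calculation shows they correspond to the same number of tokens per batch. The sketch below assumes that batch size is counted in sequences and that every sequence is filled to the maximum length, neither of which the card states explicitly, so the step counts are estimates only.

```python
import math

# Values copied from the Phase 1 and Phase 2 hyperparameter tables.
PHASES = {
    "phase1": {"tokens": 60_000_000_000, "seq_len": 1024, "batch_size": 4608},
    "phase2": {"tokens": 10_000_000_000, "seq_len": 8192, "batch_size": 576},
}


def tokens_per_batch(phase):
    """Tokens in one batch, assuming batch_size counts full-length sequences."""
    return phase["seq_len"] * phase["batch_size"]


def approx_steps(phase):
    """Estimated optimizer steps needed to consume the phase's token budget."""
    return math.ceil(phase["tokens"] / tokens_per_batch(phase))


for name, phase in PHASES.items():
    print(name, tokens_per_batch(phase), approx_steps(phase))
```

Under these assumptions both phases process about 4.7M tokens per batch, so the effective batch size in tokens is unchanged when the sequence length grows from 1,024 to 8,192.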

Related Models in the moBERTo Family

Hugging Face Repo	Paper Name	Tokenizer	Long-context post-training
`Tropic-AI/moBERTo-orig-tokenizer` (this)	moBERTo-8k (orig. tok.)	Original	Yes
`Tropic-AI/moBERTo`	moBERTo-SWM-8k (PT tok.)	PT (SWM)	Yes

Citation

@misc{laitz2026mobertomodernencoderportuguese,
      title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT}, 
      author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
      year={2026},
      eprint={2606.22722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.22722}, 
}