moBERTo-orig-tokenizer
Paper name: This model is referred to as moBERTo-8k (orig. tok.) in the moBERTo paper. It is the variant that retains the original ModernBERT tokenizer and adds long-context post-training.
moBERTo-orig-tokenizer is a Portuguese adaptation of
ModernBERT, obtained through
continued pretraining on a curated 12-billion-token Portuguese corpus (60B training
tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.
Unlike the flagship moBERTo variant, this model keeps the original ModernBERT tokenizer rather than adopting a Portuguese-optimized one. Of the two long-context post-trained variants, it is the stronger at long-context retrieval (0.6140 vs. 0.5777 nDCG@10 at 8,192 tokens on MLDR), at the cost of slightly weaker token-level performance (NER and NLU).
The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of 8,192 tokens.
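To make the alternating local–global attention pattern concrete, here is a minimal sketch of a layer schedule. The values used (22 layers, a global-attention layer every third layer) are assumptions taken from the published ModernBERT-base configuration, not from this model card, and `attention_schedule` is a hypothetical helper name.

```python
from typing import List


def attention_schedule(num_layers: int = 22, global_every: int = 3) -> List[str]:
    """Label each encoder layer as using 'global' or 'local' attention.

    Assumption: layer i attends globally when i is a multiple of
    `global_every`; all other layers use sliding-window (local) attention.
    """
    return [
        "global" if layer % global_every == 0 else "local"
        for layer in range(num_layers)
    ]


schedule = attention_schedule()
for layer, kind in enumerate(schedule):
    print(f"layer {layer:2d}: {kind}")
```

Under these assumptions most layers only look at a local window, which is what keeps 8,192-token inputs affordable, while the periodic global layers let information travel across the whole document.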
Model Details
| Attribute | Value |
|---|---|
| Architecture | ModernBERT (encoder-only) |
| Base checkpoint | answerdotai/ModernBERT-base |
| Parameters | ~150M |
| Max context length | 8,192 tokens |
| Tokenizer | Original ModernBERT tokenizer (unchanged) |
| Embedding init | Inherited from ModernBERT-base |
| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
| Long-context post-training | 10B tokens at 8,192-token context |
| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework | Composer |
| Precision | bfloat16 |
| License | Apache 2.0 |
Quick Start
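import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the masked-language-model head from the Hub
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")

long_text = "..."  # long document in Portuguese
# Truncate to the model's 8,192-token context window
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)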
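The call above truncates anything beyond 8,192 tokens. For longer documents, one option is to split the token ids into overlapping windows and encode each window separately. The sketch below works on a plain list of ids; `split_into_windows` and its `stride` parameter are hypothetical names, and the helper does not reserve room for special tokens, so `max_length` may need to be lowered slightly in practice.

```python
from typing import List


def split_into_windows(
    token_ids: List[int], max_length: int = 8192, stride: int = 512
) -> List[List[int]]:
    """Split token ids into windows of at most `max_length` tokens.

    Consecutive windows share `stride` tokens so that text near a window
    boundary is seen with context on both sides.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []

    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return windows


# Toy example with small numbers so the overlap is easy to inspect
example_windows = split_into_windows(list(range(20)), max_length=8, stride=2)
for window in example_windows:
    print(window)
```

Hugging Face fast tokenizers can produce similar overlapping chunks directly through `return_overflowing_tokens=True` together with `stride`, which also handles special tokens.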
Recommended for downstream tasks
This model is best used as a backbone for fine-tuning on:
- Long-document retrieval (its strongest use case)
- Cross-encoder reranking (information retrieval)
- Document classification
- Named entity recognition
- Natural language inference / semantic textual similarity
Evaluation Results
All metrics are reported on Portuguese benchmarks, except the GLUE (English) column in the NLU and NER table.
Information Retrieval (Reranking, nDCG@10)
Cross-encoder reranking, fine-tuned on mMARCO-PT triples.
| Model | QUATI | mMARCO | Robust04 | Avg. |
|---|---|---|---|---|
| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
| moBERTo-orig-tokenizer (this model) | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
| moBERTo-1k | 0.5410 | 0.5169 | 0.4782 | 0.5120 |
| moBERTo | 0.5609 | 0.5147 | 0.5010 | 0.5255 |
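For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the linear-gain formulation (relevance divided by a log2 rank discount) and that the input list contains every judged document for the query; the evaluation code behind the numbers above may differ in these details. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevance labels of documents in the order the reranker returned them
score = ndcg_at_k([0, 1])
print(f"nDCG@10 = {score:.4f}")
```

A benchmark score is then the mean of this per-query value over all queries.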
Long-Context Retrieval (MLDR, nDCG@10)
| Model | 512 | 2,048 | 4,096 | 8,192 |
|---|---|---|---|---|
| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
| moBERTo-orig-tokenizer-1k | 0.5834 | 0.5909 | 0.6286 | 0.6166 |
| moBERTo-orig-tokenizer (this model) | 0.5674 | 0.6025 | 0.5876 | 0.6140 |
| moBERTo-1k | 0.5466 | 0.4791 | 0.5714 | 0.5857 |
| moBERTo | 0.5827 | 0.5606 | 0.5905 | 0.5777 |
Classification (F1)
- Docs: document type classification (news, legal, academic, etc.)
- Educ.: educational content detection
| Model | Docs | Educ. | Avg. |
|---|---|---|---|
| BERT-base | 0.8700 | 0.5690 | 0.7195 |
| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
| Qwen3-0.6B-base | 0.9120 | 0.6289 | 0.7705 |
| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
| moBERTo-orig-tokenizer (this model) | 0.8962 | 0.6035 | 0.7499 |
| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
| moBERTo | 0.9039 | 0.6394 | 0.7717 |
| NeoBERT-PT | 0.9030 | 0.6428 | 0.7729 |
| Qwen3-0.6B-PT | 0.9070 | 0.6311 | 0.7691 |
NLU and NER (F1)
| Model | PLUE-PT | LeNER-Br | GLUE (English) |
|---|---|---|---|
| BERT-base | 0.6423 | 0.8500 | 0.7815 |
| BERTimbau-base | 0.6800 | 0.9040 | 0.6772 |
| ModernBERT-base | 0.6420 | 0.8240 | 0.8301 |
| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
| moBERTo-orig-tokenizer (this model) | 0.6910 | 0.8587 | 0.7724 |
| moBERTo-1k | 0.6959 | 0.8710 | 0.7128 |
| moBERTo | 0.6980 | 0.8726 | 0.7354 |
| NeoBERT-PT | 0.6842 | 0.8840 | 0.6620 |
| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |
Training Data
The pretraining corpus was curated from the Portuguese subset of FineWeb-2 and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises ~12 billion tokens, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese.
The training data has been publicly released alongside the model.
Training Procedure
Phase 1 — Continued pretraining (60B tokens at 1,024 context)
| Parameter | Value |
|---|---|
| Training tokens | 60B (5 epochs over 12B) |
| Max sequence length | 1,024 |
| Batch size | 4,608 |
| Masking rate | 30% |
| Optimizer | StableAdamW |
| Learning rate | 5e-4 |
| Weight decay | 1e-5 |
| Dropout (attn output) | 0.1 |
| Dropout (other) | 0.0 |
| Precision | bfloat16 |
| RoPE base (global attn) | 160,000 |
| RoPE base (local attn) | 10,000 |
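The two RoPE bases in the table control how quickly the rotary angles change with position. As a rough illustration, the sketch below computes the standard RoPE inverse frequencies, `base ** (-2i / d)`, and the longest resulting wavelength for each base. The head dimension of 64 is an assumption (ModernBERT-base's 768 hidden units split over 12 heads), and the helper names are hypothetical.

```python
import math
from typing import List


def rope_inverse_frequencies(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one turn."""
    slowest = rope_inverse_frequencies(head_dim, base)[-1]
    return 2 * math.pi / slowest


HEAD_DIM = 64  # assumption: 768 hidden size / 12 attention heads

for name, base in [("global", 160_000), ("local", 10_000)]:
    wavelength = longest_wavelength(HEAD_DIM, base)
    print(f"{name:6s} attention, base {base:>7,}: "
          f"longest wavelength ~ {wavelength:,.0f} positions")
```

The larger base used for global attention stretches the slowest rotations over many more positions, which is the usual motivation for raising the base when attention has to span long contexts.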
Phase 2 — Long-context post-training (10B tokens at 8,192 context)
Same hyperparameters as Phase 1, except:
| Parameter | Value |
|---|---|
| Training tokens | 10B |
| Max sequence length | 8,192 |
| Batch size | 576 |
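The batch sizes of the two phases (4,608 and 576) look very different, but the arithmetic below suggests they were chosen to keep the number of tokens per optimizer step constant. This is a back-of-the-envelope sketch that assumes the batch size counts sequences and that every sequence is filled to the maximum length; the resulting step counts are therefore approximate.

```python
def tokens_per_step(batch_size: int, max_seq_len: int) -> int:
    """Tokens processed per optimizer step if every sequence is full."""
    return batch_size * max_seq_len


# Values from the Phase 1 and Phase 2 tables
phases = {
    "Phase 1": {"training_tokens": 60_000_000_000, "batch_size": 4_608, "max_seq_len": 1_024},
    "Phase 2": {"training_tokens": 10_000_000_000, "batch_size": 576, "max_seq_len": 8_192},
}

approx_steps = {}
for name, cfg in phases.items():
    per_step = tokens_per_step(cfg["batch_size"], cfg["max_seq_len"])
    approx_steps[name] = cfg["training_tokens"] / per_step
    print(f"{name}: {per_step:,} tokens/step, ~{approx_steps[name]:,.0f} steps")
```

Both phases come out at the same token budget per step, so the shared learning rate and optimizer settings operate on comparably sized batches.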
Related Models in the moBERTo Family
| Hugging Face Repo | Paper Name | Tokenizer | Long-context post-training |
|---|---|---|---|
| Tropic-AI/moBERTo-orig-tokenizer (this) | moBERTo-8k (orig. tok.) | Original | Yes |
| Tropic-AI/moBERTo | moBERTo-SWM-8k (PT tok.) | PT (SWM) | Yes |
Citation
@misc{laitz2026mobertomodernencoderportuguese,
title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
year={2026},
eprint={2606.22722},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.22722},
}