---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo-orig-tokenizer
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5231
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5089
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.4516
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.6140
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8587
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6910
---
# moBERTo-orig-tokenizer

> **Paper name:** This model is referred to as **`moBERTo-8k (orig. tok.)`** in the
> moBERTo paper. It is the variant that retains the **original ModernBERT tokenizer**,
> followed by long-context post-training.

`moBERTo-orig-tokenizer` is a Portuguese adaptation of
[ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base), obtained through
continued pretraining on a curated 12-billion-token Portuguese corpus (60B training
tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.

Unlike the flagship [`moBERTo`](https://huggingface.co/Tropic-AI/moBERTo) variant,
this model **keeps the original ModernBERT tokenizer** rather than adopting a
Portuguese-optimized one. It is particularly strong on long-context retrieval: on
MLDR it scores 0.6140 nDCG@10 at 8,192 tokens, second in the moBERTo family only to
`moBERTo-orig-tokenizer-1k` (0.6166), and it has the family's best score at
2,048 tokens (0.6025). The trade-off is slightly weaker token-level performance
(NER and NLU) than `moBERTo`.

The model preserves all architectural advances of ModernBERT: rotary positional
embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
with a native context window of **8,192 tokens**.
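
To make the alternating local–global pattern concrete, the sketch below lists which layers would attend globally and which key positions a query token could see in each kind of layer. The layer count (22), the every-third-layer global schedule, and the 128-token local window are assumptions taken from the public ModernBERT-base configuration; this card does not state them, so treat the numbers as illustrative.

```python
# Assumed values from the public ModernBERT-base config (not stated in this card).
NUM_LAYERS = 22
GLOBAL_EVERY_N = 3
LOCAL_WINDOW = 128  # total local window, split evenly around the query token


def attention_kind(layer_idx: int) -> str:
    """Every GLOBAL_EVERY_N-th layer (starting at 0) is global; the rest are local."""
    return "global" if layer_idx % GLOBAL_EVERY_N == 0 else "local"


def visible_positions(layer_idx: int, query_pos: int, seq_len: int) -> tuple:
    """Inclusive (start, end) range of key positions a query token can attend to."""
    if attention_kind(layer_idx) == "global":
        return 0, seq_len - 1
    half = LOCAL_WINDOW // 2
    return max(0, query_pos - half), min(seq_len - 1, query_pos + half)


layout = [attention_kind(i) for i in range(NUM_LAYERS)]
print(layout.count("global"), "global /", layout.count("local"), "local layers")
print("global layer:", visible_positions(0, 4000, 8192))
print("local layer: ", visible_positions(1, 4000, 8192))
```

Under these assumptions only the global layers pay the full quadratic cost over an 8,192-token input, while local layers look at a short neighbourhood around each token.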

---

## Model Details

| Attribute              | Value                                            |
|------------------------|--------------------------------------------------|
| Architecture           | ModernBERT (encoder-only)                        |
| Base checkpoint        | `answerdotai/ModernBERT-base`                    |
| Parameters             | ~150M                                            |
| Max context length     | 8,192 tokens                                     |
| Tokenizer              | **Original ModernBERT tokenizer (unchanged)**    |
| Embedding init         | Inherited from ModernBERT-base                   |
| Pretraining tokens     | 60B (5 epochs over 12B-token corpus)             |
| Long-context post-tr.  | 10B tokens at 8,192-token context                |
| Training corpus        | FineWeb-2 (PT subset) filtered with ClassiCC-PT  |
| Framework              | Composer                                         |
| Precision              | bfloat16                                         |
| License                | Apache 2.0                                       |

---
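
## Quick Start

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and masked-LM weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
model = AutoModelForMaskedLM.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")

long_text = "..."  # long document in Portuguese
# Truncate to the model's 8,192-token context window
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```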

### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Long-document retrieval** (its strongest use case)
- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
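
For the cross-encoder reranking use case, a fine-tuned model scores each (query, document) pair jointly and the candidates are then sorted by that score. The sketch below shows only that outer loop. `score_pair` is a hypothetical stand-in for a fine-tuned cross-encoder built on this backbone, and `toy_overlap_score` is a word-overlap placeholder so the example runs without downloading any weights; neither is part of this repository.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    docs: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair and return the top_k docs, best first."""
    scored = [(doc, score_pair(query, doc)) for doc in docs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / max(len(query_words), 1)


candidates = [
    "O futebol é popular no Brasil",
    "Brasília é a capital do Brasil",
    "Receita de pão de queijo",
]
ranking = rerank("capital do Brasil", candidates, toy_overlap_score)
for doc, score in ranking:
    print(f"{score:.2f}  {doc}")
```

In practice `score_pair` would tokenize the query and document together as one input and return the relevance logit of the fine-tuned model.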

---

## Evaluation Results

All metrics are reported on Portuguese benchmarks. Best results are in **bold**;
second-best are <ins>underlined</ins>.

### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model                                   | QUATI             | mMARCO            | Robust04          | **Avg.**          |
|-----------------------------------------|-------------------|-------------------|-------------------|-------------------|
| BERT-base                               | 0.2846            | 0.4050            | 0.2389            | 0.3095            |
| BERTimbau-base                          | 0.4870            | 0.5005            | 0.4138            | 0.4671            |
| ModernBERT-base                         | 0.3779            | 0.4799            | 0.2988            | 0.3855            |
| NeoBERT-base                            | 0.4000            | 0.4698            | 0.3117            | 0.3938            |
| Qwen3-0.6B-base                         | 0.4248            | 0.5065            | 0.2994            | 0.4102            |
| moBERTo-orig-tokenizer-1k               | 0.5383            | 0.5109            | 0.4510            | 0.5001            |
| **moBERTo-orig-tokenizer (this model)** | 0.5231            | 0.5089            | 0.4516            | 0.4945            |
| moBERTo-1k                              | <ins>0.5410</ins> | **0.5169**        | <ins>0.4782</ins> | <ins>0.5120</ins> |
| moBERTo                                 | **0.5609**        | <ins>0.5147</ins> | **0.5010**        | **0.5255**        |
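
The reranking scores above are nDCG@10 values. As a reminder of what the metric measures, here is a minimal sketch of nDCG@k for a single query. It assumes linear gain (some toolkits use `2**rel - 1` instead) and builds the ideal ranking from the same list of judged documents; the evaluation code behind the table may differ in both respects.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance labels of the documents in the order the reranker returned them.
print(ndcg_at_k([3, 2, 1, 0]))  # already in ideal order
print(ndcg_at_k([0, 1]))        # the only relevant document is ranked second
```

The reported benchmark score is this per-query value averaged over all queries in the test set.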

### Long-Context Retrieval (MLDR, nDCG@10)

| Model                                   | 512               | 2,048             | 4,096             | 8,192             |
|-----------------------------------------|-------------------|-------------------|-------------------|-------------------|
| ModernBERT-base                         | 0.4054            | 0.4206            | 0.3015            | 0.2867            |
| NeoBERT-base                            | 0.4746            | 0.5149            | 0.4676            | --                |
| Qwen3-0.6B-base                         | 0.3560            | 0.4023            | 0.4241            | 0.5351            |
| moBERTo-orig-tokenizer-1k               | **0.5834**        | <ins>0.5909</ins> | **0.6286**        | **0.6166**        |
| **moBERTo-orig-tokenizer (this model)** | 0.5674            | **0.6025**        | 0.5876            | <ins>0.6140</ins> |
| moBERTo-1k                              | 0.5466            | 0.4791            | 0.5714            | 0.5857            |
| moBERTo                                 | <ins>0.5827</ins> | 0.5606            | <ins>0.5905</ins> | 0.5777            |

### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                                   | Docs              | Educ.             | **Avg.**          |
|-----------------------------------------|-------------------|-------------------|-------------------|
| BERT-base                               | 0.8700            | 0.5690            | 0.7195            |
| BERTimbau-base                          | 0.8978            | 0.6382            | 0.7680            |
| ModernBERT-base                         | 0.8416            | 0.5730            | 0.7073            |
| NeoBERT-base                            | 0.8970            | 0.6266            | 0.7618            |
| Qwen3-0.6B-base                         | **0.9120**        | 0.6289            | 0.7705            |
| moBERTo-orig-tokenizer-1k               | 0.8942            | 0.6070            | 0.7506            |
| **moBERTo-orig-tokenizer (this model)** | 0.8962            | 0.6035            | 0.7499            |
| moBERTo-1k                              | 0.9024            | 0.6281            | 0.7653            |
| moBERTo                                 | 0.9039            | <ins>0.6394</ins> | <ins>0.7717</ins> |
| NeoBERT-PT                              | 0.9030            | **0.6428**        | **0.7729**        |
| Qwen3-0.6B-PT                           | <ins>0.9070</ins> | 0.6311            | 0.7691            |

### NLU and NER (F1)

| Model                                   | PLUE-PT           | LeNER-Br          | GLUE (English)    |
|-----------------------------------------|-------------------|-------------------|-------------------|
| BERT-base                               | 0.6423            | 0.8500            | <ins>0.7815</ins> |
| BERTimbau-base                          | 0.6800            | **0.9040**        | 0.6772            |
| ModernBERT-base                         | 0.6420            | 0.8240            | **0.8301**        |
| NeoBERT-base                            | 0.6654            | 0.8590            | 0.7430            |
| Qwen3-0.6B-base                         | 0.6343            | 0.7020            | 0.7260            |
| moBERTo-orig-tokenizer-1k               | 0.6849            | 0.8371            | 0.7705            |
| **moBERTo-orig-tokenizer (this model)** | 0.6910            | 0.8587            | 0.7724            |
| moBERTo-1k                              | <ins>0.6959</ins> | 0.8710            | 0.7128            |
| moBERTo                                 | **0.6980**        | 0.8726            | 0.7354            |
| NeoBERT-PT                              | 0.6842            | <ins>0.8840</ins> | 0.6620            |
| Qwen3-0.6B-PT                           | 0.6632            | 0.7100            | 0.7050            |

## Training Data

The pretraining corpus was curated from the Portuguese subset of
[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
filtered using the educational and STEM classifiers from ClassiCC-PT. The final
corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC,
covering a broad range of domains and topics in Portuguese.

The training data has been publicly released alongside the model.

---

## Training Procedure

### Phase 1: Continued pretraining (60B tokens at 1,024 context)

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 60B (5 epochs over 12B)     |
| Max sequence length      | 1,024                       |
| Batch size               | 4,608                       |
| Masking rate             | 30%                         |
| Optimizer                | StableAdamW                 |
| Learning rate            | 5e-4                        |
| Weight decay             | 1e-5                        |
| Dropout (attn output)    | 0.1                         |
| Dropout (other)          | 0.0                         |
| Precision                | bfloat16                    |
| RoPE base (global attn)  | 160,000                     |
| RoPE base (local attn)   | 10,000                      |
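
The two RoPE bases in the table control how slowly the positional rotations cycle. The sketch below uses the standard RoPE frequency formula, `inv_freq[i] = base ** (-2 * i / head_dim)`, to compare the longest wavelength each base produces. The per-head dimension of 64 is an assumption based on the public ModernBERT-base configuration; this card does not state it.

```python
import math

HEAD_DIM = 64  # assumed per-head dimension for ModernBERT-base (not stated in this card)


def rope_inv_freq(base: float, head_dim: int = HEAD_DIM) -> list:
    """Standard RoPE inverse frequencies, one per pair of head dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(base: float, head_dim: int = HEAD_DIM) -> float:
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]


for name, base in (("global attn", 160_000.0), ("local attn", 10_000.0)):
    print(f"{name}: longest wavelength ~ {longest_wavelength(base):,.0f} positions")
```

A larger base stretches the slow rotations, so positions far apart remain distinguishable. That is consistent with giving the larger base to the global layers, which see the whole 8,192-token sequence, while the local layers only compare nearby tokens.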

### Phase 2: Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 10B                         |
| Max sequence length      | 8,192                       |
| Batch size               | 576                         |
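
A quick arithmetic check relates the two phases. The sketch assumes that the batch size counts sequences and that every sequence is filled to the maximum length; neither is stated explicitly in the tables, so the step counts are rough estimates rather than reported figures.

```python
import math


def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Tokens processed per optimizer step if every sequence is full length."""
    return batch_size * seq_len


def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Steps needed to consume the token budget, rounded up."""
    return math.ceil(total_tokens / tokens_per_step(batch_size, seq_len))


phase1 = dict(total_tokens=60_000_000_000, batch_size=4_608, seq_len=1_024)
phase2 = dict(total_tokens=10_000_000_000, batch_size=576, seq_len=8_192)

for name, cfg in (("Phase 1", phase1), ("Phase 2", phase2)):
    per_step = tokens_per_step(cfg["batch_size"], cfg["seq_len"])
    steps = optimizer_steps(**cfg)
    print(f"{name}: {per_step:,} tokens/step, ~{steps:,} steps")
```

Under these assumptions both phases process 4,718,592 tokens per step: the batch shrinks by 8x exactly as the sequence length grows by 8x, so the per-step token budget stays constant.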

## Related Models in the moBERTo Family

| Hugging Face Repo                                | Paper Name                  | Tokenizer | Long-ctx post-tr. |
|--------------------------------------------------|-----------------------------|-----------|-------------------|
| **`Tropic-AI/moBERTo-orig-tokenizer` *(this)***  | **moBERTo-8k (orig. tok.)** | Original  | **Yes**           |
| `Tropic-AI/moBERTo`                              | moBERTo-SWM-8k (PT tok.)    | PT (SWM)  | Yes               |

---

## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
  title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
  author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
  year={2026},
  eprint={2606.22722},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.22722},
}
```