---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo-orig-tokenizer
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5231
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5089
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.4516
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.6140
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8587
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6910
---

# moBERTo-orig-tokenizer

> **Paper name:** This model is referred to as **`moBERTo-8k (orig. tok.)`** in the
> moBERTo paper. It is the variant that retains the **original ModernBERT tokenizer**,
> followed by long-context post-training.

`moBERTo-orig-tokenizer` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base), obtained through continued pretraining on a curated 12-billion-token Portuguese corpus (60B training tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context.

Unlike the flagship [`moBERTo`](https://huggingface.co/Tropic-AI/moBERTo) variant, this model **keeps the original ModernBERT tokenizer** rather than adopting a Portuguese-optimized one.
This makes it particularly strong for long-context retrieval: at 8,192 tokens on MLDR it reaches 0.6140 nDCG@10, the **best result among the long-context post-trained moBERTo variants** and only marginally behind `moBERTo-orig-tokenizer-1k` (0.6166). The trade-off is slightly lower NER and NLU scores than the flagship model.

The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of **8,192 tokens**.

---

## Model Details

| Attribute             | Value                                           |
|-----------------------|-------------------------------------------------|
| Architecture          | ModernBERT (encoder-only)                       |
| Base checkpoint       | `answerdotai/ModernBERT-base`                   |
| Parameters            | ~150M                                           |
| Max context length    | 8,192 tokens                                    |
| Tokenizer             | **Original ModernBERT tokenizer (unchanged)**   |
| Embedding init        | Inherited from ModernBERT-base                  |
| Pretraining tokens    | 60B (5 epochs over 12B-token corpus)            |
| Long-context post-tr. | 10B tokens at 8,192-token context               |
| Training corpus       | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework             | Composer                                        |
| Precision             | bfloat16                                        |
| License               | Apache 2.0                                      |

---

## Quick Start

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
model = AutoModel.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")

long_text = "..."  # a long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # outputs.last_hidden_state holds the token embeddings
```

### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Long-document retrieval** (its strongest use case)
- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**

---

## Evaluation Results

All metrics are reported on Portuguese benchmarks, except the GLUE column, which is English. Best results are in **bold**; second-best are underlined.

### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.
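
For readers less familiar with the metric, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It is not the evaluation code behind this card: it assumes linear gain (some benchmarks use `2**rel - 1` instead) and normalizes against the ideal ordering of the same candidate list, which matches a reranking setup with a fixed candidate pool. The relevance labels are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k items, in ranked order."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    # Assumption: the ideal ranking is the same candidates sorted by relevance.
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for one query, in the order a reranker returned the passages.
reranked = [0, 1, 0, 0]
print(f"nDCG@10 = {ndcg_at_k(reranked):.4f}")  # prints "nDCG@10 = 0.6309"
```

A benchmark score is then the mean of this per-query value over all queries.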

| Model                                   | QUATI         | mMARCO        | Robust04      | **Avg.**      |
|-----------------------------------------|---------------|---------------|---------------|---------------|
| BERT-base                               | 0.2846        | 0.4050        | 0.2389        | 0.3095        |
| BERTimbau-base                          | 0.4870        | 0.5005        | 0.4138        | 0.4671        |
| ModernBERT-base                         | 0.3779        | 0.4799        | 0.2988        | 0.3855        |
| NeoBERT-base                            | 0.4000        | 0.4698        | 0.3117        | 0.3938        |
| Qwen3-0.6B-base                         | 0.4248        | 0.5065        | 0.2994        | 0.4102        |
| moBERTo-orig-tokenizer-1k               | 0.5383        | 0.5109        | 0.4510        | 0.5001        |
| **moBERTo-orig-tokenizer (this model)** | 0.5231        | 0.5089        | 0.4516        | 0.4945        |
| moBERTo-1k                              | <u>0.5410</u> | **0.5169**    | <u>0.4782</u> | <u>0.5120</u> |
| moBERTo                                 | **0.5609**    | <u>0.5147</u> | **0.5010**    | **0.5255**    |

### Long-Context Retrieval (MLDR, nDCG@10)

| Model                                   | 512           | 2,048         | 4,096         | 8,192         |
|-----------------------------------------|---------------|---------------|---------------|---------------|
| ModernBERT-base                         | 0.4054        | 0.4206        | 0.3015        | 0.2867        |
| NeoBERT-base                            | 0.4746        | 0.5149        | 0.4676        | --            |
| Qwen3-0.6B-base                         | 0.3560        | 0.4023        | 0.4241        | 0.5351        |
| moBERTo-orig-tokenizer-1k               | **0.5834**    | <u>0.5909</u> | **0.6286**    | **0.6166**    |
| **moBERTo-orig-tokenizer (this model)** | 0.5674        | **0.6025**    | 0.5876        | <u>0.6140</u> |
| moBERTo-1k                              | 0.5466        | 0.4791        | 0.5714        | 0.5857        |
| moBERTo                                 | <u>0.5827</u> | 0.5606        | <u>0.5905</u> | 0.5777        |

### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                                   | Docs          | Educ.         | **Avg.**      |
|-----------------------------------------|---------------|---------------|---------------|
| BERT-base                               | 0.8700        | 0.5690        | 0.7195        |
| BERTimbau-base                          | 0.8978        | 0.6382        | 0.7680        |
| ModernBERT-base                         | 0.8416        | 0.5730        | 0.7073        |
| NeoBERT-base                            | 0.8970        | 0.6266        | 0.7618        |
| Qwen3-0.6B-base                         | **0.9120**    | 0.6289        | 0.7705        |
| moBERTo-orig-tokenizer-1k               | 0.8942        | 0.6070        | 0.7506        |
| **moBERTo-orig-tokenizer (this model)** | 0.8962        | 0.6035        | 0.7499        |
| moBERTo-1k                              | 0.9024        | 0.6281        | 0.7653        |
| moBERTo                                 | 0.9039        | <u>0.6394</u> | <u>0.7717</u> |
| NeoBERT-PT                              | 0.9030        | **0.6428**    | **0.7729**    |
| Qwen3-0.6B-PT                           | <u>0.9070</u> | 0.6311        | 0.7691        |

### NLU and NER (F1)

| Model                                   | PLUE-PT       | LeNER-Br      | GLUE (English) |
|-----------------------------------------|---------------|---------------|----------------|
| BERT-base                               | 0.6423        | 0.8500        | <u>0.7815</u>  |
| BERTimbau-base                          | 0.6800        | **0.9040**    | 0.6772         |
| ModernBERT-base                         | 0.6420        | 0.8240        | **0.8301**     |
| NeoBERT-base                            | 0.6654        | 0.8590        | 0.7430         |
| Qwen3-0.6B-base                         | 0.6343        | 0.7020        | 0.7260         |
| moBERTo-orig-tokenizer-1k               | 0.6849        | 0.8371        | 0.7705         |
| **moBERTo-orig-tokenizer (this model)** | 0.6910        | 0.8587        | 0.7724         |
| moBERTo-1k                              | <u>0.6959</u> | 0.8710        | 0.7128         |
| moBERTo                                 | **0.6980**    | 0.8726        | 0.7354         |
| NeoBERT-PT                              | 0.6842        | <u>0.8840</u> | 0.6620         |
| Qwen3-0.6B-PT                           | 0.6632        | 0.7100        | 0.7050         |

## Training Data

The pretraining corpus was curated from the Portuguese subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese. The training data has been publicly released alongside the model.
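
The token budget quoted in this card and the batch settings listed under Training Procedure can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes the batch size counts sequences and that every sequence is filled to the maximum length, neither of which the card states explicitly, so the step counts are rough estimates rather than the actual training schedule.

```python
# Figures quoted in this card.
CORPUS_TOKENS = 12_000_000_000
EPOCHS = 5
PHASES = {
    "phase1": {"tokens": 60_000_000_000, "seq_len": 1024, "batch_size": 4608},
    "phase2": {"tokens": 10_000_000_000, "seq_len": 8192, "batch_size": 576},
}


def tokens_per_batch(seq_len: int, batch_size: int) -> int:
    # Assumption: batch_size is in sequences, each packed to seq_len tokens.
    return seq_len * batch_size


def approx_steps(tokens: int, seq_len: int, batch_size: int) -> float:
    # Rough number of optimizer steps needed to consume the token budget.
    return tokens / tokens_per_batch(seq_len, batch_size)


# Five passes over the ~12B-token corpus give the 60B-token Phase 1 budget.
assert CORPUS_TOKENS * EPOCHS == PHASES["phase1"]["tokens"]

for name, cfg in PHASES.items():
    per_batch = tokens_per_batch(cfg["seq_len"], cfg["batch_size"])
    print(f"{name}: {per_batch:,} tokens/batch, ~{approx_steps(**cfg):,.0f} steps")
```

Under these assumptions both phases process 4,718,592 tokens per batch, which suggests the Phase 2 batch size was reduced eightfold to keep the per-step token count constant as the sequence length grew eightfold.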
## Training Procedure

### Phase 1: Continued pretraining (60B tokens at 1,024 context)

| Parameter               | Value                   |
|-------------------------|-------------------------|
| Training tokens         | 60B (5 epochs over 12B) |
| Max sequence length     | 1,024                   |
| Batch size              | 4,608                   |
| Masking rate            | 30%                     |
| Optimizer               | StableAdamW             |
| Learning rate           | 5e-4                    |
| Weight decay            | 1e-5                    |
| Dropout (attn output)   | 0.1                     |
| Dropout (other)         | 0.0                     |
| Precision               | bfloat16                |
| RoPE base (global attn) | 160,000                 |
| RoPE base (local attn)  | 10,000                  |

### Phase 2: Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter           | Value |
|---------------------|-------|
| Training tokens     | 10B   |
| Max sequence length | 8,192 |
| Batch size          | 576   |

## Related Models in the moBERTo Family

| Hugging Face Repo                               | Paper Name                  | Tokenizer | Long-ctx post-tr. |
|-------------------------------------------------|-----------------------------|-----------|-------------------|
| **`Tropic-AI/moBERTo-orig-tokenizer` *(this)*** | **moBERTo-8k (orig. tok.)** | Original  | **Yes**           |
| `Tropic-AI/moBERTo`                             | moBERTo-SWM-8k (PT tok.)    | PT (SWM)  | Yes               |

---

## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
  title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
  author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
  year={2026},
  eprint={2606.22722},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.22722},
}
```