---
language:
- pt
license: apache-2.0
library_name: transformers
tags:
- modernbert
- portuguese
- encoder
- masked-lm
- long-context
- moberto
pipeline_tag: fill-mask
base_model: answerdotai/ModernBERT-base
datasets:
- HuggingFaceFW/fineweb-2
metrics:
- nDCG@10
- F1
model-index:
- name: moBERTo
  results:
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: QUATI
      type: quati
    metrics:
    - type: nDCG@10
      value: 0.5609
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: mMARCO-PT
      type: mmarco-pt
    metrics:
    - type: nDCG@10
      value: 0.5147
  - task:
      type: text-retrieval
      name: Reranking
    dataset:
      name: Robust04-PT
      type: robust04-pt
    metrics:
    - type: nDCG@10
      value: 0.5010
  - task:
      type: text-retrieval
      name: Long-Context Reranking
    dataset:
      name: MLDR (PT)
      type: mldr
    metrics:
    - type: nDCG@10
      value: 0.5777
      name: nDCG@10 at 8192 tokens
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: LeNER-Br
      type: lener-br
    metrics:
    - type: F1
      value: 0.8726
  - task:
      type: text-classification
      name: Natural Language Understanding
    dataset:
      name: PLUE-PT
      type: plue-pt
    metrics:
    - type: F1
      value: 0.6980
---

# moBERTo

> **Paper name:** This model is referred to as **`moBERTo-SWM-8k (PT tok.)`** in the
> moBERTo paper. It is the **best-performing variant** of the moBERTo family, achieving
> the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks
> and the best PLUE-PT score.

`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base), obtained through continued pretraining on a curated 12-billion-token Portuguese corpus (60B training tokens, 5 epochs) followed by a long-context post-training phase at 8,192-token context. It combines four adaptation strategies:

1. **Continued pretraining** from the original ModernBERT-base checkpoint, preserving the long-context capabilities learned during the original 2T-token English pretraining.
2. **Portuguese tokenizer** with a vocabulary optimized for Portuguese text.
3. **Subword Matching (SWM) embedding transfer**, which initializes each new Portuguese token's embedding as a combination of the original ModernBERT subword embeddings, keeping the model close to its pretrained representation space.
4. **Long-context post-training** at 8,192 tokens for an additional 10B tokens.

The model preserves all architectural advances of ModernBERT: rotary positional embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding, with a native context window of **8,192 tokens**.

---

## Model Details

| Attribute              | Value                                           |
|------------------------|-------------------------------------------------|
| Architecture           | ModernBERT (encoder-only)                       |
| Base checkpoint        | `answerdotai/ModernBERT-base`                   |
| Parameters             | ~150M                                           |
| Max context length     | 8,192 tokens                                    |
| Tokenizer              | Portuguese (custom vocabulary)                  |
| Embedding init         | Subword Matching (SWM) transfer                 |
| Pretraining tokens     | 60B (5 epochs over 12B-token corpus)            |
| Long-context post-tr.  | 10B tokens at 8,192-token context               |
| Training corpus        | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
| Framework              | Composer                                        |
| Precision              | bfloat16                                        |
| License                | Apache 2.0                                      |

---

## Quick Start

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and encoder from the Hub repo listed under "Related Models".
tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
model = AutoModel.from_pretrained("Tropic-AI/moBERTo")

long_text = "..."  # long document in Portuguese
inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```

### Recommended for downstream tasks

This model is best used as a backbone for fine-tuning on:

- **Cross-encoder reranking** (information retrieval)
- **Document classification**
- **Named entity recognition**
- **Natural language inference / semantic textual similarity**
- **Long-document retrieval** (up to 8,192 tokens)

---

## Evaluation Results

All metrics are reported on Portuguese benchmarks. Best results are in **bold**; second-best are underlined.
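
Most tables in this section report nDCG@10. As a reminder of what that number measures, here is a minimal sketch of the metric in plain Python. It assumes linear gains and an ideal ranking built from the candidate list itself; the evaluation tooling actually used for the reported numbers is not specified in this card, so the helper names `dcg_at_k` and `ndcg_at_k` and the toy labels are illustrative only.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k items, taken in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels, listed in the order a reranker returned the candidates.
ranked_labels = [1, 0, 1, 0, 0]
print(round(ndcg_at_k(ranked_labels), 4))
```

A ranking that places every relevant candidate first scores 1.0, and the score drops as relevant candidates move down the top 10.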
### Information Retrieval (Reranking, nDCG@10)

Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

| Model                       | QUATI         | mMARCO        | Robust04      | **Avg.**      |
|-----------------------------|---------------|---------------|---------------|---------------|
| BERT-base                   | 0.2846        | 0.4050        | 0.2389        | 0.3095        |
| BERTimbau-base              | 0.4870        | 0.5005        | 0.4138        | 0.4671        |
| ModernBERT-base             | 0.3779        | 0.4799        | 0.2988        | 0.3855        |
| NeoBERT-base                | 0.4000        | 0.4698        | 0.3117        | 0.3938        |
| Qwen3-0.6B-base             | 0.4248        | 0.5065        | 0.2994        | 0.4102        |
| moBERTo-orig-tokenizer-1k   | 0.5383        | 0.5109        | 0.4510        | 0.5001        |
| moBERTo-orig-tokenizer      | 0.5231        | 0.5089        | 0.4516        | 0.4945        |
| moBERTo-1k                  | <u>0.5410</u> | **0.5169**    | <u>0.4782</u> | <u>0.5120</u> |
| **moBERTo (this model)**    | **0.5609**    | <u>0.5147</u> | **0.5010**    | **0.5255**    |

### Long-Context Retrieval (MLDR, nDCG@10)

Columns give the maximum input length in tokens.

| Model                       | 512           | 2,048         | 4,096         | 8,192         |
|-----------------------------|---------------|---------------|---------------|---------------|
| ModernBERT-base             | 0.4054        | 0.4206        | 0.3015        | 0.2867        |
| NeoBERT-base                | 0.4746        | 0.5149        | 0.4676        | --            |
| Qwen3-0.6B-base             | 0.3560        | 0.4023        | 0.4241        | 0.5351        |
| moBERTo-orig-tokenizer-1k   | **0.5834**    | <u>0.5909</u> | **0.6286**    | **0.6166**    |
| moBERTo-orig-tokenizer      | 0.5674        | **0.6025**    | 0.5876        | <u>0.6140</u> |
| moBERTo-1k                  | 0.5466        | 0.4791        | 0.5714        | 0.5857        |
| **moBERTo (this model)**    | <u>0.5827</u> | 0.5606        | <u>0.5905</u> | 0.5777        |

### Classification (F1)

- **Docs**: document type classification (news, legal, academic, etc.)
- **Educ.**: educational content detection

| Model                       | Docs          | Educ.         | **Avg.**      |
|-----------------------------|---------------|---------------|---------------|
| BERT-base                   | 0.8700        | 0.5690        | 0.7195        |
| BERTimbau-base              | 0.8978        | 0.6382        | 0.7680        |
| ModernBERT-base             | 0.8416        | 0.5730        | 0.7073        |
| NeoBERT-base                | 0.8970        | 0.6266        | 0.7618        |
| Qwen3-0.6B-base             | **0.9120**    | 0.6289        | 0.7705        |
| moBERTo-orig-tokenizer-1k   | 0.8942        | 0.6070        | 0.7506        |
| moBERTo-orig-tokenizer      | 0.8962        | 0.6035        | 0.7499        |
| moBERTo-1k                  | 0.9024        | 0.6281        | 0.7653        |
| **moBERTo (this model)**    | 0.9039        | <u>0.6394</u> | <u>0.7717</u> |
| NeoBERT-PT                  | 0.9030        | **0.6428**    | **0.7729**    |
| Qwen3-0.6B-PT               | <u>0.9070</u> | 0.6311        | 0.7691        |

### NLU and NER (F1)

| Model                       | PLUE-PT       | LeNER-Br      | GLUE (English) |
|-----------------------------|---------------|---------------|----------------|
| BERT-base                   | 0.6423        | 0.8500        | <u>0.7815</u>  |
| BERTimbau-base              | 0.6800        | **0.9040**    | 0.6772         |
| ModernBERT-base             | 0.6420        | 0.8240        | **0.8301**     |
| NeoBERT-base                | 0.6654        | 0.8590        | 0.7430         |
| Qwen3-0.6B-base             | 0.6343        | 0.7020        | 0.7260         |
| moBERTo-orig-tokenizer-1k   | 0.6849        | 0.8371        | 0.7705         |
| moBERTo-orig-tokenizer      | 0.6910        | 0.8587        | 0.7724         |
| moBERTo-1k                  | <u>0.6959</u> | 0.8710        | 0.7128         |
| **moBERTo (this model)**    | **0.6980**    | 0.8726        | 0.7354         |
| NeoBERT-PT                  | 0.6842        | <u>0.8840</u> | 0.6620         |
| Qwen3-0.6B-PT               | 0.6632        | 0.7100        | 0.7050         |

> **Note on GLUE:** As expected from continued pretraining on Portuguese, English
> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301).

---

## Key Findings (from the paper's ablations)

1. **Continued pretraining > training from scratch.** The gap is largest for long context: `moBERTo` achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining provides representations that transfer effectively even when continued pretraining itself uses only 1,024-token sequences.
2. **Tokenizer adaptation helps token-level tasks but disrupts long context.** Moving to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (a drop of ~11 points without embedding transfer).
3. **SWM embedding transfer mitigates the long-context degradation.** By initializing new Portuguese embeddings as combinations of the original subword embeddings, SWM recovers most of the long-context performance lost by tokenizer adaptation alone.
4. **Long-context post-training yields the strongest reranker.** `moBERTo` (this model) achieves the highest average reranking nDCG@10 (0.5255) and the best PLUE-PT score (0.6980).

---

## Training Data

The pretraining corpus was curated from the Portuguese subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further filtered using the educational and STEM classifiers from ClassiCC-PT. The final corpus comprises **~12 billion tokens**, roughly six times larger than BrWaC, covering a broad range of domains and topics in Portuguese. The training data has been publicly released alongside the model.
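
Before training on this corpus, the new Portuguese vocabulary needs embeddings. The SWM transfer described in Key Findings builds them from the original ModernBERT subword embeddings. The sketch below illustrates the general idea with NumPy on a toy vocabulary. It makes several assumptions that this card does not confirm: the "combination" is taken to be a plain mean, the old tokenizer is stood in for by a greedy longest-match helper, and unmatched tokens fall back to random values drawn around the old embedding statistics. The names `greedy_tokenize` and `swm_init` are hypothetical.

```python
import numpy as np


def greedy_tokenize(text, vocab):
    """Toy stand-in for the original tokenizer: greedy longest-match over vocab."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # unknown character, kept as its own piece
            i += 1
    return pieces


def swm_init(new_vocab, old_vocab, old_embeddings, rng=None):
    """Build embeddings for new_vocab from old subword embeddings.

    Assumption: each new token gets the mean of the old embeddings of the
    pieces it splits into; the paper's exact weighting may differ.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mean, std = old_embeddings.mean(axis=0), old_embeddings.std(axis=0)
    new_embeddings = np.empty((len(new_vocab), old_embeddings.shape[1]), dtype=old_embeddings.dtype)
    for new_id, token in enumerate(new_vocab):
        piece_ids = [old_vocab[p] for p in greedy_tokenize(token, old_vocab) if p in old_vocab]
        if piece_ids:
            new_embeddings[new_id] = old_embeddings[piece_ids].mean(axis=0)
        else:
            # Assumed fallback: sample around the old embedding distribution.
            new_embeddings[new_id] = rng.normal(mean, std)
    return new_embeddings


# Toy data: four old subwords with 3-dimensional embeddings.
old_vocab = {"in": 0, "forma": 1, "ção": 2, "casa": 3}
old_embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)
new_vocab = ["informação", "casa", "xyz"]

new_embeddings = swm_init(new_vocab, old_vocab, old_embeddings)
print(new_embeddings[0])  # mean of the rows for "in", "forma", "ção"
```

Tokens shared by both vocabularies, such as `casa` here, keep their original vector unchanged, which is one way such a scheme stays close to the pretrained representation space.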
---

## Training Procedure

### Phase 1: Continued pretraining (60B tokens at 1,024 context)

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 60B (5 epochs over 12B)     |
| Max sequence length      | 1,024                       |
| Batch size               | 4,608                       |
| Masking rate             | 30%                         |
| Optimizer                | StableAdamW                 |
| Learning rate            | 5e-4                        |
| Weight decay             | 1e-5                        |
| Dropout (attn output)    | 0.1                         |
| Dropout (other)          | 0.0                         |
| Precision                | bfloat16                    |
| RoPE base (global attn)  | 160,000                     |
| RoPE base (local attn)   | 10,000                      |

### Phase 2: Long-context post-training (10B tokens at 8,192 context)

Same hyperparameters as Phase 1, except:

| Parameter                | Value                       |
|--------------------------|-----------------------------|
| Training tokens          | 10B                         |
| Max sequence length      | 8,192                       |
| Batch size               | 576                         |

---

## Related Models in the moBERTo Family

| Hugging Face Repo                    | Paper Name                   | Tokenizer | Long-ctx post-tr. |
|--------------------------------------|------------------------------|-----------|-------------------|
| `Tropic-AI/moBERTo-orig-tokenizer`   | moBERTo-8k (orig. tok.)      | Original  | Yes               |
| **`Tropic-AI/moBERTo`** *(this)*     | **moBERTo-SWM-8k (PT tok.)** | PT (SWM)  | **Yes**           |

---

## Citation

```bibtex
@misc{laitz2026mobertomodernencoderportuguese,
  title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
  author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
  year={2026},
  eprint={2606.22722},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.22722},
}
```