	---
	language:
	- pt
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- portuguese
	- encoder
	- masked-lm
	- long-context
	- moberto
	pipeline_tag: fill-mask
	base_model: answerdotai/ModernBERT-base
	datasets:
	- HuggingFaceFW/fineweb-2
	metrics:
	- nDCG@10
	- F1
	model-index:
	- name: moBERTo-orig-tokenizer
	  results:
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: QUATI
	      type: quati
	    metrics:
	    - type: nDCG@10
	      value: 0.5231
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: mMARCO-PT
	      type: mmarco-pt
	    metrics:
	    - type: nDCG@10
	      value: 0.5089
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: Robust04-PT
	      type: robust04-pt
	    metrics:
	    - type: nDCG@10
	      value: 0.4516
	  - task:
	      type: text-retrieval
	      name: Long-Context Reranking
	    dataset:
	      name: MLDR (PT)
	      type: mldr
	    metrics:
	    - type: nDCG@10
	      value: 0.6140
	      name: nDCG@10 at 8192 tokens
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: LeNER-Br
	      type: lener-br
	    metrics:
	    - type: F1
	      value: 0.8587
	  - task:
	      type: text-classification
	      name: Natural Language Understanding
	    dataset:
	      name: PLUE-PT
	      type: plue-pt
	    metrics:
	    - type: F1
	      value: 0.6910
	---

	# moBERTo-orig-tokenizer

	> Paper name: This model is referred to as `moBERTo-8k (orig. tok.)` in the
	> moBERTo paper. It is the variant that retains the original ModernBERT tokenizer,
	> followed by long-context post-training.

	`moBERTo-orig-tokenizer` is a Portuguese adaptation of
	[ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base). It was obtained by
	continued pretraining on a curated 12-billion-token Portuguese corpus (60B training
	tokens, 5 epochs), followed by a 10B-token long-context post-training phase at a
	context length of 8,192 tokens.

	Unlike the flagship [`moBERTo`](https://huggingface.co/Tropic-AI/moBERTo) variant,
	this model keeps the original ModernBERT tokenizer rather than adopting a
	Portuguese-optimized one. This makes it particularly strong for long-context
	retrieval: on MLDR at 8,192 tokens it reaches 0.6140 nDCG@10, ahead of the
	flagship's 0.5777 and second only to its 1k-context sibling
	`moBERTo-orig-tokenizer-1k` (0.6166). The trade-off is slightly weaker NER and
	NLU performance than the Portuguese-tokenizer variants.

	The model preserves all architectural advances of ModernBERT: rotary positional
	embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
	with a native context window of 8,192 tokens.
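
To make the alternating local–global pattern concrete, here is a minimal sketch in plain Python. The specific values (22 layers, global attention every third layer, a 128-token local window) are assumptions taken from the publicly documented ModernBERT-base configuration, not from this card, so check the checkpoint's `config.json` before relying on them.

```python
# Assumed ModernBERT-base settings (not stated in this card).
NUM_LAYERS = 22
GLOBAL_EVERY_N_LAYERS = 3
LOCAL_WINDOW = 128  # total sliding-window width in tokens


def layer_kind(layer_idx: int) -> str:
    """Return 'global' for full-attention layers, 'local' for sliding-window ones."""
    return "global" if layer_idx % GLOBAL_EVERY_N_LAYERS == 0 else "local"


def can_attend(layer_idx: int, query_pos: int, key_pos: int) -> bool:
    """Whether a query token may attend to a key token in the given layer."""
    if layer_kind(layer_idx) == "global":
        return True
    # Local layers only see tokens within half a window on either side.
    return abs(query_pos - key_pos) <= LOCAL_WINDOW // 2


pattern = [layer_kind(i) for i in range(NUM_LAYERS)]
global_layers = [i for i, kind in enumerate(pattern) if kind == "global"]
print(global_layers)
```

Under these assumptions, only the global layers connect tokens that are thousands of positions apart, while the local layers keep most of the computation cheap.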

	---

	## Model Details

	| Attribute                   | Value                                             |
	|-----------------------------|---------------------------------------------------|
	| Architecture                | ModernBERT (encoder-only)                         |
	| Base checkpoint             | `answerdotai/ModernBERT-base`                     |
	| Parameters                  | ~150M                                             |
	| Max context length          | 8,192 tokens                                      |
	| Tokenizer                   | Original ModernBERT tokenizer (unchanged)         |
	| Embedding init              | Inherited from ModernBERT-base                    |
	| Pretraining tokens          | 60B (5 epochs over 12B-token corpus)              |
	| Long-context post-training  | 10B tokens at 8,192-token context                 |
	| Training corpus             | FineWeb-2 (PT subset) filtered with ClassiCC-PT   |
	| Framework                   | Composer                                          |
	| Precision                   | bfloat16                                          |
	| License                     | Apache 2.0                                        |

	---

	## Quick Start

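	```python
	import torch
	from transformers import AutoModel, AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
	model = AutoModel.from_pretrained("Tropic-AI/moBERTo-orig-tokenizer")
	long_text = "..."  # long document in Portuguese
	inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)  # token embeddings are in outputs.last_hidden_state
	```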

	### Recommended for downstream tasks

	This model is best used as a backbone for fine-tuning on:

	- Long-document retrieval (its strongest use case)
	- Cross-encoder reranking (information retrieval)
	- Document classification
	- Named entity recognition
	- Natural language inference / semantic textual similarity
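
When the encoder is used as a backbone, its per-token outputs usually have to be reduced to a single vector before a task head. This card does not say which pooling the authors used, so the following is only a sketch of one common choice, masked mean pooling, written in plain Python with a hypothetical helper name and toy inputs.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Toy input: three token vectors of size 2; the last one is padding.
embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(embeddings, mask)
print(pooled)
```

In practice the same operation would be applied to the model's last hidden state as a tensor; the list-based version is only meant to show the logic.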

	---

	## Evaluation Results

	All metrics are reported on Portuguese benchmarks. Best results are in bold;
	second-best are <ins>underlined</ins>.

	### Information Retrieval (Reranking, nDCG@10)

	Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

	| Model                                | QUATI             | mMARCO            | Robust04          | Avg.              |
	|--------------------------------------|-------------------|-------------------|-------------------|-------------------|
	| BERT-base                            | 0.2846            | 0.4050            | 0.2389            | 0.3095            |
	| BERTimbau-base                       | 0.4870            | 0.5005            | 0.4138            | 0.4671            |
	| ModernBERT-base                      | 0.3779            | 0.4799            | 0.2988            | 0.3855            |
	| NeoBERT-base                         | 0.4000            | 0.4698            | 0.3117            | 0.3938            |
	| Qwen3-0.6B-base                      | 0.4248            | 0.5065            | 0.2994            | 0.4102            |
	| moBERTo-orig-tokenizer-1k            | 0.5383            | 0.5109            | 0.4510            | 0.5001            |
	| moBERTo-orig-tokenizer (this model)  | 0.5231            | 0.5089            | 0.4516            | 0.4945            |
	| moBERTo-1k                           | <ins>0.5410</ins> | **0.5169**        | <ins>0.4782</ins> | <ins>0.5120</ins> |
	| moBERTo                              | **0.5609**        | <ins>0.5147</ins> | **0.5010**        | **0.5255**        |
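
For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query, using the common linear-gain formulation. The function names and toy relevance labels are illustrative, and the benchmark numbers above may have been computed with a different gain variant or evaluation toolkit.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking.

    Simplification: the ideal ranking is built from the candidates only,
    i.e. every relevant document is assumed to be in the candidate list.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: graded relevance labels of the candidates,
# listed in the order the reranker returned them.
reranked = [0, 2, 1, 0, 0]
score = ndcg_at_k(reranked)
print(f"nDCG@10 = {score:.4f}")
```

The reported scores are averages of this per-query value over all queries in a benchmark.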

	### Long-Context Retrieval (MLDR, nDCG@10)

	| Model                                | 512               | 2,048             | 4,096             | 8,192             |
	|--------------------------------------|-------------------|-------------------|-------------------|-------------------|
	| ModernBERT-base                      | 0.4054            | 0.4206            | 0.3015            | 0.2867            |
	| NeoBERT-base                         | 0.4746            | 0.5149            | 0.4676            | --                |
	| Qwen3-0.6B-base                      | 0.3560            | 0.4023            | 0.4241            | 0.5351            |
	| moBERTo-orig-tokenizer-1k            | **0.5834**        | <ins>0.5909</ins> | **0.6286**        | **0.6166**        |
	| moBERTo-orig-tokenizer (this model)  | 0.5674            | **0.6025**        | 0.5876            | <ins>0.6140</ins> |
	| moBERTo-1k                           | 0.5466            | 0.4791            | 0.5714            | 0.5857            |
	| moBERTo                              | <ins>0.5827</ins> | 0.5606            | <ins>0.5905</ins> | 0.5777            |

	### Classification (F1)

	- Docs: document type classification (news, legal, academic, etc.)
	- Educ.: educational content detection

	| Model                                | Docs              | Educ.             | Avg.              |
	|--------------------------------------|-------------------|-------------------|-------------------|
	| BERT-base                            | 0.8700            | 0.5690            | 0.7195            |
	| BERTimbau-base                       | 0.8978            | 0.6382            | 0.7680            |
	| ModernBERT-base                      | 0.8416            | 0.5730            | 0.7073            |
	| NeoBERT-base                         | 0.8970            | 0.6266            | 0.7618            |
	| Qwen3-0.6B-base                      | **0.9120**        | 0.6289            | 0.7705            |
	| moBERTo-orig-tokenizer-1k            | 0.8942            | 0.6070            | 0.7506            |
	| moBERTo-orig-tokenizer (this model)  | 0.8962            | 0.6035            | 0.7499            |
	| moBERTo-1k                           | 0.9024            | 0.6281            | 0.7653            |
	| moBERTo                              | 0.9039            | <ins>0.6394</ins> | <ins>0.7717</ins> |
	| NeoBERT-PT                           | 0.9030            | **0.6428**        | **0.7729**        |
	| Qwen3-0.6B-PT                        | <ins>0.9070</ins> | 0.6311            | 0.7691            |

	### NLU and NER (F1)

	| Model                                | PLUE-PT           | LeNER-Br          | GLUE (English)    |
	|--------------------------------------|-------------------|-------------------|-------------------|
	| BERT-base                            | 0.6423            | 0.8500            | <ins>0.7815</ins> |
	| BERTimbau-base                       | 0.6800            | **0.9040**        | 0.6772            |
	| ModernBERT-base                      | 0.6420            | 0.8240            | **0.8301**        |
	| NeoBERT-base                         | 0.6654            | 0.8590            | 0.7430            |
	| Qwen3-0.6B-base                      | 0.6343            | 0.7020            | 0.7260            |
	| moBERTo-orig-tokenizer-1k            | 0.6849            | 0.8371            | 0.7705            |
	| moBERTo-orig-tokenizer (this model)  | 0.6910            | 0.8587            | 0.7724            |
	| moBERTo-1k                           | <ins>0.6959</ins> | 0.8710            | 0.7128            |
	| moBERTo                              | **0.6980**        | 0.8726            | 0.7354            |
	| NeoBERT-PT                           | 0.6842            | <ins>0.8840</ins> | 0.6620            |
	| Qwen3-0.6B-PT                        | 0.6632            | 0.7100            | 0.7050            |


	## Training Data

	The pretraining corpus was curated from the Portuguese subset of
	[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
	filtered using the educational and STEM classifiers from ClassiCC-PT. The final
	corpus comprises ~12 billion tokens, roughly six times larger than BrWaC,
	covering a broad range of domains and topics in Portuguese.

	The training data has been publicly released alongside the model.

	---

	## Training Procedure

	### Phase 1 — Continued pretraining (60B tokens at 1,024 context)

	| Parameter                | Value                       |
	|--------------------------|-----------------------------|
	| Training tokens          | 60B (5 epochs over 12B)     |
	| Max sequence length      | 1,024                       |
	| Batch size               | 4,608                       |
	| Masking rate             | 30%                         |
	| Optimizer                | StableAdamW                 |
	| Learning rate            | 5e-4                        |
	| Weight decay             | 1e-5                        |
	| Dropout (attn output)    | 0.1                         |
	| Dropout (other)          | 0.0                         |
	| Precision                | bfloat16                    |
	| RoPE base (global attn)  | 160,000                     |
	| RoPE base (local attn)   | 10,000                      |
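
The two RoPE bases control how quickly the rotary angles change with position. As a rough illustration, the sketch below computes the rotary wavelengths for each base with the standard RoPE frequency formula. The head dimension of 64 is an assumption based on the ModernBERT-base configuration and is not stated in this card.

```python
import math

HEAD_DIM = 64  # assumed per-head dimension (ModernBERT-base); not stated in this card


def rope_wavelengths(base: float, head_dim: int = HEAD_DIM) -> list[float]:
    """Wavelength, in tokens, of each rotary frequency pair: 2*pi * base**(2i/d)."""
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


global_wavelengths = rope_wavelengths(160_000)  # global-attention layers
local_wavelengths = rope_wavelengths(10_000)    # local-attention layers

longest_global = max(global_wavelengths)
longest_local = max(local_wavelengths)
print(f"longest wavelength, global: {longest_global:,.0f} tokens")
print(f"longest wavelength, local:  {longest_local:,.0f} tokens")
```

A larger base stretches the slowest-rotating dimensions, which is the usual motivation for a higher base in the layers that attend over the full sequence.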

	### Phase 2 — Long-context post-training (10B tokens at 8,192 context)

	Same hyperparameters as Phase 1, except:

	| Parameter                | Value                       |
	|--------------------------|-----------------------------|
	| Training tokens          | 10B                         |
	| Max sequence length      | 8,192                       |
	| Batch size               | 576                         |
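
A quick arithmetic check shows how the two phases relate. Assuming the batch size is counted in sequences and every sequence is packed to the maximum length (both are assumptions, not stated above), the token budget per optimizer step is identical in the two phases, and approximate step counts follow directly.

```python
def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step, assuming fully packed sequences."""
    return batch_size * seq_len


def approx_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Approximate number of optimizer steps needed to consume the token budget."""
    return round(total_tokens / tokens_per_step(batch_size, seq_len))


phase1_tokens_per_step = tokens_per_step(4_608, 1_024)
phase2_tokens_per_step = tokens_per_step(576, 8_192)
phase1_steps = approx_steps(60_000_000_000, 4_608, 1_024)
phase2_steps = approx_steps(10_000_000_000, 576, 8_192)

print(phase1_tokens_per_step, phase2_tokens_per_step)  # both 4718592
print(phase1_steps, phase2_steps)
```

Keeping the tokens per step constant is consistent with reusing the Phase 1 learning rate and optimizer settings in Phase 2.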



	## Related Models in the moBERTo Family

	| Hugging Face Repo                              | Paper Name                   | Tokenizer     | Long-context post-training |
	|------------------------------------------------|------------------------------|---------------|----------------------------|
	| **`Tropic-AI/moBERTo-orig-tokenizer` (this)**  | **moBERTo-8k (orig. tok.)**  | **Original**  | **Yes**                    |
	| `Tropic-AI/moBERTo`                            | moBERTo-SWM-8k (PT tok.)     | PT (SWM)      | Yes                        |

	---

	## Citation

	```bibtex
	@misc{laitz2026mobertomodernencoderportuguese,
	title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
	author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
	year={2026},
	eprint={2606.22722},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2606.22722},
	}
	```