	---
	language:
	- pt
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- portuguese
	- encoder
	- masked-lm
	- long-context
	- moberto
	pipeline_tag: fill-mask
	base_model: answerdotai/ModernBERT-base
	datasets:
	- HuggingFaceFW/fineweb-2
	metrics:
	- nDCG@10
	- F1
	model-index:
	- name: moBERTo
	  results:
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: QUATI
	      type: quati
	    metrics:
	    - type: nDCG@10
	      value: 0.5609
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: mMARCO-PT
	      type: mmarco-pt
	    metrics:
	    - type: nDCG@10
	      value: 0.5147
	  - task:
	      type: text-retrieval
	      name: Reranking
	    dataset:
	      name: Robust04-PT
	      type: robust04-pt
	    metrics:
	    - type: nDCG@10
	      value: 0.5010
	  - task:
	      type: text-retrieval
	      name: Long-Context Reranking
	    dataset:
	      name: MLDR (PT)
	      type: mldr
	    metrics:
	    - type: nDCG@10
	      value: 0.5777
	      name: nDCG@10 at 8192 tokens
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: LeNER-Br
	      type: lener-br
	    metrics:
	    - type: F1
	      value: 0.8726
	  - task:
	      type: text-classification
	      name: Natural Language Understanding
	    dataset:
	      name: PLUE-PT
	      type: plue-pt
	    metrics:
	    - type: F1
	      value: 0.6980
	---

	# moBERTo

	> Paper name: This model is referred to as `moBERTo-SWM-8k (PT tok.)` in the
	> moBERTo paper. It is the best-performing variant of the moBERTo family, achieving
	> the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks
	> and the best PLUE-PT score.

	`moBERTo` is a Portuguese adaptation of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base),
	obtained through continued pretraining on a curated 12-billion-token Portuguese corpus
	(60B training tokens, 5 epochs) followed by a long-context post-training phase at
	8,192-token context.

	It combines four adaptation strategies:

	1. Continued pretraining from the original ModernBERT-base checkpoint, preserving
	the long-context capabilities learned during the original 2T-token English pretraining.
	2. Portuguese tokenizer with vocabulary optimized for Portuguese text.
	3. Subword Matching (SWM) embedding transfer, which initializes each new Portuguese
	token's embedding as a combination of the original ModernBERT subword embeddings,
	keeping the model close to its pretrained representation space.
	4. Long-context post-training at 8,192 tokens for an additional 10B tokens.
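
The embedding transfer in step 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the "combination" is an unweighted mean of the old subword embeddings, and the names `swm_init`, `toy_old_tokenize`, and the toy vocabulary are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def swm_init(
    new_tokens: Sequence[str],
    old_tokenize: Callable[[str], List[str]],
    old_embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Initialise each new token as the mean of its old-subword embeddings.

    Assumption: an unweighted mean stands in for the "combination" used by SWM.
    """
    dim = len(next(iter(old_embeddings.values())))
    new_embeddings: Dict[str, Vector] = {}
    for token in new_tokens:
        # Split the new token with the OLD tokenizer and keep known pieces.
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            # Hypothetical fallback when nothing matches: a zero vector.
            new_embeddings[token] = [0.0] * dim
            continue
        summed = [sum(old_embeddings[p][i] for p in pieces) for i in range(dim)]
        new_embeddings[token] = [value / len(pieces) for value in summed]
    return new_embeddings


# Toy stand-ins for the original tokenizer and embedding matrix.
toy_old_embeddings = {"infor": [1.0, 0.0], "mação": [0.0, 1.0], "casa": [0.5, 0.5]}
toy_splits = {"informação": ["infor", "mação"], "casa": ["casa"]}


def toy_old_tokenize(text: str) -> List[str]:
    return toy_splits.get(text, [])


toy_new_embeddings = swm_init(
    ["informação", "casa", "xyz"], toy_old_tokenize, toy_old_embeddings
)
print(toy_new_embeddings)
```

Because every new vector is built from vectors the encoder has already seen, the first layers receive inputs close to the pretrained distribution, which is the intuition behind keeping the model near its original representation space.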

	The model preserves all architectural advances of ModernBERT: rotary positional
	embeddings (RoPE), alternating local–global attention, FlashAttention, and unpadding,
	with a native context window of 8,192 tokens.

	---

	## Model Details

	| Attribute | Value |
	|-----------|-------|
	| Architecture | ModernBERT (encoder-only) |
	| Base checkpoint | `answerdotai/ModernBERT-base` |
	| Parameters | ~150M |
	| Max context length | 8,192 tokens |
	| Tokenizer | Portuguese (custom vocabulary) |
	| Embedding init | Subword Matching (SWM) transfer |
	| Pretraining tokens | 60B (5 epochs over 12B-token corpus) |
	| Long-context post-training | 10B tokens at 8,192-token context |
	| Training corpus | FineWeb-2 (PT subset) filtered with ClassiCC-PT |
	| Framework | Composer |
	| Precision | bfloat16 |
	| License | Apache 2.0 |

	---

	## Quick Start

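	```python
	import torch
	from transformers import AutoModel, AutoTokenizer
	# Load the tokenizer and encoder from the Hub repository named in this card
	tokenizer = AutoTokenizer.from_pretrained("Tropic-AI/moBERTo")
	model = AutoModel.from_pretrained("Tropic-AI/moBERTo")
	long_text = "..."  # long document in Portuguese
	inputs = tokenizer(long_text, max_length=8192, truncation=True, return_tensors="pt")
	with torch.no_grad():
	    outputs = model(**inputs)
	```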

	### Recommended for downstream tasks

	This model is best used as a backbone for fine-tuning on:

	- Cross-encoder reranking (information retrieval)
	- Document classification
	- Named entity recognition
	- Natural language inference / semantic textual similarity
	- Long-document retrieval (up to 8,192 tokens)

	---

	## Evaluation Results

	All metrics are reported on Portuguese benchmarks. Best results are in bold;
	second-best are <ins>underlined</ins>.

	### Information Retrieval (Reranking, nDCG@10)

	Cross-encoder reranking, fine-tuned on mMARCO-PT triples.

	| Model | QUATI | mMARCO | Robust04 | Avg. |
	|-------|-------|--------|----------|------|
	| BERT-base | 0.2846 | 0.4050 | 0.2389 | 0.3095 |
	| BERTimbau-base | 0.4870 | 0.5005 | 0.4138 | 0.4671 |
	| ModernBERT-base | 0.3779 | 0.4799 | 0.2988 | 0.3855 |
	| NeoBERT-base | 0.4000 | 0.4698 | 0.3117 | 0.3938 |
	| Qwen3-0.6B-base | 0.4248 | 0.5065 | 0.2994 | 0.4102 |
	| moBERTo-orig-tokenizer-1k | 0.5383 | 0.5109 | 0.4510 | 0.5001 |
	| moBERTo-orig-tokenizer | 0.5231 | 0.5089 | 0.4516 | 0.4945 |
	| moBERTo-1k | <ins>0.5410</ins> | **0.5169** | <ins>0.4782</ins> | <ins>0.5120</ins> |
	| moBERTo (this model) | **0.5609** | <ins>0.5147</ins> | **0.5010** | **0.5255** |
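
For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It assumes the linear-gain formulation (gain equal to the relevance label); the card does not state which evaluation tool or gain variant the paper used, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    judged_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: labels of the documents in the order the reranker returned them.
    judged_relevances: labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# One query with two relevant documents, returned at ranks 1 and 3.
example_score = ndcg_at_k([1, 0, 1], [1, 0, 1])
print(round(example_score, 4))
```

The scores in the table are averages of this per-query value over all queries in each benchmark.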

	### Long-Context Retrieval (MLDR, nDCG@10)

	| Model | 512 | 2,048 | 4,096 | 8,192 |
	|-------|-----|-------|-------|-------|
	| ModernBERT-base | 0.4054 | 0.4206 | 0.3015 | 0.2867 |
	| NeoBERT-base | 0.4746 | 0.5149 | 0.4676 | -- |
	| Qwen3-0.6B-base | 0.3560 | 0.4023 | 0.4241 | 0.5351 |
	| moBERTo-orig-tokenizer-1k | **0.5834** | <ins>0.5909</ins> | **0.6286** | **0.6166** |
	| moBERTo-orig-tokenizer | 0.5674 | **0.6025** | 0.5876 | <ins>0.6140</ins> |
	| moBERTo-1k | 0.5466 | 0.4791 | 0.5714 | 0.5857 |
	| moBERTo (this model) | <ins>0.5827</ins> | 0.5606 | <ins>0.5905</ins> | 0.5777 |

	### Classification (F1)

	- Docs: document type classification (news, legal, academic, etc.)
	- Educ.: educational content detection

	| Model | Docs | Educ. | Avg. |
	|-------|------|-------|------|
	| BERT-base | 0.8700 | 0.5690 | 0.7195 |
	| BERTimbau-base | 0.8978 | 0.6382 | 0.7680 |
	| ModernBERT-base | 0.8416 | 0.5730 | 0.7073 |
	| NeoBERT-base | 0.8970 | 0.6266 | 0.7618 |
	| Qwen3-0.6B-base | **0.9120** | 0.6289 | 0.7705 |
	| moBERTo-orig-tokenizer-1k | 0.8942 | 0.6070 | 0.7506 |
	| moBERTo-orig-tokenizer | 0.8962 | 0.6035 | 0.7499 |
	| moBERTo-1k | 0.9024 | 0.6281 | 0.7653 |
	| moBERTo (this model) | 0.9039 | <ins>0.6394</ins> | <ins>0.7717</ins> |
	| NeoBERT-PT | 0.9030 | **0.6428** | **0.7729** |
	| Qwen3-0.6B-PT | <ins>0.9070</ins> | 0.6311 | 0.7691 |

	### NLU and NER (F1)

	| Model | PLUE-PT | LeNER-Br | GLUE (English) |
	|-------|---------|----------|----------------|
	| BERT-base | 0.6423 | 0.8500 | <ins>0.7815</ins> |
	| BERTimbau-base | 0.6800 | **0.9040** | 0.6772 |
	| ModernBERT-base | 0.6420 | 0.8240 | **0.8301** |
	| NeoBERT-base | 0.6654 | 0.8590 | 0.7430 |
	| Qwen3-0.6B-base | 0.6343 | 0.7020 | 0.7260 |
	| moBERTo-orig-tokenizer-1k | 0.6849 | 0.8371 | 0.7705 |
	| moBERTo-orig-tokenizer | 0.6910 | 0.8587 | 0.7724 |
	| moBERTo-1k | <ins>0.6959</ins> | 0.8710 | 0.7128 |
	| moBERTo (this model) | **0.6980** | 0.8726 | 0.7354 |
	| NeoBERT-PT | 0.6842 | <ins>0.8840</ins> | 0.6620 |
	| Qwen3-0.6B-PT | 0.6632 | 0.7100 | 0.7050 |

	> Note on GLUE: As expected from continued pretraining on Portuguese, English
	> performance degrades. ModernBERT-base remains the strongest on GLUE (0.8301),
	> while this model scores 0.7354.

	---

	## Key Findings (from the paper's ablations)

	1. Continued pretraining beats training from scratch, especially for long context:
	`moBERTo` achieves 0.5777 on MLDR@8192 vs. 0.1405 for a from-scratch baseline
	trained on the same Portuguese budget. The original 2T-token ModernBERT pretraining
	provides representations that transfer effectively even when continued pretraining
	itself uses only 1,024-token sequences.
	2. Tokenizer adaptation helps token-level tasks but disrupts long context. Moving
	to a Portuguese tokenizer improves PLUE-PT and LeNER-Br but hurts MLDR@8192 (drops
	by ~11 points without embedding transfer).
	3. SWM embedding transfer mitigates the long-context degradation. By initializing
	new Portuguese embeddings as combinations of the original subword embeddings, SWM
	recovers most of the long-context performance lost by tokenizer adaptation alone.
	4. Long-context post-training yields the strongest reranker. `moBERTo`
	(this model) achieves the highest average reranking nDCG@10 (0.5255) and the best
	PLUE-PT score (0.6980).

	---

	## Training Data

	The pretraining corpus was curated from the Portuguese subset of
	[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and further
	filtered using the educational and STEM classifiers from ClassiCC-PT. The final
	corpus comprises ~12 billion tokens, roughly six times larger than BrWaC,
	covering a broad range of domains and topics in Portuguese.

	The training data has been publicly released alongside the model.

	---

	## Training Procedure

	### Phase 1 — Continued pretraining (60B tokens at 1,024 context)

	| Parameter | Value |
	|-----------|-------|
	| Training tokens | 60B (5 epochs over 12B) |
	| Max sequence length | 1,024 |
	| Batch size | 4,608 |
	| Masking rate | 30% |
	| Optimizer | StableAdamW |
	| Learning rate | 5e-4 |
	| Weight decay | 1e-5 |
	| Dropout (attn output) | 0.1 |
	| Dropout (other) | 0.0 |
	| Precision | bfloat16 |
	| RoPE base (global attn) | 160,000 |
	| RoPE base (local attn) | 10,000 |
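
To give a feel for the two RoPE bases, the sketch below computes the wavelength of each rotary frequency pair using the standard RoPE definition, where pair `i` rotates at `base ** (-2i / d)` radians per position. The head dimension of 64 is an assumption (768 hidden units split over 12 heads) and is not stated in this card.

```python
import math
from typing import List


def rope_wavelengths(base: float, head_dim: int) -> List[float]:
    """Wavelength, in token positions, of each rotary frequency pair."""
    wavelengths = []
    for pair_index in range(head_dim // 2):
        # Standard RoPE inverse frequency for this pair of dimensions.
        inv_freq = base ** (-2.0 * pair_index / head_dim)
        wavelengths.append(2.0 * math.pi / inv_freq)
    return wavelengths


HEAD_DIM = 64  # assumption: not given in this card

local_wavelengths = rope_wavelengths(10_000, HEAD_DIM)
global_wavelengths = rope_wavelengths(160_000, HEAD_DIM)

print(f"longest local wavelength:  {local_wavelengths[-1]:,.0f} positions")
print(f"longest global wavelength: {global_wavelengths[-1]:,.0f} positions")
```

A larger base stretches the slowest rotations over more positions, which is why the global-attention layers, which see the whole 8,192-token window, use the larger value.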

	### Phase 2 — Long-context post-training (10B tokens at 8,192 context)

	Same hyperparameters as Phase 1, except:

	| Parameter | Value |
	|-----------|-------|
	| Training tokens | 10B |
	| Max sequence length | 8,192 |
	| Batch size | 576 |
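
The two batch sizes appear to be chosen so that the number of tokens per optimizer step stays constant across phases. The arithmetic below checks this, assuming that batch size is counted in sequences and that every sequence is filled to the maximum length; neither is stated explicitly in this card, so the step counts are rough estimates.

```python
def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences at full length)."""
    return batch_size * seq_len


def steps_needed(total_tokens: int, per_step: int) -> int:
    """Steps required to consume total_tokens, rounded up."""
    return -(-total_tokens // per_step)  # integer ceiling division


phase1_tokens_per_step = tokens_per_step(4_608, 1_024)
phase2_tokens_per_step = tokens_per_step(576, 8_192)

phase1_steps = steps_needed(60_000_000_000, phase1_tokens_per_step)
phase2_steps = steps_needed(10_000_000_000, phase2_tokens_per_step)

print(phase1_tokens_per_step, phase2_tokens_per_step)
print(phase1_steps, phase2_steps)
```

Under these assumptions both phases process about 4.7M tokens per step: the sequence length grows by 8x while the batch size shrinks by 8x.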

	---

	## Related Models in the moBERTo Family

	| Hugging Face Repo | Paper Name | Tokenizer | Long-context post-training |
	|-------------------|------------|-----------|----------------------------|
	| `Tropic-AI/moBERTo-orig-tokenizer` | moBERTo-8k (orig. tok.) | Original | Yes |
	| **`Tropic-AI/moBERTo` (this)** | moBERTo-SWM-8k (PT tok.) | PT (SWM) | Yes |

	---

	## Citation

	```bibtex
	@misc{laitz2026mobertomodernencoderportuguese,
	title={moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT},
	author={Thiago Laitz and Thales Sales Almeida and João Guilherme Alves Santos and Giovana Kerche Bonás},
	year={2026},
	eprint={2606.22722},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2606.22722},
	}
	```