MosaicBERT Base (Portuguese)

MosaicBERT-base pretrained from scratch on the HPLT v3 Portuguese corpus for masked language modeling. The architecture follows mosaicml/examples with ALiBi positional biases (no learned position embeddings) and the PyTorch SDPA attention kernel in place of the vendored Triton kernel.

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

REPO = "eliasjacob/mosaic-bert-base-pt"

tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForMaskedLM.from_pretrained(REPO, trust_remote_code=True)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"Brasília é a capital do {tokenizer.mask_token}."))

trust_remote_code=True is required because the model uses ALiBi and a patched attention path, neither of which is part of transformers.BertForMaskedLM.

Architecture


Parameter	Value
Hidden size	768
Layers	12
Attention heads	12
FFN size	3072
Max sequence length	512
Vocab size	50,304 (50,280 tokens plus 24 padding slots)
Positional encoding	ALiBi (no learned embeddings)
Attention dropout	0.0 (SDPA kernel)
Tied embeddings	input embedding to MLM decoder weight
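
To make the ALiBi row concrete, here is a minimal pure-Python sketch of the per-head slopes and the symmetric distance penalty used by bidirectional encoders. It follows the slope recipe from the reference ALiBi implementation; that this checkpoint's remote code computes its biases in exactly this way is an assumption, and the function names are illustrative only.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes (reference recipe; assumed, not read from this repo)."""

    def power_of_two_slopes(n: int) -> list[float]:
        # Geometric sequence: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if (n_heads & (n_heads - 1)) == 0:  # n_heads is an exact power of two
        return power_of_two_slopes(n_heads)

    # Otherwise take the slopes for the closest smaller power of two and
    # fill the remainder with every other slope from the next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(n_heads: int, seq_len: int) -> list[list[list[float]]]:
    """Additive attention bias of shape [heads][query][key]: -slope * |q - k|."""
    return [
        [[-slope * abs(q - k) for k in range(seq_len)] for q in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]


slopes = alibi_slopes(12)   # 12 heads, as in the table above
bias = alibi_bias(12, 4)    # tiny sequence length, for inspection only
```

Because the penalty depends only on the distance between query and key positions, no position embedding table is needed.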

Training


Hyperparameter	Value
Objective	Masked language modeling
Steps	125,000 batches (125000ba)
Global batch	8192 sequences
Device microbatch	128
Optimizer	DecoupledAdamW
Learning rate	0.0005
Betas	(0.9, 0.98)
Epsilon	1e-06
Weight decay	1e-05
Scheduler	Linear decay with warmup
Warmup	6% of training duration (0.06dur, i.e. 7,500 batches)
Final LR fraction	0.02
Precision	bf16 (amp_bf16)
MLM probability	0.3
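
The numbers in this table imply a few derived quantities. The sketch below works them out, assuming the 8 GPUs listed under Hardware and a schedule that ramps linearly from zero to the peak rate and then decays linearly to the final fraction. The exact shape used by the training framework's scheduler is an assumption here, and lr_at is a hypothetical helper, not part of any library.

```python
TOTAL_BATCHES = 125_000
GLOBAL_BATCH = 8192
DEVICE_MICROBATCH = 128
NUM_GPUS = 8                # assumption: taken from the Hardware section
PEAK_LR = 5e-4
WARMUP_FRACTION = 0.06
FINAL_LR_FRACTION = 0.02

# 6% of 125,000 batches.
warmup_batches = round(WARMUP_FRACTION * TOTAL_BATCHES)

# Each GPU handles 8192 / 8 sequences per optimizer step, in chunks of 128.
per_device_batch = GLOBAL_BATCH // NUM_GPUS
grad_accum_steps = per_device_batch // DEVICE_MICROBATCH


def lr_at(batch: int) -> float:
    """Learning rate at a given batch under the assumed schedule shape."""
    if batch < warmup_batches:
        # Linear ramp from 0 to the peak rate.
        return PEAK_LR * batch / warmup_batches
    # Linear decay from the peak rate to FINAL_LR_FRACTION * PEAK_LR.
    progress = (batch - warmup_batches) / (TOTAL_BATCHES - warmup_batches)
    return PEAK_LR * (1.0 - progress * (1.0 - FINAL_LR_FRACTION))


final_lr = lr_at(TOTAL_BATCHES)
```

Under these assumptions each optimizer step accumulates gradients over 8 microbatches per device, and the learning rate ends near 1e-5.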

Data

Pretraining used the Portuguese subset of HPLT v3, pre-tokenized offline into MosaicML's MDS shard format: 339,102,441 training sequences (~166 GB with token IDs stored as uint16). The dataset is licensed separately; consult the HPLT project page for terms.
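
As a rough cross-check of these figures against the training table, the sketch below estimates how many passes over the data the run made and confirms that every token ID fits in a uint16. It assumes each element of a global batch is one stored sequence, with no subsampling or repacking at load time.

```python
TRAIN_SEQUENCES = 339_102_441
TOTAL_BATCHES = 125_000      # from the Training table
GLOBAL_BATCH = 8192          # sequences per batch, from the Training table
PADDED_VOCAB = 50_304        # from the Architecture table

# Sequences consumed over the whole run.
sequences_consumed = TOTAL_BATCHES * GLOBAL_BATCH

# Approximate number of passes over the training set.
passes_over_data = sequences_consumed / TRAIN_SEQUENCES

# The largest token ID must fit in an unsigned 16-bit integer.
fits_in_uint16 = (PADDED_VOCAB - 1) <= (2**16 - 1)

print(f"{sequences_consumed:,} sequences, about {passes_over_data:.2f} passes")
```

Storing IDs as uint16 halves the on-disk size compared with int32, which is why the vocabulary size matters for the shard format.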

Tokenizer

The tokenizer is eliasjacob/ModernBERT-base-portuguese. Its 50,280-entry vocabulary was padded to 50,304 (the next multiple of 64) for tensor-core alignment. The 24 padding rows on the MLM decoder carry a -1e4 bias so that argmax never lands on them.
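
The padding arithmetic and the effect of the large negative bias can be sketched in a few lines of plain Python. The zero bias on the real rows is a simplification for illustration (the trained decoder bias is not reproduced here), and both helper names are hypothetical.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple (ceiling division, then scale back)."""
    return -(-n // multiple) * multiple


REAL_VOCAB = 50_280
PADDED_VOCAB = pad_to_multiple(REAL_VOCAB)
NUM_DEAD_SLOTS = PADDED_VOCAB - REAL_VOCAB

# Illustrative decoder bias: zeros for real tokens (an assumption),
# -1e4 for the padding rows, as described above.
decoder_bias = [0.0] * REAL_VOCAB + [-1e4] * NUM_DEAD_SLOTS


def argmax_with_bias(logits: list[float]) -> int:
    """Index of the highest logit after adding the decoder bias."""
    scored = [logit + b for logit, b in zip(logits, decoder_bias)]
    return max(range(len(scored)), key=scored.__getitem__)


# Even if a padding row produced a large raw logit, the bias pushes it down.
logits = [0.0] * PADDED_VOCAB
logits[7] = 1.0          # a real token
logits[50_300] = 50.0    # a padding slot with a spuriously high raw score
predicted = argmax_with_bias(logits)
```

Here the prediction stays on the real token because the padding slot ends up far below every real logit once the bias is added.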

The tokenizer does not emit token_type_ids. For sentence-pair tasks, both halves are sent as type 0; the model still separates them via [SEP].
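
For illustration, a sentence pair can be pictured as the following hand-built input. The layout [CLS] A [SEP] B [SEP] is the classical BERT convention and is assumed here rather than read from the tokenizer's template; the token IDs are made up, and build_pair_input is a hypothetical helper. In practice the tokenizer builds input_ids and attention_mask itself.

```python
def build_pair_input(
    ids_a: list[int], ids_b: list[int], cls_id: int, sep_id: int
) -> dict[str, list[int]]:
    """Assemble a sentence-pair input where every position has segment type 0."""
    input_ids = [cls_id, *ids_a, sep_id, *ids_b, sep_id]
    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        # Shown only to make the point explicit: both halves share type 0,
        # so the [SEP] tokens are the only boundary signal.
        "token_type_ids": [0] * len(input_ids),
    }


# Made-up IDs, purely for demonstration.
example = build_pair_input([11, 12, 13], [21, 22], cls_id=1, sep_id=2)
```

A downstream classifier therefore has to learn the boundary from the separator token alone, which is the small handicap noted under Limitations.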

Hardware

Pretraining ran on 8x NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB each). NCCL_P2P_DISABLE=1 was required to work around broken peer-to-peer transfers in the NCCL 2.26 bundled with torch==2.7.0+cu128 on consumer Blackwell (sm_120).
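
One way to apply this workaround from Python is to set the variable before any distributed process group is initialized. The helper below is a hypothetical sketch; NCCL_P2P_DISABLE is the NCCL environment variable named above, and exporting it in the launching shell works equally well.

```python
import os
from collections.abc import MutableMapping


def configure_nccl_p2p(env: MutableMapping = os.environ) -> str:
    """Disable NCCL peer-to-peer transfers unless the caller already chose a value."""
    # setdefault keeps an explicit user setting and only fills in the default.
    env.setdefault("NCCL_P2P_DISABLE", "1")
    return env["NCCL_P2P_DISABLE"]


# Call this at the very top of the training script, before the
# distributed backend is initialized, so NCCL sees the setting.
configure_nccl_p2p()
```

With peer-to-peer disabled, NCCL falls back to other transports (such as shared memory), which may cost some bandwidth but avoids the broken path.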

Limitations

Pretrained for masked language modeling only. No next-sentence prediction, no instruction tuning, no RLHF.
The tokenizer omits token_type_ids, a small handicap on sentence-pair tasks versus classical BERT.
Intended as a backbone for downstream fine-tuning, not for direct generation or zero-shot use.

License

Apache 2.0, matching upstream mosaicml/examples. The HPLT corpus is licensed separately and not redistributed here.

Citation

If you use this model, please cite both MosaicBERT and HPLT:

MosaicML, MosaicBERT: Pretraining BERT from Scratch for $20, 2023.
HPLT, HPLT: High-Performance Language Technologies.

Checkpoint

Safetensors weights, 0.2B parameters, tensor type F32.