MosaicBERT Base (Portuguese)

MosaicBERT-base pretrained from scratch on the HPLT v3 Portuguese corpus for masked language modeling. The architecture follows mosaicml/examples with ALiBi positional biases (no learned position embeddings) and the PyTorch SDPA attention kernel in place of the vendored Triton kernel.

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

REPO = "eliasjacob/mosaic-bert-base-pt"

# The tokenizer loads without custom code; the model needs trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForMaskedLM.from_pretrained(REPO, trust_remote_code=True)

# Fill the masked position and print the ranked candidates.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"Brasília é a capital do {tokenizer.mask_token}."))

trust_remote_code=True is required because the model uses ALiBi and a patched attention path that are not part of transformers.BertForMaskedLM.
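
The fill-mask pipeline returns a list of dictionaries with the keys score, token, token_str and sequence. A minimal sketch of a helper that ranks them follows; the helper name top_predictions and the sample entries are hypothetical, and the entries are made up to show the shape of the output, not real predictions of this model.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs, best first."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# Made-up entries shaped like fill-mask output; NOT real model predictions.
mock_results = [
    {"score": 0.10, "token": 11, "token_str": " mundo", "sequence": "..."},
    {"score": 0.70, "token": 12, "token_str": " Brasil", "sequence": "..."},
    {"score": 0.05, "token": 13, "token_str": " país", "sequence": "..."},
]

print(top_predictions(mock_results, k=2))
```

With the real pipeline, the same helper would be called as top_predictions(fill(text)).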

Architecture

Hidden size: 768
Layers: 12
Attention heads: 12
FFN size: 3072
Max sequence length: 512
Vocab size: 50,304 (50,280 tokens plus 24 padded dead slots)
Positional encoding: ALiBi (no learned embeddings)
Attention dropout: 0.0 (SDPA kernel)
Tied embeddings: input embedding tied to MLM decoder weight
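
To make the ALiBi entry concrete, here is a minimal sketch of how per-head slopes and the resulting attention bias can be computed for 12 heads. It follows the slope recipe from the reference ALiBi implementation, including its handling of head counts that are not a power of two, and uses the symmetric |i - j| distance that suits a bidirectional encoder. Whether this model's remote code computes the bias in exactly this way is an assumption; the function names are illustrative.

```python
import math


def alibi_slopes(n_heads):
    """Per-head slopes following the reference ALiBi recipe."""

    def power_of_2_slopes(n):
        # Geometric sequence whose first term and ratio are both `start`.
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        return [start * start**i for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_2_slopes(n_heads)

    # Not a power of two: take the slopes for the closest lower power of
    # two, then fill the remaining heads with every other slope from the
    # next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = alibi_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_2_slopes(closest) + extra


def alibi_bias(n_heads, seq_len):
    """bias[h][i][j] = -slope_h * |i - j|, added to the attention logits."""
    return [
        [[-s * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for s in alibi_slopes(n_heads)
    ]


slopes = alibi_slopes(12)
bias = alibi_bias(12, 4)
print(slopes[:3])
```

Because the bias depends only on token distance, no position embedding table is needed, which is why the table above lists no learned embeddings.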

Training

Objective: Masked language modeling
Steps: 125,000 batches (125000ba in Composer time notation)
Global batch: 8192 sequences
Device microbatch: 128
Optimizer: DecoupledAdamW
Learning rate: 0.0005
Betas: (0.9, 0.98)
Epsilon: 1e-06
Weight decay: 1e-05
Scheduler: Linear decay with warmup
Warmup: 0.06dur (the first 6% of training)
Final LR fraction: 0.02
Precision: bf16 (amp_bf16)
MLM probability: 0.3
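
The settings above imply a few derived quantities that are easy to check by hand. The sketch below works them out and models the learning-rate curve as a linear warmup followed by a linear decay to the final fraction. Two assumptions are made: the GPU count of 8 is taken from the Hardware section, and the decay is assumed to span from the end of warmup to the last batch, which may differ in detail from the scheduler actually used.

```python
TOTAL_BATCHES = 125_000
GLOBAL_BATCH = 8192
MICROBATCH = 128
NUM_GPUS = 8  # assumption: taken from the Hardware section
PEAK_LR = 5e-4
WARMUP_FRAC = 0.06
FINAL_LR_FRAC = 0.02

per_gpu_batch = GLOBAL_BATCH // NUM_GPUS             # 1024 sequences per GPU
grad_accum = per_gpu_batch // MICROBATCH             # 8 microbatches per step
warmup_batches = round(WARMUP_FRAC * TOTAL_BATCHES)  # 7500 batches


def lr_at(batch):
    """Learning rate at a given batch index (sketch, see assumptions above)."""
    if batch < warmup_batches:
        return PEAK_LR * batch / warmup_batches
    # Assumed decay window: end of warmup to the final batch.
    span = TOTAL_BATCHES - warmup_batches
    progress = min(1.0, (batch - warmup_batches) / span)
    return PEAK_LR * (1.0 + progress * (FINAL_LR_FRAC - 1.0))


for b in (0, warmup_batches, TOTAL_BATCHES):
    print(b, lr_at(b))
```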

Data

Pretraining used the Portuguese subset of HPLT v3, pre-tokenized offline to MosaicML's MDS shard format: 339,102,441 train sequences (~166 GB at uint16). The dataset is licensed separately; consult the HPLT project page for terms.
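
As a rough sanity check, the dataset size can be combined with the training settings to estimate how many passes over the data the run made. This is a back-of-the-envelope sketch that assumes every one of the 125,000 batches drew a full 8192 sequences from this dataset; it also confirms that token ids for a 50,304-entry vocabulary fit in uint16.

```python
TRAIN_SEQUENCES = 339_102_441
TOTAL_BATCHES = 125_000
GLOBAL_BATCH = 8192
VOCAB_SIZE = 50_304

# Sequences consumed over the whole run, assuming full batches throughout.
sequences_seen = TOTAL_BATCHES * GLOBAL_BATCH
passes = sequences_seen / TRAIN_SEQUENCES

# The largest token id must not exceed the uint16 maximum (65535).
fits_uint16 = VOCAB_SIZE - 1 <= 2**16 - 1

print(f"{sequences_seen:,} sequences seen, about {passes:.2f} passes")
print("token ids fit in uint16:", fits_uint16)
```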

Tokenizer

The tokenizer is eliasjacob/ModernBERT-base-portuguese. Its 50,280-entry vocabulary was padded to 50,304 (the next multiple of 64) for tensor-core alignment. The 24 padding rows on the MLM decoder carry a -1e4 bias so argmax never lands on them.
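
The padding arithmetic and the effect of the -1e4 bias can be illustrated in plain Python. This is a toy sketch: in the model the bias sits in the MLM decoder, whereas here it is applied inside a hypothetical masked_argmax helper, and the logits are invented to make the point.

```python
def pad_to_multiple(n, multiple=64):
    """Round n up to the next multiple (ceiling division, then scale)."""
    return -(-n // multiple) * multiple


REAL_VOCAB = 50_280
PADDED_VOCAB = pad_to_multiple(REAL_VOCAB)  # 50_304
PAD_BIAS = -1e4


def masked_argmax(logits):
    """Argmax after adding PAD_BIAS to every padded (dead) slot."""
    biased = [
        x + (PAD_BIAS if i >= REAL_VOCAB else 0.0)
        for i, x in enumerate(logits)
    ]
    return max(range(len(biased)), key=biased.__getitem__)


# Invented logits: a dead slot has the highest raw score.
logits = [0.0] * PADDED_VOCAB
logits[123] = 2.5
logits[PADDED_VOCAB - 1] = 9.0

print(PADDED_VOCAB - REAL_VOCAB, masked_argmax(logits))
```

The guarantee holds as long as raw logit differences stay far below 1e4, which is the usual case for a trained MLM head.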

The tokenizer does not emit token_type_ids. For sentence-pair tasks, both halves are sent as type 0; the model still separates them via [SEP].

Hardware

Pretraining ran on 8x NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB each). NCCL_P2P_DISABLE=1 was required to work around broken peer-to-peer transfers in the NCCL 2.26 bundled with torch==2.7.0+cu128 on consumer Blackwell (sm_120).
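
For anyone reproducing the setup on similar hardware, one way to apply the workaround is to set the variable from Python before any distributed initialization. This is a minimal sketch; exporting NCCL_P2P_DISABLE=1 in the shell that launches training is equivalent, and how the original run set it is not stated here.

```python
import os

# NCCL reads this when it initializes, so set it before creating any
# process group (for example, before torch.distributed is initialized).
os.environ["NCCL_P2P_DISABLE"] = "1"

print(os.environ["NCCL_P2P_DISABLE"])
```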

Limitations

  • Pretrained for masked language modeling only. No next-sentence prediction, no instruction tuning, no RLHF.
  • The tokenizer omits token_type_ids, a small handicap on sentence-pair tasks versus classical BERT.
  • Intended as a backbone for downstream fine-tuning, not for direct generation or zero-shot use.

License

Apache 2.0, matching upstream mosaicml/examples. The HPLT corpus is licensed separately and not redistributed here.

Citation

If you use this model, please cite both MosaicBERT and HPLT:

  • MosaicML, MosaicBERT: Pretraining BERT from Scratch for $20, 2023.
  • HPLT, HPLT: High-Performance Language Technologies.