MosaicBERT Base (Portuguese)
MosaicBERT-base pretrained from scratch on the HPLT v3 Portuguese corpus for masked language modeling. The architecture follows mosaicml/examples with ALiBi positional biases (no learned position embeddings) and the PyTorch SDPA attention kernel in place of the vendored Triton kernel.
Usage
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
REPO = "eliasjacob/mosaic-bert-base-pt"
tokenizer = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForMaskedLM.from_pretrained(REPO, trust_remote_code=True)
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill(f"Brasilia e a capital do {tokenizer.mask_token}."))
trust_remote_code=True is required because the model uses ALiBi and a
patched attention path that are not part of transformers.BertForMaskedLM.
Architecture
| Parameter | Value |
|---|---|
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| FFN size | 3072 |
| Max sequence length | 512 |
| Vocab size | 50,304 (50,280 tokens plus 24 padded dead slots) |
| Positional encoding | ALiBi (no learned embeddings) |
| Attention dropout | 0.0 (SDPA kernel) |
| Tied embeddings | Input embedding tied to MLM decoder weight |
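As a rough illustration of what "ALiBi (no learned embeddings)" means for 12 attention heads, the sketch below computes per-head slopes using the scheme from the ALiBi paper (including its rule for head counts that are not a power of two) and builds a symmetric distance penalty of the kind a bidirectional encoder would add to its attention scores. The function names are hypothetical, and the model's own remote code may construct the bias differently.

```python
import math


def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head ALiBi slopes, following the scheme in the ALiBi paper."""

    def power_of_two_slopes(n: int) -> list[float]:
        # Geometric sequence starting at 2^(-8/n) with the same ratio.
        start = 2.0 ** (-8.0 / n)
        return [start ** (i + 1) for i in range(n)]

    if math.log2(n_heads).is_integer():
        return power_of_two_slopes(n_heads)

    # Not a power of two (e.g. 12): take the slopes for the closest lower
    # power of two, then fill the rest with every other slope from the
    # next power of two.
    closest = 2 ** math.floor(math.log2(n_heads))
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra


def alibi_bias(n_heads: int, seq_len: int) -> list[list[list[float]]]:
    """bias[h][i][j] = -slope_h * |i - j| (symmetric form; an assumption here)."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]


slopes = alibi_slopes(12)
bias = alibi_bias(12, 4)
print([round(s, 4) for s in slopes])
```

The bias depends only on the distance between positions, so nothing positional has to be learned or stored in an embedding table.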
Training
| Setting | Value |
|---|---|
| Objective | Masked language modeling |
| Steps | 125,000 batches (125000ba) |
| Global batch | 8192 sequences |
| Device microbatch | 128 |
| Optimizer | DecoupledAdamW |
| Learning rate | 0.0005 |
| Betas | (0.9, 0.98) |
| Epsilon | 1e-06 |
| Weight decay | 1e-05 |
| Scheduler | Linear decay with warmup |
| Warmup | 6% of training duration (0.06dur) |
| Final LR fraction | 0.02 |
| Precision | bf16 (amp_bf16) |
| MLM probability | 0.3 |
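The scheduler rows above can be read as a piecewise-linear curve. The following is a minimal sketch, assuming the learning rate ramps linearly from zero over the warmup and then decays linearly to the final fraction of the peak at the last step; `lr_at` is a hypothetical helper, not the training code, and the actual scheduler may differ in details such as rounding.

```python
BASE_LR = 5e-4
TOTAL_STEPS = 125_000
WARMUP_FRACTION = 0.06
FINAL_LR_FRACTION = 0.02

# 6% of 125,000 steps
WARMUP_STEPS = round(WARMUP_FRACTION * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 to the peak learning rate.
        return BASE_LR * step / WARMUP_STEPS
    # Linear decay from the peak to FINAL_LR_FRACTION * peak.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    factor = 1.0 + progress * (FINAL_LR_FRACTION - 1.0)
    return BASE_LR * factor


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

Under these assumptions the peak of 0.0005 is reached at step 7,500 and the run ends near 0.00001.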
Data
Pretraining used the Portuguese subset of HPLT v3, pre-tokenized offline to MosaicML's MDS shard format: 339,102,441 train sequences (~166 GB at uint16). The dataset is licensed separately; consult the HPLT project page for terms.
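For a sense of scale, the figures above can be combined with the step count and global batch size from the Training section. This is back-of-the-envelope arithmetic rather than reported results: it assumes "~166 GB" means roughly 166 × 10^9 bytes of token IDs only, so the per-sequence average is approximate.

```python
TRAIN_SEQUENCES = 339_102_441
STEPS = 125_000
GLOBAL_BATCH = 8_192          # sequences per optimizer step
DATASET_BYTES = 166e9         # assumption: decimal GB, token IDs only
BYTES_PER_TOKEN = 2           # uint16

# Sequences consumed over the whole run and the implied passes over the data.
sequences_seen = STEPS * GLOBAL_BATCH
passes_over_data = sequences_seen / TRAIN_SEQUENCES

# Rough token count and average sequence length implied by the shard size.
approx_tokens = DATASET_BYTES / BYTES_PER_TOKEN
avg_tokens_per_sequence = approx_tokens / TRAIN_SEQUENCES

print(f"sequences seen: {sequences_seen:,}")
print(f"passes over the training set: {passes_over_data:.2f}")
print(f"approx. tokens on disk: {approx_tokens:.3g}")
print(f"approx. tokens per sequence: {avg_tokens_per_sequence:.0f}")
```

On these assumptions the run sees about three passes over the training set, and the average stored sequence is well below the 512-token maximum.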
Tokenizer
The tokenizer is eliasjacob/ModernBERT-base-portuguese.
Its vocab of 50,280 was padded to 50,304 (multiple of 64) for tensor-core
alignment. The 24 padding rows on the MLM decoder carry a -1e4 bias so
argmax never lands on them.
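The padding arithmetic and the effect of the large negative bias can be sketched in a few lines. The toy logits and helper names below are hypothetical, and the biases on real tokens are set to zero purely for illustration (in the trained model they are learned).

```python
import math

VOCAB_SIZE = 50_280
MULTIPLE = 64
DEAD_BIAS = -1e4


def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple."""
    return math.ceil(n / multiple) * multiple


padded_vocab = pad_to_multiple(VOCAB_SIZE, MULTIPLE)
dead_slots = padded_vocab - VOCAB_SIZE


def decoder_bias(real: int, padded: int) -> list[float]:
    """Zero bias for real tokens (illustrative), DEAD_BIAS for padding rows."""
    return [0.0] * real + [DEAD_BIAS] * (padded - real)


def argmax(values: list[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


# Toy case: 5 real tokens padded to 8. The raw scores of the dead slots are
# deliberately the largest, yet the bias keeps argmax on a real token.
raw_scores = [0.1, 2.0, -1.0, 0.5, 1.5, 9.0, 9.0, 9.0]
logits = [s + b for s, b in zip(raw_scores, decoder_bias(5, 8))]
predicted = argmax(logits)

print(padded_vocab, dead_slots, predicted)
```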
The tokenizer does not emit token_type_ids. For sentence-pair tasks,
both halves are sent as type 0; the model still separates them via
[SEP].
Hardware
Pretraining ran on 8x NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB each).
NCCL_P2P_DISABLE=1 was required to work around broken peer-to-peer
transfers in the NCCL 2.26 bundled with torch==2.7.0+cu128 on
consumer Blackwell (sm_120).
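A minimal sketch of how a launch script might apply this workaround, together with the gradient-accumulation count implied by the batch settings in the Training section. Both helpers are hypothetical, and the accumulation figure assumes plain data parallelism across all 8 GPUs.

```python
import os


def disable_nccl_p2p() -> None:
    """Set the workaround flag.

    NCCL reads its environment when communicators are created, so this must
    run before the distributed process group is initialized.
    """
    os.environ["NCCL_P2P_DISABLE"] = "1"


def grad_accum_steps(global_batch: int, microbatch: int, world_size: int) -> int:
    """Microbatches each device processes per optimizer step."""
    per_step = microbatch * world_size
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by microbatch * world_size")
    return global_batch // per_step


disable_nccl_p2p()
accum = grad_accum_steps(global_batch=8192, microbatch=128, world_size=8)
print(os.environ["NCCL_P2P_DISABLE"], accum)
```

Exporting the variable in the shell before launching has the same effect and avoids ordering issues inside the script.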
Limitations
- Pretrained for masked language modeling only. No next-sentence prediction, no instruction tuning, no RLHF.
- The tokenizer omits token_type_ids, a small handicap on sentence-pair tasks versus classical BERT.
- Intended as a backbone for downstream fine-tuning, not for direct generation or zero-shot use.
License
Apache 2.0, matching upstream mosaicml/examples. The HPLT corpus is licensed separately and not redistributed here.
Citation
If you use this model, please cite both MosaicBERT and HPLT:
- MosaicML, MosaicBERT: Pretraining BERT from Scratch for $20, 2023.
- HPLT, HPLT: High-Performance Language Technologies.