---
library_name: transformers
license: apache-2.0
language:
- sv
- 'no'
- da
- is
tags:
- masked-lm
- fill-mask
- long-context
- modernbert
pipeline_tag: fill-mask
inference: false
base_model:
- answerdotai/ModernBERT-base
---

## Overview

This checkpoint continues the pre-training of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on Scandinavian text, processing ~1.2 trillion additional masked-language-modelling (MLM) tokens drawn from [The Nordic Pile](https://arxiv.org/pdf/2303.17183) and [SWEb](https://arxiv.org/pdf/2410.04456) while preserving the original 8 192-token context window.

This is a **research artefact** and is intended for **research purposes only**.

The tokenizer was trained from scratch on a subset of the training data comprising 11 985 103 472 tokens.

Training was carried out in a single stage, with 8 192 tokens per sample for the entire run.
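As a quick sanity check, the custom tokenizer can be loaded on its own. A minimal sketch (the printed values depend on the released tokenizer files):

```python
from transformers import AutoTokenizer

# Load the tokenizer that was trained from scratch for this model.
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-base")

print(tokenizer.vocab_size)        # size of the Scandinavian vocabulary
print(tokenizer.mask_token)        # mask token to use in fill-mask prompts
print(tokenizer.model_max_length)  # should reflect the 8 192-token context window
```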

## Data Sources

| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| **The Nordic Pile** | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, books, code, etc.), filtered and deduplicated for high quality |
| **SWEb** | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common Crawl snapshots with model-based HTML extraction; 1.2 B documents |
## Training Setup

| Setting | Value |
|---|---|
| Parameters | 150 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 1.20 × 10<sup>12</sup> |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC **LUMI-G** system |
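
The batch-size figures in the table are mutually consistent; a small sketch of the arithmetic:

```python
# Consistency check for the batch-size figures above.
seq_len = 8192       # context length (tokens per sample)
global_batch = 192   # sequences per optimizer step
micro_batch = 3      # sequences per GPU per step
gpus = 64            # 8 nodes × 8 MI250X

assert micro_batch * gpus == global_batch   # 3 × 64 = 192 sequences
assert global_batch * seq_len == 1_572_864  # tokens per batch
```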

The full training configuration is available [here](https://github.com/timpal0l/ModernBERT/blob/main/training/trainer_lumi.yaml).

## Training Stats

```
[token=1198511677292/1198510347252]:
Train time/batch: 873585
Train time/sample: 167728320
Train time/batch_in_epoch: 3558
Train time/sample_in_epoch: 683136
Train time/token: 1198510256276
Train time/token_in_epoch: 4882888303
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 0.9966
Train throughput/batches_per_sec: 1.3117
Train throughput/samples_per_sec: 251.8442
Train throughput/device/batches_per_sec: 0.0205
Train throughput/device/samples_per_sec: 3.9351
Train throughput/tokens_per_sec: 1804244.5198
Train throughput/device/tokens_per_sec: 28191.3206
Train time/train: 184.5555
Train time/val: 0.0000
Train time/total: 184.5555
Train lr-StableAdamW/group0: 0.0000
Train lr-StableAdamW/group1: 0.0000
```
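
The reported totals are self-consistent if `Train time/train` is read in hours (an assumption; the log does not state the unit):

```python
# Rough consistency check on the stats above, assuming time is in hours.
tokens_per_sec = 1804244.5198
train_time_hours = 184.5555

approx_tokens = tokens_per_sec * train_time_hours * 3600
print(f"{approx_tokens:.4g}")  # ~1.199e12, close to the ~1.1985e12 tokens reported
```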

## Intended Use

This model is released **for research purposes only**. Suitable uses include:

* Fill-mask inference, embedding extraction, and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.).
* Use as a drop-in replacement for BERT-style encoders (omit `token_type_ids`); see the embedding-extraction sketch below.
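
A minimal sketch of embedding extraction (mean pooling here is an illustrative choice, not a recommendation from the training recipe):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/ModernBERT-base")
model = AutoModel.from_pretrained("AI-Sweden-Models/ModernBERT-base")

inputs = tokenizer(
    ["Huvudstaden i Sverige är Stockholm."],  # "The capital of Sweden is Stockholm."
    padding=True,
    return_tensors="pt",
)  # note: no token_type_ids are needed for this model

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state over non-padding tokens.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (1, hidden_size)
```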

## Fill-mask

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-base')
unmasker("Huvudstaden i Sverige är [MASK].")  # "The capital of Sweden is [MASK]."
```

```python
[{'score': 0.0629318505525589,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är Stockholm.'},
 {'score': 0.03635135293006897,
  'token': 49763,
  'token_str': 'awesome',
  'sequence': 'Huvudstaden i Sverige är awesome.'},
 {'score': 0.03006783314049244,
  'token': 751,
  'token_str': ' stor',
  'sequence': 'Huvudstaden i Sverige är stor.'},
 {'score': 0.029827557504177094,
  'token': 71,
  'token_str': 'a',
  'sequence': 'Huvudstaden i Sverige är a.'},
 {'score': 0.019739385694265366,
  'token': 79,
  'token_str': 'i',
  'sequence': 'Huvudstaden i Sverige är i.'}]
```

## Limitations & Biases

* Web corpora can contain noise, stereotypes, and sensitive content despite filtering.
* RoPE extrapolation beyond 8 192 tokens is untested and may degrade quality.

## Code to Reproduce

* [Training](https://github.com/timpal0l/ModernBERT/tree/main/training)
* [Data Processing](https://github.com/timpal0l/ModernBERT/tree/main/tokenizer)