---
library_name: transformers
license: apache-2.0
language:
- sv
- 'no'
- da
- is
tags:
- masked-lm
- fill-mask
- long-context
- modernbert
pipeline_tag: fill-mask
inference: false
base_model:
- answerdotai/ModernBERT-base
---
## Overview  
This checkpoint continues the pre-training of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on Scandinavian text, adding ~1.2 trillion masked-language-modelling (MLM) tokens drawn from [The Nordic Pile](https://arxiv.org/pdf/2303.17183) and [SWEb](https://arxiv.org/pdf/2410.04456) while preserving the original 8k-token context window.

This is a **research artefact** and is only intended for **research purposes**.

The tokenizer is trained from scratch on a subset of 11 985 103 472 tokens.

Training is done in a single stage with 8 192 tokens per sample for the whole run.
## Data Sources  
| Corpus | Size | Selected Languages | Highlights |
|---|---|---|---|
| **The Nordic Pile** | 1.2 TB raw text | sv, no, da, is | Nine diverse categories (CC, Wikipedia, Books, Code, etc.), filtered and deduplicated for high quality |
| **SWEb** | 1 T+ tokens (~3.6 TB) | sv, no, da, is | 98 Common-Crawl snapshots with model-based HTML extraction; 1.2 B documents |
## Training Setup  
| Setting | Value |
|---|---|
| Parameters | 150 M |
| Context length | 8 192 tokens (RoPE + local-global attention) |
| Tokens processed | 1.20 × 10<sup>12</sup> |
| Tokens per batch | 1 572 864 |
| Global batch | 192 sequences (micro-batch = 3) |
| Optimizer & schedule | Decoupled StableAdamW, lr 2e-4, cosine decay (1 % warm-up) |
| Precision | AMP-bf16 |
| Hardware | 8 nodes × 8 AMD MI250X GPUs (64 GPUs) on the EuroHPC **LUMI-G** system |
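The schedule in the table (peak lr 2e-4, cosine decay, 1 % warm-up) can be sketched as follows. This is a minimal illustration of the stated hyper-parameters, not the exact trainer implementation; `lr_at` is a hypothetical helper:

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.01):
    """Linear warm-up to peak_lr over the first 1% of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The batch-size rows are consistent: 192 sequences x 8 192 tokens each
# gives the 1 572 864 tokens per batch reported above.
tokens_per_batch = 192 * 8192
total_steps = 873_585  # batches reported in the training stats below
```

At step 0 the rate is 0, at the end of warm-up it reaches the 2e-4 peak, and it decays to 0 by the final batch.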

See the full training configuration [here](https://github.com/timpal0l/ModernBERT/blob/main/training/trainer_lumi.yaml).
## Training Stats
```text
[token=1198511677292/1198510347252]:
  Train time/batch: 873585
  Train time/sample: 167728320
  Train time/batch_in_epoch: 3558
  Train time/sample_in_epoch: 683136
  Train time/token: 1198510256276
  Train time/token_in_epoch: 4882888303
  Train trainer/device_train_microbatch_size: 3
  Train loss/train/total: 0.9966
  Train throughput/batches_per_sec: 1.3117
  Train throughput/samples_per_sec: 251.8442
  Train throughput/device/batches_per_sec: 0.0205
  Train throughput/device/samples_per_sec: 3.9351
  Train throughput/tokens_per_sec: 1804244.5198
  Train throughput/device/tokens_per_sec: 28191.3206
  Train time/train: 184.5555
  Train time/val: 0.0000
  Train time/total: 184.5555
  Train lr-StableAdamW/group0: 0.0000
  Train lr-StableAdamW/group1: 0.0000
```
## Intended Use  
This is a **research artefact** and is only intended for **research purposes**.
* Fill-mask inference, embedding extraction and fine-tuning for Scandinavian downstream NLP tasks (classification, NER, QA, etc.).  
* Drop-in replacement for BERT-style encoders (omit `token_type_ids`).
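As a sketch of embedding extraction, the tokenizer output can be passed straight to the encoder since ModernBERT takes no `token_type_ids`. Mean pooling over non-padding tokens is an illustrative choice here, not a prescribed method:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["Huvudstaden i Sverige är Stockholm.", "Oslo ligger i Norge."]
batch = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding positions to get one vector per text.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(1) / mask.sum(1)
```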
## Fill-mask
```python
from transformers import pipeline
unmasker = pipeline('fill-mask', model='AI-Sweden-Models/ModernBERT-base')
unmasker("Huvudstaden i Sverige är [MASK].")
```
```python
[{'score': 0.0629318505525589,
  'token': 2961,
  'token_str': ' Stockholm',
  'sequence': 'Huvudstaden i Sverige är  Stockholm.'},
 {'score': 0.03635135293006897,
  'token': 49763,
  'token_str': 'awesome',
  'sequence': 'Huvudstaden i Sverige är awesome.'},
 {'score': 0.03006783314049244,
  'token': 751,
  'token_str': ' stor',
  'sequence': 'Huvudstaden i Sverige är  stor.'},
 {'score': 0.029827557504177094,
  'token': 71,
  'token_str': 'a',
  'sequence': 'Huvudstaden i Sverige är a.'},
 {'score': 0.019739385694265366,
  'token': 79,
  'token_str': 'i',
  'sequence': 'Huvudstaden i Sverige är i.'}]
```
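The pipeline call above can also be reproduced manually, which is useful when you want the raw logits. This is a sketch using `AutoModelForMaskedLM`; the top-5 decoding mirrors what the pipeline returns:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "AI-Sweden-Models/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("Huvudstaden i Sverige är [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five most likely tokens.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
probs = logits[0, mask_pos].softmax(-1)
top = probs.topk(5)
for score, token_id in zip(top.values, top.indices):
    print(f"{score:.4f}  {tokenizer.decode(token_id)!r}")
```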
## Limitations & Biases  
* Web corpora can contain noise, stereotypes and sensitive content despite filtering.   
* RoPE extrapolation beyond the 8k-token training context is untested and may degrade.
## Code to reproduce
* [Training](https://github.com/timpal0l/ModernBERT/tree/main/training)
* [Data Processing](https://github.com/timpal0l/ModernBERT/tree/main/tokenizer)