Commit 0261de6 (verified) by davda54 · Parent: 26cf69b

Update README.md

Files changed (1): README.md (+169, -3)
---
language:
- 'no'
- nb
- nn
- se
inference: false
tags:
- BERT
- GPT-BERT
- NorBERT
- Norwegian
- encoder
- decoder
license: apache-2.0
---

# NorBERT 4 xlarge

<img src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>

The fourth generation of NorBERT models mainly improves efficiency, but it also improves performance and flexibility.
- **Made to encode long texts**: these models were trained on 16384-token-long texts, and their sliding-window attention can generalize to even longer sequences.
- **Fast and memory-efficient training and inference**: using FlashAttention2 with unpadding, the new generation of NorBERT models can process long texts with ease.
- **Better performance**: higher-quality training corpora and carefully tuned training settings lead to improved performance over NorBERT 3.
- **BERT as well as GPT**: the models can function as bidirectional encoders (BERT) or as unidirectional decoders (GPT), which makes them suitable for a wide range of downstream uses.
- **Trained from scratch**: the model is trained from scratch on 600B tokens of Norwegian Bokmål, Nynorsk and Northern Sámi. We used the HPLT 2.0 corpus, FineWeb2 and Mímir Core.
- **Permissive license**: the checkpoints are distributed freely under Apache 2.0, so anyone can use our models.

> [!TIP]
> We recommend installing Flash Attention 2 and `torch.compile`-ing your models to get the highest training and inference efficiency.

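
The snippet below is only a rough sketch of that setup (it is not part of the official instructions); it assumes the custom modeling code picks up the `flash-attn` package automatically when it is installed and falls back to standard attention otherwise:

```python
# Optional: install the FlashAttention2 kernels first
#   pip install flash-attn   (assumed to be picked up automatically by the custom modeling code)
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ltg/norbert4-xlarge", trust_remote_code=True)
model = torch.compile(model)  # compile the forward pass for faster training and inference
```
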
<img src="https://huggingface.co/ltg/norbert4-xlarge/resolve/main/model_performance.png" width=100%>

## All sizes of the NorBERT 4 family

- [NorBERT 4 xsmall (17M)](https://huggingface.co/ltg/norbert4-xsmall)
- [NorBERT 4 small (40M)](https://huggingface.co/ltg/norbert4-small)
- [NorBERT 4 base (149M)](https://huggingface.co/ltg/norbert4-base)
- [NorBERT 4 large (360M)](https://huggingface.co/ltg/norbert4-large)
- [NorBERT 4 xlarge (987M)](https://huggingface.co/ltg/norbert4-xlarge)

## Example usage (bidirectional encoding)

This model currently needs a custom wrapper from `modeling_norbert.py`, so you should load it with `trust_remote_code=True`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("ltg/norbert4-xlarge")
model = AutoModelForMaskedLM.from_pretrained("ltg/norbert4-xlarge", trust_remote_code=True)

# Tokenize text (with a mask token inside)
input_text = tokenizer(
    f"Nå ønsker de seg en{tokenizer.mask_token} bolig.",
    return_tensors="pt"
)

# Inference
with torch.inference_mode():
    output_p = model(**input_text)

# Unmask the text by replacing each mask token with its most likely prediction
output_text = torch.where(
    input_text.input_ids == tokenizer.mask_token_id,
    output_p.logits.argmax(-1),
    input_text.input_ids
)

# Decoding; should output: '<s>Nå ønsker de seg en ny bolig.'
print(tokenizer.decode(output_text[0].tolist()))
```
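
Because the models were trained on 16384-token sequences, they can also be used to embed long documents directly. The following is a minimal sketch (not from the original card) that assumes the `AutoModel` wrapper returns standard `last_hidden_state` outputs:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("ltg/norbert4-xlarge")
model = AutoModel.from_pretrained("ltg/norbert4-xlarge", trust_remote_code=True)

# A long (here artificially repeated) Norwegian document
long_document = " ".join(["Nå ønsker de seg en ny bolig."] * 1000)
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

with torch.inference_mode():
    # Contextual token embeddings, shape (1, sequence_length, hidden_size)
    embeddings = model(**inputs).last_hidden_state
```
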

## Example usage (text generation)

NorBERT now also supports unidirectional text decoding, so it can generate text like any other GPT model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("ltg/norbert4-xlarge")
model = AutoModelForCausalLM.from_pretrained("ltg/norbert4-xlarge", trust_remote_code=True)

# Define a zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Define the tokens that should end the generation (any token containing a newline)
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

# Generation function
@torch.inference_mode()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
print(generate("I'm a model that can generate text!"))
```

The following classes are currently implemented: `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForCausalLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModelForMultipleChoice`.
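
As a hypothetical example of attaching a task-specific head (assuming the custom classes follow the usual `num_labels` convention), a sequence classifier could be initialised like this:

```python
from transformers import AutoModelForSequenceClassification

# Load the encoder with a freshly initialised classification head (label count is hypothetical)
model = AutoModelForSequenceClassification.from_pretrained(
    "ltg/norbert4-xlarge",
    trust_remote_code=True,
    num_labels=3,
)
```
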

## Contact

David Samuel: `davisamu@uio.no`

## Cite us

```bibtex
@inproceedings{charpentier-samuel-2024-bert,
    title = "{GPT} or {BERT}: why not both?",
    author = "Charpentier, Lucas Georges Gabriel and
      Samuel, David",
    editor = "Hu, Michael Y. and
      Mueller, Aaron and
      Ross, Candace and
      Williams, Adina and
      Linzen, Tal and
      Zhuang, Chengxu and
      Choshen, Leshem and
      Cotterell, Ryan and
      Warstadt, Alex and
      Wilcox, Ethan Gotlieb",
    booktitle = "The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2024",
    address = "Miami, FL, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.conll-babylm.24/",
    pages = "262--283"
}
```

```bibtex
@inproceedings{samuel-etal-2023-norbench,
    title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models",
    author = "Samuel, David and
      Kutuzov, Andrey and
      Touileb, Samia and
      Velldal, Erik and
      {\O}vrelid, Lilja and
      R{\o}nningstad, Egil and
      Sigdel, Elina and
      Palatkina, Anna",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.61",
    pages = "618--633"
}
```