Update README.md
<img src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>

The fourth generation of NorBERT models mainly improves their efficiency, but also performance and flexibility.

<img src="https://huggingface.co/ltg/norbert4-xlarge/resolve/main/model_performance.png" width=100%>
- **Made to encode long texts**: these models were trained on 16384-token-long texts; the sliding-window attention can then generalize to even longer sequences.
- **Fast and memory-efficient training and inference**: using FlashAttention 2 with unpadding, the new generation of NorBERT models can process long texts with ease.
- **Better performance**: higher-quality training corpora and carefully tuned training settings lead to improved performance over NorBERT 3.
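As a toy illustration of the sliding-window idea (not the model's actual implementation), each token attends only to neighbours within a fixed-size band, so the attention pattern does not depend on the total sequence length:

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    # True where token i may attend to token j: a symmetric band of
    # width `window` around the diagonal, regardless of seq_len.
    return [[abs(i - j) < window for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=2)
# Token 0 sees tokens 0 and 1 but not token 3, however long the sequence grows.
```

Because the band has constant width, a model trained on 16384-token windows can be run on longer inputs without changing the attention pattern it learned.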
> [!TIP]
> We recommend installing Flash Attention 2 and `torch.compile`-ing your models to get the highest training and inference efficiency.
## All sizes of the NorBERT4 family:
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Import model
tokenizer = AutoTokenizer.from_pretrained(
    "ltg/norbert4-xlarge"
)
model = AutoModelForMaskedLM.from_pretrained(
    "ltg/norbert4-xlarge",
    trust_remote_code=True
)

# Tokenize text (with a mask token inside)
input_text = tokenizer(
    "Oslo er [MASK] i Norge.",  # illustrative example; any text containing the mask token works
    return_tensors="pt"
)
```
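After running the model, the word predicted for the masked position is the argmax over the vocabulary at that position. A minimal stand-in sketch (hypothetical scores instead of the real `model(**input_text).logits`, so it runs without downloading the model):

```python
# Hypothetical logits at the [MASK] position over a 4-word toy vocabulary;
# in the real pipeline these come from model(**input_text).logits.
mask_logits = [0.1, 2.3, -0.5, 1.7]

# Pick the highest-scoring vocabulary id.
top_id = max(range(len(mask_logits)), key=mask_logits.__getitem__)
# tokenizer.decode([top_id]) would turn this id back into a word.
```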
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Import model
tokenizer = AutoTokenizer.from_pretrained(
    "ltg/norbert4-xlarge"
)
model = AutoModelForCausalLM.from_pretrained(
    "ltg/norbert4-xlarge",
    trust_remote_code=True
)

# Define zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""  # illustrative completion of the truncated template
```
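The template above is filled with plain `str.format`; a small sketch (the trailing `Bokmål:` line is an assumed completion of the truncated template, after which the model is expected to continue with the Norwegian translation):

```python
prompt = """Engelsk: {0}
Bokmål:"""

# Insert the English source sentence into the zero-shot template.
filled = prompt.format("The weather is nice today.")
```

The filled string would then be tokenized and passed to `model.generate` to obtain the translation.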