---
license: apache-2.0
datasets:
- HPLT/HPLT3.0
- allenai/olmo-mix-1124
- HuggingFaceFW/finepdfs
- HuggingFaceTB/finemath
- LLM360/MegaMath
- HuggingFaceTB/stack-edu
- HuggingFaceFW/finepdfs-edu
language:
- nb
- nn
- 'no'
base_model:
- allenai/OLMo-2-1124-13B
library_name: transformers
tags:
- norwegian
- norsk
- HPLT
---
This is a base (not instruction-tuned) large language model, continually pre-trained on Norwegian data starting from the English OLMo2-13B model.

Our training data mixture for the first stage (steps 0–24 000) included:
- HPLTv3: Bokmål and Nynorsk, as well as Faroese, Icelandic, Danish, and Swedish
- FinePDF: Bokmål and Nynorsk, as well as Faroese, Icelandic, Danish, and Swedish
- OLMo-Mix
- a Northern Sami dataset

For the second stage (steps 24 000–33 000), the mixture included:
- filtered HPLTv3: Bokmål and Nynorsk, as well as Faroese, Icelandic, Danish, and Swedish
- FinePDF-Edu: Bokmål and Nynorsk, as well as Icelandic, Danish, Swedish, and English
- FinePDF: Faroese
- Stack-Edu
- MegaMath Web-Pro
- FineMath 4+
- InfiWebMath 4+
- a Northern Sami dataset

The model was trained for 33 000 steps on around 300 billion tokens. Intermediate checkpoints are published in this repository as branches.
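Since intermediate checkpoints live on branches of the model repository, a specific checkpoint can be loaded by passing the branch name as `revision` in `transformers`. A minimal sketch; the repository id and branch name below are placeholders, not the real identifiers:

```python
# Minimal loading sketch with transformers. "org/model-id" and "step24000"
# are placeholders -- substitute the actual Hub repo id and an actual
# checkpoint branch name from this repository.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_checkpoint(repo_id: str, revision: str = "main"):
    """Load tokenizer and model; pass a branch name as `revision`
    to get an intermediate training checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision)
    return tokenizer, model


if __name__ == "__main__":
    # Placeholder identifiers; replace with the real repo id and branch.
    tokenizer, model = load_checkpoint("org/model-id", revision="step24000")
    inputs = tokenizer("Noreg er", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As a base model with no instruction tuning, it is best prompted for plain text continuation, as in the sketch above.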
Training was conducted as part of the [HPLT project](https://hplt-project.org/).

This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350, and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546].