---
language:
- da
- no
license: cc-by-4.0
datasets:
- MiMe-MeMo/Corpus-v1.1
- MiMe-MeMo/Sentiment-v1
- MiMe-MeMo/WSD-Skaebne
metrics:
- f1
tags:
- historical-texts
- digital-humanities
- sentiment-analysis
- word-sense-disambiguation
- danish
- norwegian
model-index:
- name: MeMo-BERT-01
  results:
  - task:
      type: text-classification
      name: Sentiment Analysis
    dataset:
      name: MiMe-MeMo/Sentiment-v1
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.56
  - task:
      type: text-classification
      name: Word Sense Disambiguation
    dataset:
      name: MiMe-MeMo/WSD-Skaebne
      type: text
    metrics:
    - name: f1
      type: f1
      value: 0.43
---

# MeMo-BERT-01

**MeMo-BERT-01** is a pre-trained language model for **historical Danish and Norwegian literary texts** (1870–1900). It was introduced in [Al-Laith et al. (2024)](https://aclanthology.org/2024.lrec-main.431/) as one of the first dedicated PLMs for historical Danish and Norwegian.

## Model Description

- **Architecture:** BERT-base (12 layers, hidden size 768, 12 attention heads, vocabulary size 30k)
- **Pre-training strategy:** trained **from scratch** on the MeMo corpus (no prior pre-training)
- **Training objective:** masked language modeling (MLM, 15% masking)
- **Training data:** MeMo Corpus v1.1 (839 novels, ~53M words, 1870–1900)
- **Hardware:** 2 × A100 GPUs
- **Training time:** ~44 hours

This model is the **baseline historical-domain model**, trained entirely on 19th-century Scandinavian novels.

## Intended Use

- **Primary tasks:**
  - Sentiment analysis (positive, neutral, negative)
  - Word sense disambiguation (historical vs. modern senses of *skæbne*, "fate")
- **Intended users:**
  - Researchers in digital humanities, computational linguistics, and Scandinavian studies
  - Literary historians studying 19th-century Scandinavian novels
- **Not intended for:**
  - Contemporary Danish/Norwegian NLP tasks
  - High-stakes applications (e.g., legal, medical, or political decision-making)

## Training Data

- **Corpus:** [MeMo Corpus v1.1](https://huggingface.co/datasets/MiMe-MeMo/Corpus-v1.1) (Bjerring-Hansen et al. 2022)
- **Time period:** 1870–1900
- **Size:** 839 novels, 690 MB, 3.2M sentences, 52.7M words
- **Preprocessing:** OCR-corrected, normalized to modern Danish spelling, tokenized, lemmatized, and annotated

## Evaluation

### Benchmarks

| Task | Dataset | Test F1 | Notes |
|------|---------|---------|-------|
| Sentiment Analysis | MiMe-MeMo/Sentiment-v1 | **0.56** | 3-class (positive/negative/neutral) |
| Word Sense Disambiguation | MiMe-MeMo/WSD-Skaebne | **0.43** | 4-class (pre-modern, modern, figure of speech, ambiguous) |

### Comparison

MeMo-BERT-01 performs **worse than MeMo-BERT-03** (continued pre-training), highlighting the limitations of training from scratch on historical data without leveraging contemporary PLMs.

## Limitations

- Trained **from scratch only**, on ~53M words (small by BERT pre-training standards).
- Underperforms continued pre-training (MeMo-BERT-03).
- Domain-specific to late 19th-century novels.
- Residual OCR and normalization errors may remain in the training corpus.

## Ethical Considerations

- All texts are in the **public domain** (authors deceased).
- Datasets are released under **CC BY 4.0**.
- No sensitive personal data is involved.
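## How to Use

The snippet below is a minimal usage sketch with the `transformers` library. It assumes the checkpoint is hosted on the Hugging Face Hub under the repo id `MiMe-MeMo/MeMo-BERT-01` and uses the standard BERT `[MASK]` token (both assumptions; adjust if the hosted name or tokenizer differs). It shows masked-token prediction, plus how to attach a 3-way classification head for fine-tuning on the sentiment task.

```python
# Masked-token prediction with the fill-mask pipeline.
# Assumption: the model is published as "MiMe-MeMo/MeMo-BERT-01".
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="MiMe-MeMo/MeMo-BERT-01")

# Illustrative historical Danish sentence (not drawn from the corpus):
# "Hans skæbne var [MASK]." ("His fate was [MASK].")
for prediction in fill_mask("Hans skæbne var [MASK]."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")

# Starting point for fine-tuning on 3-class sentiment (positive/neutral/negative):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MiMe-MeMo/MeMo-BERT-01")
model = AutoModelForSequenceClassification.from_pretrained(
    "MiMe-MeMo/MeMo-BERT-01",
    num_labels=3,  # positive / neutral / negative
)
```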
## Citation

If you use this model, please cite:

```bibtex
@inproceedings{al-laith-etal-2024-development,
    title = "Development and Evaluation of Pre-trained Language Models for Historical {D}anish and {N}orwegian Literary Texts",
    author = "Al-Laith, Ali and Conroy, Alexander and Bjerring-Hansen, Jens and Hershcovich, Daniel",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    pages = "4811--4819",
    url = "https://aclanthology.org/2024.lrec-main.431/"
}
```