---
language: la
library_name: flair
license: cc-by-sa-4.0
tags:
- flair
- token-classification
- sequence-tagger
- latin
- medieval-latin
- legal-history
- lemmatization
- seq2seq
widget:
- text: "Et videtur, quod sic, quia res empta de pecunia pupilli efficitur"
---

# Latin Lemmatizer (Flair)

This model is a specialized **sequence-to-sequence (Seq2Seq)** lemmatizer for Latin. Unlike simple lookup-based lemmatizers, it uses an encoder-decoder architecture with attention to "translate" inflected Latin word forms into their dictionary headwords (lemmas), which makes it well suited to the complex morphology of medieval texts.

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

## Technical Details

- **Architecture:** Seq2Seq lemmatizer (RNN-based encoder-decoder with attention, as implemented by Flair's `Lemmatizer`).
- **Hidden Size:** 2048 (4 layers).
- **Base Embeddings:** Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
- **Data Source:** ~1.59M sentences from medieval Latin texts.
- **Beam Size:** 1 (greedy decoding).

## Data Source and Acknowledgements

We gratefully acknowledge that the training data originates from the **[Latin Text Archive (LTA)](http://lta.bbaw.de)** (**Prof. Dr. Bernhard Jussen**, **Dr. Tim Geelhaar**), including data from the Monumenta Germaniae Historica, Corpus Corporum, and the IRHT.

## Evaluation

Token-level **exact-match lemma accuracy** on the held-out test split is **95.93%** (~4.6M tokens; 199,037 sentences). This score is computed with `flair.models.Lemmatizer.evaluate()` and corresponds to the proportion of tokens whose predicted lemma string exactly matches the gold lemma string.

### Important notes regarding evaluation

- **Micro-F1 vs. accuracy:** Flair's evaluation report shows micro- and macro-F1. For lemmatization, these are derived by treating each *unique lemma string* as a "class". In this setting, **micro-F1 ≈ accuracy** by construction, while **macro-F1 is typically much lower** because of the extremely long-tailed lemma inventory; it is not the primary lemmatization metric.
- **Filtered tokens:** Tokens with missing or invalid gold lemmas (e.g., `None` or `<UNK>`) are **excluded** by the dataset loader. This can make results slightly optimistic compared to evaluating on the raw, unfiltered token stream.
- **Tokenization conventions:** The model is sensitive to the tokenization used during training (e.g., punctuation separated by spaces). Different tokenization may reduce accuracy.
- **Decoding and length limits:** Decoding is **greedy** (`beam_size=1`), and very long tokens may be affected by the model's maximum-sequence-length settings.
- **Domain shift:** The model was trained on medieval Latin. Performance may drop on texts with different orthography or lexicon (e.g., classical poetry, heavily abbreviated editions, mixed-language passages).

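The micro-F1 ≈ accuracy point above can be verified with a small self-contained sketch (the lemma pairs are toy examples, not taken from the evaluation data): when every token carries exactly one gold and one predicted lemma, each mismatch counts as both a false positive and a false negative, so micro-precision, micro-recall, and micro-F1 all collapse to plain accuracy.

```python
def micro_f1(preds, golds):
    # One predicted and one gold lemma per token: every exact match is a
    # true positive; every mismatch is simultaneously a false positive
    # (wrong lemma predicted) and a false negative (gold lemma missed).
    tp = sum(p == g for p, g in zip(preds, golds))
    fp = fn = len(golds) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy lemma pairs (illustrative only)
preds = ["et", "video", "qui", "res"]
golds = ["et", "video", "quod", "res"]

accuracy = sum(p == g for p, g in zip(preds, golds)) / len(golds)
print(accuracy, micro_f1(preds, golds))  # both 0.75
```

Macro-F1, by contrast, averages F1 over each unique lemma "class", so a single rare lemma that is always mispredicted drags the average down as much as a frequent one.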
### Performance Metrics

The model was evaluated on a test set of 199,037 sentences (~4.6M tokens).

| Metric | Score |
| :--- | :--- |
| **Token exact-match accuracy** | **95.93%** |

## Usage

You can use this model with the [Flair](https://github.com/flairNLP/flair) library. See the notebook in this repository for a full walkthrough.

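A minimal sketch of programmatic use. The hub id is assumed from this repository's URL, and `pretokenize` is a hypothetical helper illustrating the punctuation-spacing convention noted under "Tokenization conventions"; it is not part of the model.

```python
import re

from flair.data import Sentence
from flair.models import Lemmatizer

def pretokenize(text: str) -> str:
    # Hypothetical helper: separate punctuation by spaces, mirroring the
    # tokenization convention the model was trained with.
    return re.sub(r"\s*([.,;:!?])\s*", r" \1 ", text).strip()

# Load from the Hugging Face Hub (downloads the model on first use)
lemmatizer = Lemmatizer.load("mschonhardt/latin-lemmatizer")

sentence = Sentence(pretokenize("Et videtur, quod sic, quia res empta de pecunia pupilli efficitur"))
lemmatizer.predict(sentence)

for token in sentence:
    print(token.text, "->", token.get_label("lemma").value)
```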
## Training Configuration

* Learning Rate: 0.05
* Mini Batch Size: 768 (with AMP enabled)
* Max Epochs: 15
* Optimizer: Standard SGD (with Flair's `ModelTrainer`)
* Character Dictionary: Custom-built, covering the Latin alphabet and special diplomatic characters.

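The configuration above corresponds roughly to the following training call. This is a sketch under assumptions: `lemmatizer` and `corpus` stand in for the actual model object and training corpus (not published here), the output path is illustrative, and flag names can differ across Flair versions.

```python
from flair.trainers import ModelTrainer

def train(lemmatizer, corpus, out_dir="resources/taggers/latin-lemmatizer"):
    # Hyperparameters taken from the list above; Flair's ModelTrainer
    # uses plain SGD by default.
    trainer = ModelTrainer(lemmatizer, corpus)
    trainer.train(
        out_dir,
        learning_rate=0.05,
        mini_batch_size=768,
        max_epochs=15,
    )
```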
## Citation

If you use this model, please cite the specific model DOI and the Flair framework:

```bibtex
@software{schonhardt_michael_2026_latin_lemma,
  author    = "Schonhardt, Michael",
  title     = "Latin Lemmatizer (Flair)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18632650",
  url       = "https://huggingface.co/mschonhardt/latin-lemmatizer"
}
```

```bibtex
@inproceedings{akbik-etal-2018-contextual,
  title     = "Contextual String Embeddings for Sequence Labeling",
  author    = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year      = "2018",
  pages     = "1638--1649",
  publisher = "Association for Computational Linguistics"
}
```