Commit e56677f by mschonhardt · verified · 1 parent: 836c049

Update README.md

Files changed (1): README.md (+92 −109)
---
language: la
library_name: flair
license: cc-by-sa-4.0
tags:
- flair
- token-classification
- sequence-tagger
- latin
- medieval-latin
- legal-history
- lemmatization
- seq2seq
widget:
- text: "Et videtur, quod sic, quia res empta de pecunia pupilli efficitur"
---

# Latin Lemmatizer (Flair)

This model is a specialized **sequence-to-sequence (Seq2Seq)** lemmatizer for Latin. Unlike simple lookup-based lemmatizers, it uses an encoder-decoder architecture with attention to "translate" inflected Latin word forms into their dictionary headwords (lemmas), making it well suited to the complex morphology of medieval texts.

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

## Technical Details

- **Architecture:** Seq2Seq lemmatizer (RNN-based encoder-decoder with attention, as implemented by Flair's `Lemmatizer`).
- **Hidden size:** 2048 (4 layers).
- **Base embeddings:** Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
- **Data source:** ~1.59M sentences from medieval texts.
- **Beam size:** 1.

## Data Source and Acknowledgements

We gratefully acknowledge that the training data originates from the **[Latin Text Archive (LTA)](http://lta.bbaw.de)** (**Prof. Dr. Bernhard Jussen**, **Dr. Tim Geelhaar**), including data from the Monumenta Germaniae Historica, Corpus Corporum, and the IRHT.

## Evaluation

Token-level **exact-match lemma accuracy** on the held-out test split is **95.93%** (~4.6M tokens; 199,037 sentences). This score is computed with `flair.models.Lemmatizer.evaluate()` and corresponds to the proportion of tokens whose predicted lemma string exactly matches the gold lemma string.

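The exact-match criterion reduces to a plain token-level string comparison. A minimal sketch with synthetic toy data (not the actual test split); the skipping of invalid gold labels mirrors the dataset loader's filtering described below:

```python
def exact_match_accuracy(predicted, gold, invalid=(None, "<UNK>")):
    """Share of tokens whose predicted lemma string equals the gold lemma.

    Tokens with missing/invalid gold lemmas are skipped, mirroring the
    dataset loader's filtering.
    """
    pairs = [(p, g) for p, g in zip(predicted, gold) if g not in invalid]
    return sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0

# Toy example only, not the actual evaluation data.
pred = ["et", "video", "qui", "sum"]
gold = ["et", "video", "qui", "<UNK>"]
print(exact_match_accuracy(pred, gold))  # 1.0 over the three valid tokens
```
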
### Important notes regarding evaluation

- **Micro-F1 vs. accuracy:** Flair's evaluation report shows micro-/macro-F1. For lemmatization, these are derived by treating each *unique lemma string* as a "class". In this setting, **micro-F1 ≈ accuracy** by construction, while **macro-F1 is typically much lower** due to the extremely long-tailed lemma inventory and is not the primary lemmatization metric.
- **Filtered tokens:** Tokens with missing/invalid gold lemmas (e.g., `None` or `<UNK>`) are **excluded** by the dataset loader. This can make results slightly optimistic compared to evaluating on the raw, unfiltered token stream.
- **Tokenization conventions:** The model is sensitive to the tokenization used during training (e.g., punctuation separated by spaces). Different tokenization may reduce accuracy.
- **Decoding and length limits:** Decoding is **greedy** (`beam_size=1`), and very long tokens may be affected by the model's maximum sequence length settings.
- **Domain shift:** Trained on medieval Latin. Performance may drop on texts with different orthography or lexicon (e.g., classical poetry, heavily abbreviated editions, mixed-language passages).

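To match the training-time convention of space-separated punctuation, input text can be pre-tokenized before prediction. A minimal sketch; the exact punctuation set used in training is an assumption here:

```python
import re

def pretokenize(text):
    # Separate common punctuation marks with spaces, then split on whitespace.
    return re.sub(r"([.,;:!?])", r" \1 ", text).split()

print(pretokenize("Et videtur, quod sic, quia res empta"))
# ['Et', 'videtur', ',', 'quod', 'sic', ',', 'quia', 'res', 'empta']
```
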
### Performance Metrics

The model was evaluated on a test set of 199,037 sentences (~4.6M tokens).

| Metric | Score |
| :--- | :--- |
| **Token exact-match accuracy** | **95.93%** |

## Usage

You can use this model with the [Flair](https://github.com/flairNLP/flair) library; see the notebook in this repository for a complete walkthrough.

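A minimal prediction example (punctuation separated by spaces, matching the training convention):

```python
from flair.models import Lemmatizer
from flair.data import Sentence

# Load the model from the Hugging Face Hub
tagger = Lemmatizer.load('mschonhardt/latin-lemmatizer')

# Create a sentence
sentence = Sentence("Et videtur , quod sic , quia res empta de pecunia pupilli efficitur")

# Predict lemmas
tagger.predict(sentence)

# Print one "form -> lemma" line per token
for token in sentence:
    lemma = token.get_label("lemma").value
    print(f"{token.text} -> {lemma}")
```
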
## Training Configuration

* Learning rate: 0.05
* Mini-batch size: 768 (with AMP enabled)
* Max epochs: 15
* Optimizer: standard SGD (with Flair's `ModelTrainer`)
* Character dictionary: custom-built, covering the Latin alphabet and special diplomatic characters.

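With these hyperparameters, the training call might look like the following sketch. The output path and the `lemmatizer`/`corpus` variables are placeholders (not provided by this repository), and the exact `train()` keyword arguments vary between Flair versions:

```python
from flair.trainers import ModelTrainer

# `lemmatizer` and `corpus` are assumed to be an initialized flair
# Lemmatizer and a lemma-annotated Corpus (placeholders).
trainer = ModelTrainer(lemmatizer, corpus)
trainer.train(
    "resources/taggers/latin-lemmatizer",  # output directory (placeholder)
    learning_rate=0.05,
    mini_batch_size=768,
    max_epochs=15,
)
```
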
## Citation

If you use this model, please cite the specific model DOI and the Flair framework:

```bibtex
@software{schonhardt_michael_2026_latin_lemma,
  author    = "Schonhardt, Michael",
  title     = "Latin Lemmatizer (Flair)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18632650",
  url       = "https://huggingface.co/mschonhardt/latin-lemmatizer"
}
```

+
84
+ ```bibtex
85
+ @inproceedings{akbik-etal-2018-contextual,
86
+ title = "Contextual String Embeddings for Sequence Labeling",
87
+ author = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
88
+ booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
89
+ year = "2018",
90
+ pages = "1638--1649",
91
+ publisher = "Association for Computational Linguistics"
92
+ }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
93
  ```