Commit e56677f by mschonhardt · verified · 1 parent: 836c049

Update README.md

Files changed (1): README.md (+92 −109)
---
language: la
library_name: flair
license: cc-by-sa-4.0
tags:
- flair
- token-classification
- sequence-tagger
- latin
- medieval-latin
- legal-history
- lemmatization
- seq2seq
widget:
- text: "Et videtur, quod sic, quia res empta de pecunia pupilli efficitur"
---

# Latin Lemmatizer (Flair)

This model is a specialized **sequence-to-sequence (Seq2Seq)** lemmatizer for Latin. Unlike simple lookup-based lemmatizers, it uses an encoder-decoder architecture with attention to "translate" inflected Latin word forms into their dictionary headwords (lemmas), making it well suited to the complex morphology of medieval texts.

The model was developed as part of the projects **"Embedding the Past"** (LOEWE-Exploration, TU Darmstadt) and **"Burchards Dekret Digital"** (Langzeitvorhaben, Akademie der Wissenschaften und der Literatur | Mainz).

## Technical Details

- **Architecture:** Seq2Seq lemmatizer (RNN-based encoder-decoder with attention, as implemented by Flair's `Lemmatizer`).
- **Hidden size:** 2048 (4 layers).
- **Base embeddings:** Stacked [Latin Legal Forward](https://huggingface.co/mschonhardt/latin-legal-forward) and [Backward](https://huggingface.co/mschonhardt/latin-legal-backward) contextual string embeddings.
- **Data source:** ~1.59M sentences from medieval texts.
- **Beam size:** 1.

## Data Source and Acknowledgements

We gratefully acknowledge that the training data originates from the **[Latin Text Archive (LTA)](http://lta.bbaw.de)** (**Prof. Dr. Bernhard Jussen**, **Dr. Tim Geelhaar**), including data from the Monumenta Germaniae Historica, Corpus Corporum, and the IRHT.

## Evaluation

Token-level **exact-match lemma accuracy** on the held-out test split is **95.93%** (~4.6M tokens; 199,037 sentences). This score is computed with `flair.models.Lemmatizer.evaluate()` and corresponds to the proportion of tokens whose predicted lemma string exactly matches the gold lemma string.

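The exact-match criterion reduces to a plain token-level string comparison. A minimal sketch with synthetic toy data (not the actual test split); the skipping of invalid gold labels mirrors the dataset loader's filtering described below:

```python
def exact_match_accuracy(predicted, gold, invalid=(None, "<UNK>")):
    """Share of tokens whose predicted lemma string equals the gold lemma.

    Tokens with missing/invalid gold lemmas are skipped, mirroring the
    dataset loader's filtering.
    """
    pairs = [(p, g) for p, g in zip(predicted, gold) if g not in invalid]
    return sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0

# Toy example only, not the actual evaluation data.
pred = ["et", "video", "qui", "sum"]
gold = ["et", "video", "qui", "<UNK>"]
print(exact_match_accuracy(pred, gold))  # 1.0 over the three valid tokens
```
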
### Important notes regarding evaluation

- **Micro-F1 vs. accuracy:** Flair's evaluation report shows micro-/macro-F1. For lemmatization, these are derived by treating each *unique lemma string* as a "class". In this setting, **micro-F1 ≈ accuracy** by construction, while **macro-F1 is typically much lower** due to the extremely long-tailed lemma inventory and is not the primary lemmatization metric.
- **Filtered tokens:** Tokens with missing/invalid gold lemmas (e.g., `None` or `<UNK>`) are **excluded** by the dataset loader. This can make results slightly optimistic compared to evaluating on the raw, unfiltered token stream.
- **Tokenization conventions:** The model is sensitive to the tokenization used during training (e.g., punctuation separated by spaces). Different tokenization may reduce accuracy.
- **Decoding and length limits:** Decoding is **greedy** (`beam_size=1`), and very long tokens may be affected by the model's maximum sequence length settings.
- **Domain shift:** Trained on medieval Latin. Performance may drop on texts with different orthography or lexicon (e.g., classical poetry, heavily abbreviated editions, mixed-language passages).

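To match the training-time convention of space-separated punctuation, input text can be pre-tokenized before prediction. A minimal sketch; the exact punctuation set used in training is an assumption here:

```python
import re

def pretokenize(text):
    # Separate common punctuation marks with spaces, then split on whitespace.
    return re.sub(r"([.,;:!?])", r" \1 ", text).split()

print(pretokenize("Et videtur, quod sic, quia res empta"))
# ['Et', 'videtur', ',', 'quod', 'sic', ',', 'quia', 'res', 'empta']
```
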
### Performance Metrics

The model was evaluated on a test set of 199,037 sentences (~4.6M tokens).

| Metric | Score |
| :--- | :--- |
| **Token exact-match accuracy** | **95.93%** |

## Usage

You can use this model with the [Flair](https://github.com/flairNLP/flair) library; see the notebook in this repository for a complete walkthrough.

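A minimal prediction example (punctuation separated by spaces, matching the training convention):

```python
from flair.models import Lemmatizer
from flair.data import Sentence

# Load the model from the Hugging Face Hub
tagger = Lemmatizer.load('mschonhardt/latin-lemmatizer')

# Create a sentence
sentence = Sentence("Et videtur , quod sic , quia res empta de pecunia pupilli efficitur")

# Predict lemmas
tagger.predict(sentence)

# Print one "form -> lemma" line per token
for token in sentence:
    lemma = token.get_label("lemma").value
    print(f"{token.text} -> {lemma}")
```
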
## Training Configuration

* Learning rate: 0.05
* Mini-batch size: 768 (with AMP enabled)
* Max epochs: 15
* Optimizer: standard SGD (with Flair's `ModelTrainer`)
* Character dictionary: custom-built, covering the Latin alphabet and special diplomatic characters.

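With these hyperparameters, the training call might look like the following sketch. The output path and the `lemmatizer`/`corpus` variables are placeholders (not provided by this repository), and the exact `train()` keyword arguments vary between Flair versions:

```python
from flair.trainers import ModelTrainer

# `lemmatizer` and `corpus` are assumed to be an initialized flair
# Lemmatizer and a lemma-annotated Corpus (placeholders).
trainer = ModelTrainer(lemmatizer, corpus)
trainer.train(
    "resources/taggers/latin-lemmatizer",  # output directory (placeholder)
    learning_rate=0.05,
    mini_batch_size=768,
    max_epochs=15,
)
```
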
## Citation

If you use this model, please cite the specific model DOI and the Flair framework:

```bibtex
@software{schonhardt_michael_2026_latin_lemma,
  author    = "Schonhardt, Michael",
  title     = "Latin Lemmatizer (Flair)",
  year      = "2026",
  publisher = "Zenodo",
  doi       = "10.5281/zenodo.18632650",
  url       = "https://huggingface.co/mschonhardt/latin-lemmatizer"
}
```

+
84
+ ```bibtex
85
+ @inproceedings{akbik-etal-2018-contextual,
86
+ title = "Contextual String Embeddings for Sequence Labeling",
87
+ author = "Akbik, Alan and Blythe, Duncan and Vollgraf, Roland",
88
+ booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
89
+ year = "2018",
90
+ pages = "1638--1649",
91
+ publisher = "Association for Computational Linguistics"
92
+ }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
93
  ```