DWDSmor – German Morphology
DWDSmor implements the lemmatisation and morphosyntactic analysis of word forms as well as the generation of paradigms of lexical words in written German. DWDSmor’s finite state transducers (the DWDSmor automata) map word forms to specifications of corresponding lexical words and morphosyntactic categories. By traversing such transducers
- a given word form can be analysed and lemmatised, or
- a lexical word together with a set of morphosyntactic tags will generate corresponding inflected word forms.
The DWDSmor automata are compiled and traversed via SFST, a C++ library and toolbox for finite-state transducers (FSTs). In addition, a DWDSmor Python library is provided, using the SFST Python bindings.
The coverage of the DWDSmor automata of the German language depends on
- a DWDSmor lexicon, which declares lexical entries with lemmas, stem forms, word classes, inflection classes, etc., and
- the DWDSmor grammar, which defines lemmatisation, inflection, and word-formation rules for written German.
While the DWDSmor grammar for word-formation is still work in progress, its inflection grammar is pretty comprehensive. The inflection grammar as well as the lexicon format are based on (heavily modified) code from SMORLemma, which in turn is derived from the Stuttgart Morphology (SMOR).
As a rule, the entries in a DWDSmor lexicon are extracted from a source lexicon comprising a set of XML files in the format of the DWDS dictionary.
From a DWDSmor lexicon and the DWDSmor grammar, a DWDSmor edition can be
compiled, containing several automata types in standard (.a) for
generation and in compact format (.ca) for analysis:
lemma.{a,ca}: transducer with inflection and word-formation components, for lemmatisation and morphosyntactic analysis of word forms in terms of grammatical categories.morph.{a,ca}: transducer with inflection and word-formation components, for the generation of morphologically segmented word forms.finite.{a,ca}: transducer with an inflection component and a finite word-formation component, for testing purposes.root.{a,ca}: transducer with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in term of the Pattern-and-Restriction Theory of word formation (Nolda 2022).index.{a,ca}: transducer with an inflection component only with DWDS homographic lemma indices, for paradigm generation.
Currently, the following DWDSmor editions are supported:
- the DWDS Edition, derived from the complete lexical dataset of the DWDS dictionary, and
- the Open Edition, based on a sample selection of DWDS lemmas and their grammatical specifications.
The automata of the DWDS Edition are available upon request for research purposes. The automata of the Open Edition, as well as its sample source lexicon, are released freely for general use and experiments. For testing purposes, DWDSmor also allows for compiling a development edition from a user-provided source lexicon.
The coverage of the released DWDSmor automata is measured against the German Universal Dependencies HDT treebank and reported on the Hugging Face Hub page. In the DWDS Edition, the coverage ratios typically range from 95 % to 100 % for most word classes; notable exceptions include foreign-language words and named entities, which are barely part of the underlying DWDS dictionary and thus poorly covered by DWDSmor. In the Open Edition, the coverage ratios of open word classes are lower, due to the limited size of the sample source lexicon.
Usage
The DWDSmor Python library is available via the Python Package Index (PyPI):
pip install dwdsmor
The library can be used for lemmatisation:
>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"V"}).analysis == "testen"
>>> assert lemmatizer("getestet", pos={"ADJ"}).analysis == "getestet"
There is also integration with spacy:
>>> import spacy
>>> import dwdsmor.spacy
>>> nlp = spacy.load("de_hdt_lg")
>>> nlp.add_pipe("dwdsmor")
<dwdsmor.spacy.Component object at 0x7f99e634f220>
>>> tuple((t.lemma_, t._.dwdsmor.analysis) for t in nlp("Das ist ein Test."))
(('der', 'die'), ('sein', 'sein'), ('ein', 'eine'), ('Test', 'Test'), ('.', '.'))
In addition to the Python API, the package provides a simple
command-line interface named dwdsmor. To analyze a word form, pass it
as an argument:
$ dwdsmor gebildet
Wordform Lemma Analysis POS Degree Function Nonfinite Tense Auxiliary
gebildet bilden bild<~>en<+V><Part><Perf><haben> V Part Perf haben
gebildet gebildet ge<~>bild<~>et<+ADJ><Pos><Pred/Adv> ADJ Pos Pred/Adv
To generate all word forms for a lexical word, pass it (or a form
which can be analyzed as the lexical word) as an argument together
with the option -g:
$ dwdsmor -g gebildet
[…]
Wordform Lemma Analysis POS Degree Function Person Gender Case Number Nonfinite Tense Mood Auxiliary Inflection
[…]
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St> ADJ Pos Attr/Subst Fem Acc Sg St
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk> ADJ Pos Attr/Subst Fem Acc Sg Wk
gebildeter gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St> ADJ Pos Attr/Subst Fem Dat Sg St
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk> ADJ Pos Attr/Subst Fem Dat Sg Wk
gebildeter gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St> ADJ Pos Attr/Subst Fem Gen Sg St
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk> ADJ Pos Attr/Subst Fem Gen Sg Wk
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><St> ADJ Pos Attr/Subst Fem Nom Sg St
gebildete gebildet gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><Wk> ADJ Pos Attr/Subst Fem Nom Sg Wk
gebildeten gebildet gebildet<+ADJ><Pos><Attr/Subst><Masc><Acc><Sg><St> ADJ Pos Attr/Subst Masc Acc Sg St
[…]
bildeten bilden bild<~>en<+V><1><Pl><Past><Ind> V 1 Pl Past Ind
bildeten bilden bild<~>en<+V><1><Pl><Past><Subj> V 1 Pl Past Subj
bilden bilden bild<~>en<+V><1><Pl><Pres><Ind> V 1 Pl Pres Ind
bilden bilden bild<~>en<+V><1><Pl><Pres><Subj> V 1 Pl Pres Subj
bildete bilden bild<~>en<+V><1><Sg><Past><Ind> V 1 Sg Past Ind
bildete bilden bild<~>en<+V><1><Sg><Past><Subj> V 1 Sg Past Subj
bilde bilden bild<~>en<+V><1><Sg><Pres><Ind> V 1 Sg Pres Ind
bilde bilden bild<~>en<+V><1><Sg><Pres><Subj> V 1 Sg Pres Subj
bildetet bilden bild<~>en<+V><2><Pl><Past><Ind> V 2 Pl Past Ind
[…]
More sophisticated tools for analysis and paradigm generation with the
DWDSmor Python library are provided by the Python scripts analysis.py
and paradigm.py in the tools/ subdirectory:
$ ./tools/analysis.py -h
usage: analysis.py [-h] [-a] [-c] [-C] [-d AUTOMATA_DIR] [-e] [-H] [-I] [-j]
[-m] [-M] [-P] [-s] [-S] [-t {root,finite,morph,index,lemma}]
[-T {root,finite,morph,index,lemma}] [-v] [-y]
[input] [output]
positional arguments:
input input file (one word form per line; default: stdin)
output output file (default: stdout)
options:
-h, --help show this help message and exit
-a, --analysis-string output also analysis string
-c, --csv output CSV table
-C, --force-color preserve color and formatting when piping output
-d AUTOMATA_DIR, --automata-dir AUTOMATA_DIR
automata directory (default: build/open)
-e, --empty show empty columns or values
-H, --no-header suppress table header
-I, --no-info do not show analyses with info tags
-j, --json output JSON object
-m, --minimal prefer lemmas with minimal number of boundaries
-M, --maximal prefer word forms with maximal number of boundaries (requires secondary automaton)
-P, --plain suppress color and formatting
-s, --seg-lemma output also segmented lemma
-S, --seg-word output also segmented word form (requires secondary automaton)
-t {root,finite,morph,index,lemma}, --automaton-type {root,finite,morph,index,lemma}
type of primary automaton (default: lemma)
-T {root,finite,morph,index,lemma}, --automaton2-type {root,finite,morph,index,lemma}
type of secondary automaton (default: morph)
-v, --version show program's version number and exit
-y, --yaml output YAML document
$ ./tools/paradigm.py -h
usage: paradigm.py [-h] [-c] [-C] [-d AUTOMATA_DIR] [-e] [-H]
[-i {1,2,3,4,5,6,7,8}] [-I {1,2,3,4,5,6,7,8}] [-j] [-n] [-N] [-o] [-O]
[-p {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}]
[-P] [-s] [-S] [-t {lemma,morph,index,finite,root}] [-u] [-v] [-y]
lemma [output]
positional arguments:
lemma lemma (determiners: Fem Nom Sg; nominalised adjectives: Wk)
output output file (default: stdout)
options:
-h, --help show this help message and exit
-c, --csv output CSV table
-C, --force-color preserve color and formatting when piping output
-d AUTOMATA_DIR, --automata-dir AUTOMATA_DIR
automata directory (default: build/dwds)
-e, --empty show empty columns or values
-H, --no-header suppress table header
-i {1,2,3,4,5,6,7,8}, --lemma-index {1,2,3,4,5,6,7,8}
homographic lemma index
-I {1,2,3,4,5,6,7,8}, --paradigm-index {1,2,3,4,5,6,7,8}
paradigm index
-j, --json output JSON object
-n, --no-cats do not output category names
-N, --no-lemma do not output lemma, lemma index, paradigm index, and lexical categories
-o, --old output also archaic forms
-O, --oldorth output also forms in old spelling
-p {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}, --pos {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}
part of speech
-P, --plain suppress color and formatting
-s, --nonst output also non-standard forms
-S, --ch output also forms in Swiss spelling
-t {lemma,morph,index,finite,root}, --automaton-type {lemma,morph,index,finite,root}
automaton type (default: index)
-u, --user-specified use only user-specified information
-v, --version show program's version number and exit
-y, --yaml output YAML document
Development
DWDSmor is in active development. In its current stage, it supports all major inflection classes and some productive word-formation patterns of written German.
Prerequisites
- GNU/Linux: Development, builds and tests of DWDSmor are performed on Debian GNU/Linux. While other UNIX-like operating systems such as MacOS should work, too, they are not actively supported.
- Python ≥ v3.10: DWDSmor provides a Python interface for building DWDSmor lexica from source lexica in the DWDS XML format and for compiling DWDSmor automata from the resulting DWDSmor lexica and the DWDSmor grammar.
- Saxon-HE: The entries in DWDSmor lexica are extracted from source lexica in the DWDS XML format by means of XSLT 2 stylesheets, using Saxon-HE as an XSLT processor. Saxon requires a Java runtime environment.
- SFST: The DWDSmor automata are compiled using the SFST C++ library and toolbox for finite-state transducers (FSTs).
On a Debian-based distribution, the following command install the required software:
apt-get install python3 default-jdk libsaxonhe-java sfst
Project setup
Optionally, set up a Python virtual environment for project builds,
i. e. via Python’s venv:
python3 -m venv .venv
source .venv/bin/activate
Then install DWDSmor, including development dependencies:
pip install -U pip setuptools && pip install -e '.[dev]'
Building lexica and automata
Building different editions is facilitated via the script build-dwdsmor:
$ ./build-dwdsmor -h
usage: cli.py [-h] [-e {open,dwds,dev}] [-t {root,index,morph,finite,lemma}]
[-m] [-f] [-q] [--release] [--tag]
Build DWDSmor.
options:
-h, --help show this help message and exit
-e {open,dwds,dev}, --edition {open,dwds,dev}
edition to build (default: all)
-t {root,index,morph,finite,lemma}, --automaton-type {root,index,morph,finite,lemma}
automaton type to build (default: all)
-m, --with-metrics measure UD/de-hdt coverage
-f, --force force building (also current targets)
-q, --quiet do not report progress
--release push automata to HF hub
--tag tag HF hub release with current version
To build all editions available in the current git checkout, run:
./build-dwdsmor
The build result can be found in build/ with one subdirectory per
edition.
Testing
In order to test basic transducer usage and for potential regressions, run
pytest
License
As the original SMOR and SMORLemma grammars, the DWDSmor grammar and the DWDSmor Python library are licensed under the GNU General Public License v2.0. The same applies to the sample source lexicon and the automata of the Open Edition.
For the DWDS Edition based on the complete DWDS dictionary, all rights are reserved and individual license terms apply. If you are interested in the automata of the DWDS Edition, please contact us.
Contact
Feel free to contact Andreas Nolda for any question about this project.
Credits
DWSDmor is based on the following software and datasets:
- SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmidt 2006).
- SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component.
- the DWDS dictionary (BBAW n.d.) replacing the IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphosyntactic properties.
References
- Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart. Online
- Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart. PDF
- Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376. PDF.
- Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 1263–1266. PDF.
- Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266. PDF
- Sennrich, Rico and Beta Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067. PDF.
Collection including zentrum-lexikographie/dwdsmor-open
Evaluation results
- Coverage on Universal Dependencies Treebank (de-hdt)self-reported0.858
- Coverage ($() on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage ($,) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage ($.) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (ADJA) on Universal Dependencies Treebank (de-hdt)self-reported0.777
- Coverage (ADJD) on Universal Dependencies Treebank (de-hdt)self-reported0.769
- Coverage (ADV) on Universal Dependencies Treebank (de-hdt)self-reported0.969
- Coverage (APPO) on Universal Dependencies Treebank (de-hdt)self-reported0.999
- Coverage (APPR) on Universal Dependencies Treebank (de-hdt)self-reported0.998
- Coverage (APPRART) on Universal Dependencies Treebank (de-hdt)self-reported0.997
- Coverage (APZR) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (ART) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (CARD) on Universal Dependencies Treebank (de-hdt)self-reported0.962
- Coverage (FM) on Universal Dependencies Treebank (de-hdt)self-reported0.135
- Coverage (ITJ) on Universal Dependencies Treebank (de-hdt)self-reported0.714
- Coverage (KOKOM) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (KON) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (KOUI) on Universal Dependencies Treebank (de-hdt)self-reported1.000
- Coverage (KOUS) on Universal Dependencies Treebank (de-hdt)self-reported0.986
- Coverage (NE) on Universal Dependencies Treebank (de-hdt)self-reported0.063