DWDSmor – German Morphology

DWDSmor implements the lemmatisation and morphosyntactic analysis of word forms as well as the generation of paradigms of lexical words in written German. DWDSmor’s finite state transducers (the DWDSmor automata) map word forms to specifications of corresponding lexical words and morphosyntactic categories. By traversing such transducers

  1. a given word form can be analysed and lemmatised, or
  2. a lexical word together with a set of morphosyntactic tags will generate corresponding inflected word forms.

The DWDSmor automata are compiled and traversed via SFST, a C++ library and toolbox for finite-state transducers (FSTs). In addition, a DWDSmor Python library is provided, using the SFST Python bindings.
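
The two traversal directions can be pictured with a toy relation in plain Python. This is only an illustrative sketch, not how the SFST automata are implemented: the names `RELATION`, `analyse`, and `generate` are hypothetical, and the handful of entries is copied from the example outputs in the Usage section.

```python
# Toy relation between word forms and (lemma, tags) pairs.
# A real DWDSmor automaton is a compiled finite-state transducer;
# this list only mimics its input/output behaviour for a few forms.
RELATION = [
    ("gebildet", "bilden", ("V", "Part", "Perf", "haben")),
    ("bilde", "bilden", ("V", "1", "Sg", "Pres", "Ind")),
    ("bilde", "bilden", ("V", "1", "Sg", "Pres", "Subj")),
    ("bildete", "bilden", ("V", "1", "Sg", "Past", "Ind")),
]


def analyse(word_form):
    """Analysis direction: word form -> all (lemma, tags) pairs."""
    return [(lemma, tags) for form, lemma, tags in RELATION if form == word_form]


def generate(lemma, tags):
    """Generation direction: lemma plus tags -> all matching word forms."""
    wanted = tuple(tags)
    return [form for form, lem, t in RELATION if lem == lemma and t == wanted]


print(analyse("bilde"))
print(generate("bilden", ["V", "1", "Sg", "Past", "Ind"]))
```

Both functions read the same relation, which is the point of using a transducer: one data structure serves analysis and generation alike.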

How much of the German language the DWDSmor automata cover depends on

  1. a DWDSmor lexicon, which declares lexical entries with lemmas, stem forms, word classes, inflection classes, etc., and
  2. the DWDSmor grammar, which defines lemmatisation, inflection, and word-formation rules for written German.

While the DWDSmor grammar for word formation is still a work in progress, its inflection grammar is largely comprehensive. The inflection grammar as well as the lexicon format are based on (heavily modified) code from SMORLemma, which in turn is derived from the Stuttgart Morphology (SMOR).

As a rule, the entries in a DWDSmor lexicon are extracted from a source lexicon comprising a set of XML files in the format of the DWDS dictionary.

From a DWDSmor lexicon and the DWDSmor grammar, a DWDSmor edition can be compiled. An edition contains several automata types, each in standard format (.a) for generation and in compact format (.ca) for analysis:

  • lemma.{a,ca}: transducer with inflection and word-formation components, for lemmatisation and morphosyntactic analysis of word forms in terms of grammatical categories.
  • morph.{a,ca}: transducer with inflection and word-formation components, for the generation of morphologically segmented word forms.
  • finite.{a,ca}: transducer with an inflection component and a finite word-formation component, for testing purposes.
  • root.{a,ca}: transducer with inflection and word-formation components, for lexical analysis of word forms in terms of root lemmas (i.e., lemmas of ultimate word-formation bases), word-formation process, word-formation means, and grammatical categories in terms of the Pattern-and-Restriction Theory of word formation (Nolda 2022).
  • index.{a,ca}: transducer with an inflection component only with DWDS homographic lemma indices, for paradigm generation.

Currently, the following DWDSmor editions are supported:

  1. the DWDS Edition, derived from the complete lexical dataset of the DWDS dictionary, and
  2. the Open Edition, based on a sample selection of DWDS lemmas and their grammatical specifications.

The automata of the DWDS Edition are available upon request for research purposes. The automata of the Open Edition, as well as its sample source lexicon, are released freely for general use and experiments. For testing purposes, DWDSmor also allows for compiling a development edition from a user-provided source lexicon.

The coverage of the released DWDSmor automata is measured against the German Universal Dependencies HDT treebank and reported on the Hugging Face Hub page. In the DWDS Edition, the coverage ratios typically range from 95 % to 100 % for most word classes; notable exceptions include foreign-language words and named entities, which are scarcely represented in the underlying DWDS dictionary and thus poorly covered by DWDSmor. In the Open Edition, the coverage ratios of open word classes are lower, due to the limited size of the sample source lexicon.
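
As a rough sketch of what a per-tag coverage ratio means, the following assumes that a token counts as covered when the automaton returns at least one analysis for it. The project's actual metric may be stricter (for example, by also comparing lemmas or tags); the function name `coverage_by_tag` and the toy data are hypothetical.

```python
from collections import Counter


def coverage_by_tag(tokens):
    """Return {tag: share of tokens that received at least one analysis}.

    `tokens` is an iterable of (tag, analysed) pairs, where `analysed`
    is True if an analysis was found for the token.
    """
    total, covered = Counter(), Counter()
    for tag, analysed in tokens:
        total[tag] += 1
        covered[tag] += bool(analysed)
    return {tag: covered[tag] / total[tag] for tag in total}


# Hypothetical toy data, not taken from the treebank.
sample = [("NN", True), ("NN", True), ("NN", False), ("NE", False), ("ART", True)]
ratios = coverage_by_tag(sample)
print(ratios)
```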

Usage

The DWDSmor Python library is available via the Python Package Index (PyPI):

pip install dwdsmor

The library can be used for lemmatisation:

>>> import dwdsmor
>>> lemmatizer = dwdsmor.lemmatizer()
>>> assert lemmatizer("getestet", pos={"V"}).analysis == "testen"
>>> assert lemmatizer("getestet", pos={"ADJ"}).analysis == "getestet"

There is also integration with spaCy:

>>> import spacy
>>> import dwdsmor.spacy
>>> nlp = spacy.load("de_hdt_lg")
>>> nlp.add_pipe("dwdsmor")
<dwdsmor.spacy.Component object at 0x7f99e634f220>
>>> tuple((t.lemma_, t._.dwdsmor.analysis) for t in nlp("Das ist ein Test."))
(('der', 'die'), ('sein', 'sein'), ('ein', 'eine'), ('Test', 'Test'), ('.', '.'))

In addition to the Python API, the package provides a simple command-line interface named dwdsmor. To analyze a word form, pass it as an argument:

$ dwdsmor gebildet
Wordform  	Lemma   	Analysis                           	POS  	Degree  	Function  	Nonfinite  	Tense  	Auxiliary
gebildet  	bilden  	bild<~>en<+V><Part><Perf><haben>   	V   	        	          	Part       	Perf   	haben
gebildet  	gebildet	ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>	ADJ 	Pos     	Pred/Adv
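
The analysis strings in the third column can be taken apart with a small regular expression. The sketch below is not part of the DWDSmor API; `parse_analysis` is a hypothetical helper, and it assumes that `<~>` marks a boundary inside the lemma, that the tag starting with `+` names the part of speech, and that every analysis string contains such a tag.

```python
import re

# Matches one angle-bracket tag such as <+V>, <Part> or <~>.
TAG = re.compile(r"<([^<>]*)>")


def parse_analysis(analysis):
    """Split an analysis string into lemma, segments, POS, and category tags."""
    # Everything before the "<+POS>" tag is the (segmented) lemma.
    seg_lemma = analysis.split("<+", 1)[0]
    # "<~>" is treated as a boundary marker, not as a category.
    cats = [t for t in TAG.findall(analysis) if t != "~"]
    pos = cats[0][1:] if cats and cats[0].startswith("+") else None
    return {
        "lemma": seg_lemma.replace("<~>", ""),
        "segments": seg_lemma.split("<~>"),
        "pos": pos,
        "tags": cats[1:] if pos else cats,
    }


print(parse_analysis("bild<~>en<+V><Part><Perf><haben>"))
print(parse_analysis("ge<~>bild<~>et<+ADJ><Pos><Pred/Adv>"))
```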

To generate all word forms for a lexical word, pass it (or a form which can be analyzed as the lexical word) as an argument together with the option -g:

$ dwdsmor -g gebildet
[…]
Wordform  	Lemma   	Analysis                                             	POS  	Degree  	Function  	  Person	Gender  	Case  	Number  	Nonfinite  	Tense  	Mood  	Auxiliary  	Inflection
[…]
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Acc   	Sg      	           	       	      	           	St
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Acc><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Acc   	Sg      	           	       	      	           	Wk
gebildeter	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Dat   	Sg      	           	       	      	           	St
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Dat><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Dat   	Sg      	           	       	      	           	Wk
gebildeter	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Gen   	Sg      	           	       	      	           	St
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Gen><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Gen   	Sg      	           	       	      	           	Wk
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><St>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Nom   	Sg      	           	       	      	           	St
gebildete 	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Fem><Nom><Sg><Wk>    	ADJ 	Pos     	Attr/Subst	        	Fem     	Nom   	Sg      	           	       	      	           	Wk
gebildeten	gebildet	gebildet<+ADJ><Pos><Attr/Subst><Masc><Acc><Sg><St>   	ADJ 	Pos     	Attr/Subst	        	Masc    	Acc   	Sg      	           	       	      	           	St
[…]
bildeten  	bilden  	bild<~>en<+V><1><Pl><Past><Ind>                      	V   	        	          	       1	        	      	Pl      	           	Past   	Ind
bildeten  	bilden  	bild<~>en<+V><1><Pl><Past><Subj>                     	V   	        	          	       1	        	      	Pl      	           	Past   	Subj
bilden    	bilden  	bild<~>en<+V><1><Pl><Pres><Ind>                      	V   	        	          	       1	        	      	Pl      	           	Pres   	Ind
bilden    	bilden  	bild<~>en<+V><1><Pl><Pres><Subj>                     	V   	        	          	       1	        	      	Pl      	           	Pres   	Subj
bildete   	bilden  	bild<~>en<+V><1><Sg><Past><Ind>                      	V   	        	          	       1	        	      	Sg      	           	Past   	Ind
bildete   	bilden  	bild<~>en<+V><1><Sg><Past><Subj>                     	V   	        	          	       1	        	      	Sg      	           	Past   	Subj
bilde     	bilden  	bild<~>en<+V><1><Sg><Pres><Ind>                      	V   	        	          	       1	        	      	Sg      	           	Pres   	Ind
bilde     	bilden  	bild<~>en<+V><1><Sg><Pres><Subj>                     	V   	        	          	       1	        	      	Sg      	           	Pres   	Subj
bildetet  	bilden  	bild<~>en<+V><2><Pl><Past><Ind>                      	V   	        	          	       2	        	      	Pl      	           	Past   	Ind
[…]
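
For further processing, such a table can be read into Python dictionaries. The following sketch assumes that the columns are separated by tab characters and padded with spaces, as the listing suggests, and that trailing empty columns may be missing. `read_table` is a hypothetical helper and the sample input is shortened.

```python
import csv
import io


def read_table(text):
    """Parse tab-separated table text into a list of dicts with stripped cells."""
    rows = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [cell.strip() for cell in next(rows)]
    table = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Pad short rows, since trailing empty columns may be omitted.
        cells += [""] * (len(header) - len(cells))
        table.append(dict(zip(header, cells)))
    return table


# Shortened sample in the layout of the listing above.
sample = (
    "Wordform  \tLemma   \tAnalysis\tPOS\tTense\n"
    "bildete   \tbilden  \tbild<~>en<+V><1><Sg><Past><Ind>\tV\tPast\n"
    "bilde     \tbilden  \tbild<~>en<+V><1><Sg><Pres><Ind>\tV\n"
)
table = read_table(sample)
print(sorted({row["Wordform"] for row in table}))
```

The tools described below offer CSV, JSON, and YAML output directly, which is likely the more robust route for automated processing.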

More sophisticated tools for analysis and paradigm generation with the DWDSmor Python library are provided by the Python scripts analysis.py and paradigm.py in the tools/ subdirectory:

$ ./tools/analysis.py -h
usage: analysis.py [-h] [-a] [-c] [-C] [-d AUTOMATA_DIR] [-e] [-H] [-I] [-j]
                   [-m] [-M] [-P] [-s] [-S] [-t {root,finite,morph,index,lemma}]
                   [-T {root,finite,morph,index,lemma}] [-v] [-y]
                   [input] [output]

positional arguments:
  input                 input file (one word form per line; default: stdin)
  output                output file (default: stdout)

options:
  -h, --help            show this help message and exit
  -a, --analysis-string output also analysis string
  -c, --csv             output CSV table
  -C, --force-color     preserve color and formatting when piping output
  -d AUTOMATA_DIR, --automata-dir AUTOMATA_DIR
                        automata directory (default: build/open)
  -e, --empty           show empty columns or values
  -H, --no-header       suppress table header
  -I, --no-info         do not show analyses with info tags
  -j, --json            output JSON object
  -m, --minimal         prefer lemmas with minimal number of boundaries
  -M, --maximal         prefer word forms with maximal number of boundaries (requires secondary automaton)
  -P, --plain           suppress color and formatting
  -s, --seg-lemma       output also segmented lemma
  -S, --seg-word        output also segmented word form (requires secondary automaton)
  -t {root,finite,morph,index,lemma}, --automaton-type {root,finite,morph,index,lemma}
                        type of primary automaton (default: lemma)
  -T {root,finite,morph,index,lemma}, --automaton2-type {root,finite,morph,index,lemma}
                        type of secondary automaton (default: morph)
  -v, --version         show program's version number and exit
  -y, --yaml            output YAML document
$ ./tools/paradigm.py -h
usage: paradigm.py [-h] [-c] [-C] [-d AUTOMATA_DIR] [-e] [-H]
                   [-i {1,2,3,4,5,6,7,8}] [-I {1,2,3,4,5,6,7,8}] [-j] [-n] [-N] [-o] [-O]
                   [-p {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}]
                   [-P] [-s] [-S] [-t {lemma,morph,index,finite,root}] [-u] [-v] [-y]
                   lemma [output]

positional arguments:
  lemma                 lemma (determiners: Fem Nom Sg; nominalised adjectives: Wk)
  output                output file (default: stdout)

options:
  -h, --help            show this help message and exit
  -c, --csv             output CSV table
  -C, --force-color     preserve color and formatting when piping output
  -d AUTOMATA_DIR, --automata-dir AUTOMATA_DIR
                        automata directory (default: build/dwds)
  -e, --empty           show empty columns or values
  -H, --no-header       suppress table header
  -i {1,2,3,4,5,6,7,8}, --lemma-index {1,2,3,4,5,6,7,8}
                        homographic lemma index
  -I {1,2,3,4,5,6,7,8}, --paradigm-index {1,2,3,4,5,6,7,8}
                        paradigm index
  -j, --json            output JSON object
  -n, --no-cats         do not output category names
  -N, --no-lemma        do not output lemma, lemma index, paradigm index, and lexical categories
  -o, --old             output also archaic forms
  -O, --oldorth         output also forms in old spelling
  -p {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}, --pos {NN,NPROP,ADJ,CARD,ORD,FRAC,ART,DEM,INDEF,POSS,PPRO,REL,WPRO,V}
                        part of speech
  -P, --plain           suppress color and formatting
  -s, --nonst           output also non-standard forms
  -S, --ch              output also forms in Swiss spelling
  -t {lemma,morph,index,finite,root}, --automaton-type {lemma,morph,index,finite,root}
                        automaton type (default: index)
  -u, --user-specified  use only user-specified information
  -v, --version         show program's version number and exit
  -y, --yaml            output YAML document
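
These scripts can also be driven from Python. The sketch below only uses options listed in the help output above (`-d`, `-t`, `-j`, and input on stdin); the helper names `analysis_command` and `analyse_words` are hypothetical, and the script path assumes that the code runs from the repository root with built automata available.

```python
import subprocess


def analysis_command(automata_dir="build/open", automaton_type="lemma", json_output=True):
    """Assemble an argument list for tools/analysis.py from documented options."""
    cmd = ["./tools/analysis.py", "-d", automata_dir, "-t", automaton_type]
    if json_output:
        cmd.append("-j")  # -j, --json: output JSON object
    return cmd


def analyse_words(words, **options):
    """Run analysis.py on word forms passed via stdin, one per line."""
    result = subprocess.run(
        analysis_command(**options),
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


print(analysis_command())
```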

Development

DWDSmor is in active development. In its current stage, it supports all major inflection classes and some productive word-formation patterns of written German.

Prerequisites

  • GNU/Linux: Development, builds and tests of DWDSmor are performed on Debian GNU/Linux. While other UNIX-like operating systems such as macOS should work, too, they are not actively supported.
  • Python ≥ v3.10: DWDSmor provides a Python interface for building DWDSmor lexica from source lexica in the DWDS XML format and for compiling DWDSmor automata from the resulting DWDSmor lexica and the DWDSmor grammar.
  • Saxon-HE: The entries in DWDSmor lexica are extracted from source lexica in the DWDS XML format by means of XSLT 2 stylesheets, using Saxon-HE as an XSLT processor. Saxon requires a Java runtime environment.
  • SFST: The DWDSmor automata are compiled using the SFST C++ library and toolbox for finite-state transducers (FSTs).

On a Debian-based distribution, the following command installs the required software:

apt-get install python3 default-jdk libsaxonhe-java sfst

Project setup

Optionally, set up a Python virtual environment for project builds, e.g. via Python’s venv:

python3 -m venv .venv
source .venv/bin/activate

Then install DWDSmor, including development dependencies:

pip install -U pip setuptools && pip install -e '.[dev]'

Building lexica and automata

Building different editions is facilitated via the script build-dwdsmor:

$ ./build-dwdsmor -h
usage: cli.py [-h] [-e {open,dwds,dev}] [-t {root,index,morph,finite,lemma}]
              [-m] [-f] [-q] [--release] [--tag]

Build DWDSmor.

options:
  -h, --help            show this help message and exit
  -e {open,dwds,dev}, --edition {open,dwds,dev}
                        edition to build (default: all)
  -t {root,index,morph,finite,lemma}, --automaton-type {root,index,morph,finite,lemma}
                        automaton type to build (default: all)
  -m, --with-metrics    measure UD/de-hdt coverage
  -f, --force           force building (also current targets)
  -q, --quiet           do not report progress
  --release             push automata to HF hub
  --tag                 tag HF hub release with current version

To build all editions available in the current git checkout, run:

./build-dwdsmor

The build result can be found in build/ with one subdirectory per edition.
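
A small helper can derive the expected location of an automaton file from the edition and automaton type. This is a sketch under the assumption that the automata files lie directly in the edition subdirectory (consistent with the default `build/open` of the `-d` option above, but not stated explicitly); `automaton_path` is a hypothetical name.

```python
from pathlib import Path

# Automaton types as listed for a DWDSmor edition.
AUTOMATON_TYPES = ("lemma", "morph", "finite", "root", "index")


def automaton_path(edition, automaton_type, compact=True, build_dir="build"):
    """Return the assumed path of an automaton file within a build directory."""
    if automaton_type not in AUTOMATON_TYPES:
        raise ValueError(f"unknown automaton type: {automaton_type}")
    # .ca is the compact format for analysis, .a the standard format for generation.
    suffix = ".ca" if compact else ".a"
    return Path(build_dir) / edition / f"{automaton_type}{suffix}"


print(automaton_path("open", "lemma"))
print(automaton_path("open", "index", compact=False))
```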

Testing

To test basic transducer usage and check for potential regressions, run

pytest

License

Like the original SMOR and SMORLemma grammars, the DWDSmor grammar and the DWDSmor Python library are licensed under the GNU General Public License v2.0. The same applies to the sample source lexicon and the automata of the Open Edition.

For the DWDS Edition based on the complete DWDS dictionary, all rights are reserved and individual license terms apply. If you are interested in the automata of the DWDS Edition, please contact us.

Contact

Feel free to contact Andreas Nolda for any question about this project.

Credits

DWDSmor is based on the following software and datasets:

  1. SFST, a C++ library and toolbox for finite-state transducers (FSTs) (Schmid 2006).
  2. SMORLemma (Sennrich and Kunz 2014), a modified version of the Stuttgart Morphology (SMOR) (Schmid, Fitschen, and Heid 2004) with an alternative lemmatisation component.
  3. The DWDS dictionary (BBAW n.d.), which replaces IMSLex (Fitschen 2004) as the lexical data source for German words, their grammatical categories, and their morphosyntactic properties.

References

  • Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) (ed.) (n.d.). DWDS – Digitales Wörterbuch der deutschen Sprache: Das Wortauskunftssystem zur deutschen Sprache in Geschichte und Gegenwart.
  • Fitschen, Arne (2004). Ein computerlinguistisches Lexikon als komplexes System. Ph.D. thesis, Universität Stuttgart.
  • Nolda, Andreas (2022). Headedness as an epiphenomenon: Case studies on compounding and blending in German. In Headedness and/or Grammatical Anarchy?, ed. by Ulrike Freywald, Horst Simon, and Stefan Müller, Empirically Oriented Theoretical Morphology and Syntax 11, Berlin: Language Science Press, 343–376.
  • Schmid, Helmut (2006). A programming language for finite state transducers. In Finite-State Methods and Natural Language Processing: 5th International Workshop, FSMNLP 2005, Helsinki, Finland, September 1–2, 2005, ed. by Anssi Yli-Jyrä, Lauri Karttunen, and Juhani Karhumäki, Lecture Notes in Artificial Intelligence 4002, Berlin: Springer, 1263–1266.
  • Schmid, Helmut, Arne Fitschen, and Ulrich Heid (2004). SMOR: A German computational morphology covering derivation, composition, and inflection. In LREC 2004: Fourth International Conference on Language Resources and Evaluation, ed. by Maria T. Lino et al., European Language Resources Association, 1263–1266.
  • Sennrich, Rico and Beat Kunz (2014). Zmorge: A German morphological lexicon extracted from Wiktionary. In LREC 2014: Ninth International Conference on Language Resources and Evaluation, ed. by Nicoletta Calzolari et al., European Language Resources Association, 1063–1067.

Evaluation results

Self-reported coverage on the Universal Dependencies Treebank (de-hdt), overall and by tag:

  • Overall: 0.858
  • $(: 1.000
  • $,: 1.000
  • $.: 1.000
  • ADJA: 0.777
  • ADJD: 0.769
  • ADV: 0.969
  • APPO: 0.999
  • APPR: 0.998
  • APPRART: 0.997
  • APZR: 1.000
  • ART: 1.000
  • CARD: 0.962
  • FM: 0.135
  • ITJ: 0.714
  • KOKOM: 1.000
  • KON: 1.000
  • KOUI: 1.000
  • KOUS: 0.986
  • NE: 0.063