de_STTS2_folk_normal_orth tagger

This is a spaCy pipeline that assigns part-of-speech tags from the Stuttgart-Tübingen Tagset (STTS) version 2.0, a tagset designed for transcripts of conversational German speech. The model may be useful for tagging automatic speech recognition (ASR) transcripts such as those collected in the CoGS corpus.

The model was trained on the tag annotations of the FOLK corpus (https://agd.ids-mannheim.de/folk-gold.shtml), using an 80/20 training/test split. This version of the tagger was trained on data in standard German orthography with regard to capitalization.
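
An 80/20 split of the kind mentioned above can be sketched as follows. This is only an illustration: the card does not say whether the split was randomized or whether it was made over tokens, utterances, or transcripts, so the shuffling, the seed, the `split_train_test` helper, and the placeholder items are all assumptions.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-ins for annotated utterances (not FOLK data)
utterances = [f"utterance_{i}" for i in range(10)]
train, test = split_train_test(utterances)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so the same held-out items are used every time the tagger is evaluated.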

Usage example:

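# In a notebook (Jupyter, Colab); from a shell, run the same command without the leading "!"
!pip install https://huggingface.co/stcoats/de_STTS2_folk_normal_orth/resolve/main/de_STTS2_folk_normal_orth-any-py3-none-any.whl
import spacy
import de_STTS2_folk_normal_orth

# Load the installed pipeline package
nlp = de_STTS2_folk_normal_orth.load()
doc = nlp("ach so meinst du wir sollen es jetzt tun")
# Print each token with its STTS 2.0 tag
for token in doc:
    print(token.text, token.tag_)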
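
Once a text is tagged, a common next step is to summarize how often each tag occurs. The sketch below works on plain `(token, tag)` pairs so that it runs without the model installed. The `tag_counts` helper is a hypothetical name, and the example pairs are hand-written placeholders, not actual output of this tagger.

```python
from collections import Counter


def tag_counts(pairs):
    """Count how often each tag occurs in (token_text, tag) pairs."""
    return Counter(tag for _, tag in pairs)


# With the model loaded, the pairs would come from the tagged document:
#     pairs = [(token.text, token.tag_) for token in doc]
# The pairs below are hand-written placeholders, not model output.
pairs = [("du", "PPER"), ("wir", "PPER"), ("sollen", "VMFIN"), ("tun", "VVINF")]
counts = tag_counts(pairs)
print(counts.most_common())
```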

References

Coats, Steven. (2023). A new corpus of geolocated ASR transcripts from Germany. Language Resources and Evaluation. https://doi.org/10.1007/s10579-023-09686-9

Westpfahl, Swantje and Thomas Schmidt. (2016). FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia.

| Feature | Description |
| --- | --- |
| Name | `de_STTS2_folk_normal_orth` |
| Version | `0.0.1` |
| spaCy | `>=3.5.1,<3.6.0` |
| Default Pipeline | `tok2vec`, `tagger` |
| Components | `tok2vec`, `tagger` |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | n/a |
| License | n/a |
| Author | n/a |
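
The spaCy entry above restricts the package to versions from 3.5.1 up to, but not including, 3.6.0. A minimal sketch of checking a version string against that range is shown below. The helpers `parse_version` and `is_supported` are hypothetical names, and the parsing assumes plain `X.Y.Z` strings, so pre-release suffixes are not handled.

```python
def parse_version(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def is_supported(version, minimum="3.5.1", upper="3.6.0"):
    """Return True if minimum <= version < upper."""
    return parse_version(minimum) <= parse_version(version) < parse_version(upper)


# With spaCy installed, the string to check would be spacy.__version__.
for candidate in ("3.5.0", "3.5.4", "3.6.0"):
    print(candidate, is_supported(candidate))
```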

Label Scheme

View label scheme (62 labels for 1 component)

| Component | Labels |
| --- | --- |
| `tagger` | `$.`, `AB`, `ADJA`, `ADJD`, `ADV`, `APPO`, `APPR`, `APPRART`, `APZR`, `ART`, `CARD`, `FM`, `KOKOM`, `KON`, `KOUI`, `KOUS`, `NE`, `NGAKW`, `NGHES`, `NGIRR`, `NGONO`, `NN`, `ORD`, `PDAT`, `PDS`, `PIAT`, `PIDAT`, `PIDS`, `PIS`, `PPER`, `PPOSAT`, `PPOSS`, `PRELAT`, `PRELS`, `PRF`, `PTKA`, `PTKIFG`, `PTKMA`, `PTKMWL`, `PTKNEG`, `PTKVZ`, `PTKZU`, `PWAT`, `PWAV`, `PWS`, `SEDM`, `SEQU`, `SPELL`, `TRUNC`, `UI`, `VAFIN`, `VAIMP`, `VAINF`, `VAPP`, `VMFIN`, `VMINF`, `VVFIN`, `VVIMP`, `VVINF`, `VVIZU`, `VVPP`, `XY` |

Accuracy

| Type | Score |
| --- | --- |
| `TAG_ACC` | 93.80 |
| `TOK2VEC_LOSS` | 204127.79 |
| `TAGGER_LOSS` | 119369.65 |
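
`TAG_ACC` is a token-level tag accuracy, reported here as a percentage: the share of tokens whose predicted tag matches the gold tag. The sketch below shows that calculation on hand-written tag sequences. The `tag_accuracy` helper and the example tags are illustrative assumptions, not the evaluation code or data used for this model.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Hand-written placeholder sequences; one of five tags differs.
gold = ["PPER", "VMFIN", "PPER", "ADV", "VVINF"]
pred = ["PPER", "VMFIN", "PPER", "ADV", "VVFIN"]
score = tag_accuracy(gold, pred)
print(f"{score:.3f}")
```

Multiplying the fraction by 100 gives a value on the same scale as the table.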

Evaluation results

TAG (XPOS) accuracy (self-reported): 0.938