---
tags:
- spacy
- token-classification
language:
- de
model-index:
- name: de_STTS2_folk_normal_orth
  results:
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9379513783
---

## de_STTS2_folk_normal_orth tagger

This is a spaCy model that tags German text with the Stuttgart-Tübingen Tagset version 2.0 (STTS 2.0), a tagset designed for transcripts of conversational German speech.
The model may be useful for tagging automatic speech recognition (ASR) transcripts such as those collected in the [CoGS](https://cc.oulu.fi/~scoats/CoGS.html) corpus.

The model was trained on the tag annotations from the FOLK corpus (https://agd.ids-mannheim.de/folk-gold.shtml), using an 80/20 training/test split. This version of the tagger was trained on data in standard German orthography with regard to capitalization.

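The card does not describe how the 80/20 split was drawn. As a minimal sketch of one common approach, assuming a seeded random shuffle over annotated utterances, the helper below is hypothetical (`split_train_test` and the placeholder utterances are not part of the model or the FOLK tooling):

```python
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-ins for annotated utterances.
utterances = [f"utterance_{i}" for i in range(10)]
train, test = split_train_test(utterances)
print(len(train), len(test))  # 8 2
```

Splitting by utterance rather than by token keeps each utterance's context intact within one partition; whether the original split was made this way is an assumption.
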
Usage example:

```python
# Notebook syntax; from a shell, run the same command without the leading "!".
!pip install https://huggingface.co/stcoats/de_STTS2_folk_normal_orth/resolve/main/de_STTS2_folk_normal_orth-any-py3-none-any.whl

import spacy
import de_STTS2_folk_normal_orth

# Load the installed pipeline (tok2vec + tagger).
nlp = de_STTS2_folk_normal_orth.load()
doc = nlp("ach so meinst du wir sollen es jetzt tun")

# Print each token with its STTS 2.0 tag.
for token in doc:
    print(token.text, token.tag_)
```

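For more than a handful of transcript lines, spaCy's `nlp.pipe` processes texts as a stream instead of one call per line. The following is a rough sketch rather than part of the model package: `tag_frequencies` is a hypothetical helper that takes an already loaded pipeline as its `nlp` argument and tallies the tags it assigns.

```python
from collections import Counter


def tag_frequencies(nlp, texts):
    """Count fine-grained tags over an iterable of transcript strings.

    `nlp` is assumed to be a loaded spaCy pipeline; only its `pipe`
    method and each token's `tag_` attribute are used.
    """
    counts = Counter()
    # nlp.pipe yields one Doc per input text.
    for doc in nlp.pipe(texts):
        counts.update(token.tag_ for token in doc)
    return counts
```

With `nlp` loaded as in the usage example, `tag_frequencies(nlp, lines).most_common(5)` would list the five most frequent tags in `lines`.
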
### References

Coats, Steven (2023). A new corpus of geolocated ASR transcripts from Germany. <i>Language Resources and Evaluation</i>. https://doi.org/10.1007/s10579-023-09686-9

Westpfahl, Swantje and Thomas Schmidt (2016). [FOLK-Gold – A GOLD standard for Part-of-Speech-Tagging of Spoken German](https://aclanthology.org/L16-1237). In: <i>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16)</i>, Portorož, Slovenia.

---

| Feature | Description |
| --- | --- |
| **Name** | `de_STTS2_folk_normal_orth` |
| **Version** | `0.0.1` |
| **spaCy** | `>=3.5.1,<3.6.0` |
| **Default Pipeline** | `tok2vec`, `tagger` |
| **Components** | `tok2vec`, `tagger` |
| **Vectors** | 0 keys, 0 unique vectors (0 dimensions) |
| **Sources** | n/a |
| **License** | n/a |
| **Author** | [n/a]() |

### Label Scheme

<details>

<summary>View label scheme (62 labels for 1 component)</summary>

| Component | Labels |
| --- | --- |
| **`tagger`** | `$.`, `AB`, `ADJA`, `ADJD`, `ADV`, `APPO`, `APPR`, `APPRART`, `APZR`, `ART`, `CARD`, `FM`, `KOKOM`, `KON`, `KOUI`, `KOUS`, `NE`, `NGAKW`, `NGHES`, `NGIRR`, `NGONO`, `NN`, `ORD`, `PDAT`, `PDS`, `PIAT`, `PIDAT`, `PIDS`, `PIS`, `PPER`, `PPOSAT`, `PPOSS`, `PRELAT`, `PRELS`, `PRF`, `PTKA`, `PTKIFG`, `PTKMA`, `PTKMWL`, `PTKNEG`, `PTKVZ`, `PTKZU`, `PWAT`, `PWAV`, `PWS`, `SEDM`, `SEQU`, `SPELL`, `TRUNC`, `UI`, `VAFIN`, `VAIMP`, `VAINF`, `VAPP`, `VMFIN`, `VMINF`, `VVFIN`, `VVIMP`, `VVINF`, `VVIZU`, `VVPP`, `XY` |

</details>

### Accuracy

| Type | Score |
| --- | --- |
| `TAG_ACC` | 93.80 |
| `TOK2VEC_LOSS` | 204127.79 |
| `TAGGER_LOSS` | 119369.65 |

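
`TAG_ACC` is a token-level score: the share of tokens whose predicted tag equals the gold tag, shown here as a percentage (the front matter value 0.9379513783 rounds to 93.80). As an illustrative sketch only, not the evaluation code used for this model, the hypothetical helper `tag_accuracy` computes that quantity on made-up tag sequences:

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where predicted and gold tags match."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on empty sequences")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Made-up example: the last token is mistagged as NE instead of NN.
gold = ["PPER", "VVFIN", "ART", "NN"]
pred = ["PPER", "VVFIN", "ART", "NE"]
print(f"{100 * tag_accuracy(gold, pred):.2f}")  # 75.00
```
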