---
language:
- de
license: apache-2.0
tags:
- austrian-german
- dialect
- classification
- dach
- text-classification
library_name: transformers
pipeline_tag: text-classification
base_model: bert-base-german-cased
extra_gated_prompt: >-
  This model is in validation phase.
  Access is granted to verified researchers and organizations.
  Please describe your intended use case.
extra_gated_fields:
  Full name: text
  Organization: text
  Intended use: text
  I agree to use this data for research purposes only:
    type: checkbox
---
# DACH Dialect Classifier
Classifies German text as Austrian (AT), German (DE), or Swiss (CH). Fine-tuned `bert-base-german-cased` on 1500 synthetic examples.
## Results
Trained for 5 epochs on an RTX 3090 Ti. Results on the held-out test set (150 examples):
| | Precision | Recall | F1 |
|--|-----------|--------|-----|
| AT | 0.96 | 0.96 | 0.96 |
| DE | 0.96 | 1.00 | 0.98 |
| CH | 0.98 | 0.94 | 0.96 |
| **Macro avg** | **0.97** | **0.97** | **0.97** |
Accuracy: **96.7%**
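
As a quick consistency check, the macro averages and the accuracy figure can be recomputed from the per-class rows of the table. This is a minimal sketch: the per-class numbers are copied from the table, while the per-class test support of 50 examples is an assumption (150 test examples split evenly, mirroring the balanced training data), not something the card states.

```python
# Per-class scores copied from the results table.
per_class = {
    "AT": {"precision": 0.96, "recall": 0.96, "f1": 0.96},
    "DE": {"precision": 0.96, "recall": 1.00, "f1": 0.98},
    "CH": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric across all classes."""
    values = [row[metric] for row in scores.values()]
    return sum(values) / len(values)


macro = {
    metric: round(macro_average(per_class, metric), 2)
    for metric in ("precision", "recall", "f1")
}
print(macro)  # {'precision': 0.97, 'recall': 0.97, 'f1': 0.97}

# ASSUMPTION: 50 test examples per class (150 in total, evenly split).
support = 50
correct = sum(round(row["recall"] * support) for row in per_class.values())
accuracy = correct / (support * len(per_class))
print(f"{correct} correct, accuracy {accuracy:.1%}")  # 145 correct, accuracy 96.7%
```

Under that assumption the table and the reported accuracy agree: 48 + 50 + 47 = 145 correct predictions out of 150.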
## Usage
```python
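from transformers import pipeline

# Load the fine-tuned classifier from the Hub. The repository is gated, so this
# needs an authenticated session (e.g. `huggingface-cli login`) with access granted.
clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")

# Austrian German
clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]

# German as used in Germany
clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]

# Swiss German
clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]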
```
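
For batches of text it may be useful to flag predictions the model is unsure about. The sketch below wraps any classifier that returns one `{'label', 'score'}` dict per input, which is what the `text-classification` pipeline does when called with a list of strings and default settings. The helper name `classify_with_threshold` and the `min_score` cutoff of 0.8 are hypothetical and have not been tuned for this model.

```python
from typing import Callable, Dict, List


def classify_with_threshold(
    clf: Callable, texts: List[str], min_score: float = 0.8
) -> List[Dict]:
    """Classify a batch and mark low-confidence predictions as 'uncertain'.

    ASSUMPTION: min_score is an illustrative cutoff, not a tuned value.
    """
    predictions = clf(texts)  # one {'label': ..., 'score': ...} dict per text
    results = []
    for text, pred in zip(texts, predictions):
        confident = pred["score"] >= min_score
        results.append(
            {
                "text": text,
                "label": pred["label"] if confident else "uncertain",
                "score": pred["score"],
            }
        )
    return results
```

With the `clf` pipeline from the example above, a call would look like `classify_with_threshold(clf, ["Grueezi mitenand", "Guten Tag"])`.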
## Training data
1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical markers for each variety:
**Austrian:** leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum
**German:** Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")
**Swiss:** isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
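
To make the role of these markers concrete, here is a small keyword-counting baseline built from the word lists above. It is only an illustrative sketch and not how the fine-tuned model works: the names `MARKERS`, `count_markers`, and `guess_variety` are hypothetical, matching is on exact lowercase tokens, and grammatical cues such as Perfekt versus Praeteritum are ignored.

```python
import re
from collections import Counter

# Lexical markers taken from the lists above (lowercased, ASCII spelling).
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel",
           "erdapfel", "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni",
           "muesli", "ruebli", "haerdoepfel", "grueezi"},
}


def count_markers(text):
    """Count how many tokens in the text match each variety's marker list."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    counts = Counter()
    for token in tokens:
        for variety, words in MARKERS.items():
            if token in words:
                counts[variety] += 1
    return counts


def guess_variety(text):
    """Return the variety with the most marker hits, or None if there are none."""
    counts = count_markers(text)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(guess_variety("Ich nehme die Bim und kaufe ein Sackerl Paradeiser."))  # AT
print(guess_variety("Grueezi, ich fahre mit dem Velo."))  # CH
print(guess_variety("Guten Tag."))  # None
```

A baseline like this misses inflected or diminutive forms (for example "Semmerl" does not match "semmel") and returns nothing for text without listed markers.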
## Limitations
The model was trained only on synthetic data. It catches vocabulary differences well but may miss subtler dialectal features and code-switching. Adding real text from parliamentary protocols, news, and forums would improve it.
## License
Apache 2.0