---
language:
- de
license: apache-2.0
tags:
- austrian-german
- dialect
- classification
- dach
- text-classification
library_name: transformers
pipeline_tag: text-classification
base_model: bert-base-german-cased
extra_gated_prompt: >-
  This model is in validation phase. Access is granted to verified researchers
  and organizations. Please describe your intended use case.
extra_gated_fields:
  Full name: text
  Organization: text
  Intended use: text
  I agree to use this data for research purposes only:
    type: checkbox
---
DACH Dialect Classifier
Classifies German-language text by national variety: Austrian (AT), German (DE), or Swiss (CH). The model is bert-base-german-cased fine-tuned on 1500 synthetic examples.
Results
Trained for 5 epochs on an RTX 3090 Ti. Test set (150 examples, held out):
| | Precision | Recall | F1 |
|---|---|---|---|
| AT | 0.96 | 0.96 | 0.96 |
| DE | 0.96 | 1.00 | 0.98 |
| CH | 0.98 | 0.94 | 0.96 |
| Macro avg | 0.97 | 0.97 | 0.97 |
Accuracy: 96.7%
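To produce a report in the same shape on your own labeled sample, the sketch below shows how per-class precision, recall, and F1, the macro average (the unweighted mean of the class rows), and accuracy are conventionally computed. The function name `dialect_report` and the toy labels are illustrative assumptions; this is not the evaluation script used for this model.

```python
from collections import Counter


def dialect_report(gold, pred, labels=("AT", "DE", "CH")):
    """Return (per-class metrics, macro averages, accuracy) for two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed an example of the true class g

    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}

    # Macro average: plain mean over classes, ignoring class sizes.
    macro = {
        key: sum(report[label][key] for label in labels) / len(labels)
        for key in ("precision", "recall", "f1")
    }
    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return report, macro, accuracy


# Toy labels for illustration only (not taken from the test set).
gold = ["AT", "AT", "DE", "DE", "CH", "CH"]
pred = ["AT", "DE", "DE", "DE", "CH", "AT"]
report, macro, accuracy = dialect_report(gold, pred)
print(report, macro, accuracy)
```

In practice, `pred` would hold the `label` values returned by the pipeline for each text in the labeled sample.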
Usage
from transformers import pipeline
clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")
clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]
clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]
clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
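For downstream use it can help to treat low-confidence predictions as undecided rather than forcing one of the three labels. The following is a minimal sketch under stated assumptions: the helper name `apply_threshold`, the `UNKNOWN` fallback label, and the 0.9 cutoff are all hypothetical and untuned. It operates on the list of `{'label', 'score'}` dicts that the pipeline returns when called with a list of texts.

```python
def apply_threshold(predictions, threshold=0.9, fallback="UNKNOWN"):
    """Map each pipeline prediction to its label, or to `fallback` if the score is too low."""
    labels = []
    for prediction in predictions:
        if prediction["score"] >= threshold:
            labels.append(prediction["label"])
        else:
            labels.append(fallback)
    return labels


# Stand-in for `clf([...])` output so the sketch runs without downloading the model.
predictions = [
    {"label": "AT", "score": 0.98},
    {"label": "CH", "score": 0.95},
    {"label": "DE", "score": 0.55},  # hypothetical low-confidence output
]
print(apply_threshold(predictions))
```

With the real model, the call would look like `apply_threshold(clf(list_of_texts))`. A suitable cutoff should be chosen on held-out data rather than taken from this example.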
Training data
1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical markers for each variety:
- Austrian: leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum
- German: Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")
- Swiss: isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
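As a point of comparison for the fine-tuned model, the lexical markers listed above can be turned into a simple keyword baseline. This is a rough sketch, not how the model works or how the training data was generated: the names `MARKERS`, `marker_counts`, and `guess_variety` are hypothetical, only the single-word markers are used (the tense preferences are not captured), and matching is on exact lowercase tokens, so variants such as "Semmerl" are not recognized.

```python
import re

# Single-word markers copied from the list above, lowercased.
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel",
           "erdapfel", "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni",
           "muesli", "ruebli", "haerdoepfel", "grueezi"},
}


def marker_counts(text):
    """Count how many tokens of `text` are markers for each variety."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return {
        label: sum(token in words for token in tokens)
        for label, words in MARKERS.items()
    }


def guess_variety(text):
    """Return the variety with the most marker hits, or None on no hits or a tie."""
    counts = marker_counts(text)
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


print(guess_variety("Heuer gibt es Erdapfel mit Topfen."))
print(guess_variety("Grueezi, das Velo isch neu."))
print(guess_variety("Guten Morgen."))
```

A baseline like this is useful mainly as a sanity check: texts where the classifier and the keyword count disagree, or where markers from several varieties co-occur, are good candidates for manual review.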
Limitations
This model was trained on synthetic data. It catches vocabulary differences well but may miss subtler dialectal features or code-switching. Adding real text from parliamentary transcripts, news, and forums would improve it.
License
Apache 2.0