This model is in validation phase. Access is granted to verified researchers and organizations. Please describe your intended use case.
DACH Dialect Classifier
Classifies German text as Austrian (AT), German (DE), or Swiss (CH). Fine-tuned bert-base-german-cased on 1500 synthetic examples.
Results
Trained for 5 epochs on an RTX 3090 Ti. Results on the held-out test set (150 examples):
| | Precision | Recall | F1 |
|---|---|---|---|
| AT | 0.96 | 0.96 | 0.96 |
| DE | 0.96 | 1.00 | 0.98 |
| CH | 0.98 | 0.94 | 0.96 |
| Macro avg | 0.97 | 0.97 | 0.97 |
Accuracy: 96.7%
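As a quick sanity check, the macro averages in the table can be recomputed from the per-class rows. This is a minimal sketch rather than the evaluation script used for the model. The count of 145 correct predictions is an assumption inferred from the reported 96.7% on 150 examples; the card does not state it.

```python
# Per-class scores copied from the results table.
per_class = {
    "AT": {"precision": 0.96, "recall": 0.96, "f1": 0.96},
    "DE": {"precision": 0.96, "recall": 1.00, "f1": 0.98},
    "CH": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric across all classes."""
    return sum(row[metric] for row in scores.values()) / len(scores)


macro = {
    metric: round(macro_average(per_class, metric), 2)
    for metric in ("precision", "recall", "f1")
}
print(macro)

# Assumption: 145 of 150 correct, inferred from the reported accuracy.
correct, total = 145, 150
accuracy_pct = round(100 * correct / total, 1)
print(accuracy_pct)
```

Each macro value is the plain mean of the three class rows, so every class counts equally regardless of how many test examples it has.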
Usage
from transformers import pipeline
clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")
clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]
clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]
clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
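Because the pipeline returns a confidence score with every label, callers may want to treat low-confidence predictions as undecided. The sketch below shows one way to do that. The helper name `classify_with_threshold`, the 0.9 cut-off, and the `"UNSURE"` fallback label are hypothetical and not part of the model or of `transformers`.

```python
from typing import Callable, Dict, List


def classify_with_threshold(
    clf: Callable[[List[str]], List[Dict]],
    texts: List[str],
    threshold: float = 0.9,      # hypothetical cut-off, tune on your own data
    fallback: str = "UNSURE",    # hypothetical label for low-confidence cases
) -> List[str]:
    """Return one label per text, replacing low-confidence labels with a fallback."""
    predictions = clf(texts)  # one {'label': ..., 'score': ...} dict per input text
    return [
        pred["label"] if pred["score"] >= threshold else fallback
        for pred in predictions
    ]


# With the real model (requires access to the repository):
# from transformers import pipeline
# clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")
# classify_with_threshold(clf, ["Ich habe mir gestern ein Broetchen geholt."])
```

Passing the classifier in as an argument keeps the helper testable without downloading the model.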
Training data
1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical markers for each variety:
Austrian: leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum
German: Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")
Swiss: isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
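To see which of the listed markers occur in a given text, a simple lookup like the following can help when inspecting inputs or predictions. It is an illustrative sketch only: it is not how the classifier works or how the training data was generated, the name `marker_hits` is hypothetical, and the grammatical markers (Perfekt versus Praeteritum) are left out because they cannot be matched as single words.

```python
import re

# Lexical markers copied from the lists above, lowercased.
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel",
           "erdapfel", "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni",
           "muesli", "ruebli", "haerdoepfel", "grueezi"},
}


def marker_hits(text):
    """Return, per variety, the sorted list of known markers found in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return {variety: sorted(tokens & words) for variety, words in MARKERS.items()}


print(marker_hits("Ich habe mir gestern ein Broetchen geholt."))
```

The lookup matches whole words only, so inflected or diminutive forms such as "Semmerl" are not counted as hits for "Semmel".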
Limitations
This model was trained only on synthetic data. It catches vocabulary differences well but may miss subtler dialectal features or code-switching. Adding real text from parliamentary protocols, news, and forums would likely improve it.
License
Apache 2.0