This model is in a validation phase. Access is granted to verified researchers and organizations.

DACH Dialect Classifier

Classifies German text as Austrian (AT), German (DE), or Swiss (CH). Fine-tuned bert-base-german-cased on 1500 synthetic examples.

Results

Trained for 5 epochs on an RTX 3090 Ti. Test set (150 examples, held out):

Class      Precision  Recall  F1
AT         0.96       0.96    0.96
DE         0.96       1.00    0.98
CH         0.98       0.94    0.96
Macro avg  0.97       0.97    0.97

Accuracy: 96.7%
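
The macro averages and the accuracy can be cross-checked against the per-class rows. The sketch below assumes the 150 held-out examples are split evenly, 50 per class; the card states only the total, so the per-class support is an assumption.

```python
# Per-class scores copied from the results table: (precision, recall, F1).
scores = {
    "AT": (0.96, 0.96, 0.96),
    "DE": (0.96, 1.00, 0.98),
    "CH": (0.98, 0.94, 0.96),
}

# ASSUMPTION: 50 test examples per class (only the total of 150 is stated).
support = {"AT": 50, "DE": 50, "CH": 50}

# Macro average: unweighted mean of each metric over the three classes.
macro = [sum(row[i] for row in scores.values()) / len(scores) for i in range(3)]

# Recall times support gives the number of correctly classified examples.
correct = sum(round(scores[c][1] * support[c]) for c in scores)
accuracy = correct / sum(support.values())

print([round(m, 2) for m in macro])  # [0.97, 0.97, 0.97]
print(f"{accuracy:.1%}")             # 96.7%
```

Under that assumption, the reported numbers correspond to 145 of 150 test examples classified correctly.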

Usage

from transformers import pipeline

clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")

clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]

clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]

clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
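
For texts that are mixed or carry few markers, it may help to treat low-confidence predictions as undecided. The following is a minimal sketch of such a post-processing step; the `pick_label` helper, the 0.8 threshold, and the "UNSURE" placeholder are illustrative assumptions, not part of the model. It works on dicts shaped like the pipeline output shown above, so it runs without downloading the model.

```python
def pick_label(prediction, threshold=0.8):
    """Return the predicted label, or "UNSURE" if the score is below threshold.

    `prediction` is one dict from the pipeline output, e.g.
    {'label': 'AT', 'score': 0.98}. The threshold value and the "UNSURE"
    placeholder are assumptions chosen for illustration.
    """
    if prediction["score"] >= threshold:
        return prediction["label"]
    return "UNSURE"


# With the real model: predictions = clf(["text one", "text two"])
# Here, hand-written dicts in the same shape stand in for model output.
predictions = [
    {"label": "AT", "score": 0.98},
    {"label": "CH", "score": 0.55},
]
labels = [pick_label(p) for p in predictions]
print(labels)  # ['AT', 'UNSURE']
```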

Training data

1500 synthetic examples: 500 each for AT, DE, and CH. The texts use real lexical and grammatical markers for each variety:

Austrian: leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum

German: Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")

Swiss: isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
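
To give a feel for how much signal the marker words above carry on their own, here is a rule-based baseline that simply counts them. This is a hypothetical sketch for comparison, not how the fine-tuned model works: `MARKERS` and `guess_variety` are names introduced here, and the grammatical markers (Perfekt versus Praeteritum) are not covered.

```python
import re

# Marker words taken from the lists above, lowercased.
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel",
           "erdapfel", "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni", "muesli",
           "ruebli", "haerdoepfel", "grueezi"},
}


def guess_variety(text):
    """Return the variety with the most marker hits, or None if there are none."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    hits = {label: sum(t in words for t in tokens)
            for label, words in MARKERS.items()}
    best = max(hits, key=hits.get)  # ties go to the first label in MARKERS
    return best if hits[best] > 0 else None


print(guess_variety("Grueezi, das Velo isch kaputt."))           # CH
print(guess_variety("Die Strassenbahn faehrt zum Buergeramt."))  # DE
print(guess_variety("Guten Morgen."))                            # None
```

A baseline like this fails on inflected or diminutive forms (for example "Semmerl" does not match "semmel"), which is one reason to use a learned classifier instead.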

Limitations

The model was trained on synthetic data only. It catches vocabulary differences well but might miss subtler dialectal features or code-switching. Adding real text from parliament protocols, news, and forums would improve it.
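
One way to check how these limitations affect your own use case is to score the model on a small hand-labelled sample of real text. The sketch below computes per-class precision, recall, and F1 from two parallel label lists; `per_class_report` is a hypothetical helper, and the sample labels are made up for illustration rather than taken from any evaluation of this model.

```python
def per_class_report(gold, predicted, labels=("AT", "DE", "CH")):
    """Compute precision, recall and F1 per label from parallel label lists."""
    pairs = list(zip(gold, predicted))
    report = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical hand-labelled sample; replace with real texts and model output.
gold = ["AT", "AT", "DE", "CH", "CH", "DE"]
predicted = ["AT", "DE", "DE", "CH", "AT", "DE"]

report = per_class_report(gold, predicted)
for label, metrics in report.items():
    print(label, {name: round(value, 2) for name, value in metrics.items()})
```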

License

Apache 2.0
