---
language:
- de
license: apache-2.0
tags:
- austrian-german
- dialect
- classification
- dach
- text-classification
library_name: transformers
pipeline_tag: text-classification
base_model: bert-base-german-cased
extra_gated_prompt: >-
  This model is in validation phase.
  Access is granted to verified researchers and organizations.
  Please describe your intended use case.
extra_gated_fields:
  Full name: text
  Organization: text
  Intended use: text
  I agree to use this data for research purposes only:
    type: checkbox
---

# DACH Dialect Classifier

Classifies German-language text by regional variety: Austrian (AT), German (DE), or Swiss (CH). The model is `bert-base-german-cased` fine-tuned on 1500 synthetic examples.

## Results

Trained for 5 epochs on an RTX 3090 Ti. Scores on the held-out test set (150 examples):

| Class | Precision | Recall | F1 |
|-------|-----------|--------|------|
| AT | 0.96 | 0.96 | 0.96 |
| DE | 0.96 | 1.00 | 0.98 |
| CH | 0.98 | 0.94 | 0.96 |
| **Macro avg** | **0.97** | **0.97** | **0.97** |

Accuracy: **96.7%**
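
The macro averages are the unweighted means of the three per-class values. The following is a minimal sketch that recomputes them from the rounded figures in the table. The accuracy check additionally assumes the test set is balanced at 50 examples per class, which this card does not state explicitly.

```python
# Per-class scores copied from the results table (already rounded to 2 decimals).
per_class = {
    "AT": {"precision": 0.96, "recall": 0.96, "f1": 0.96},
    "DE": {"precision": 0.96, "recall": 1.00, "f1": 0.98},
    "CH": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
}


def macro_average(scores, metric):
    """Unweighted mean of one metric over all classes."""
    return sum(s[metric] for s in scores.values()) / len(scores)


macro = {m: round(macro_average(per_class, m), 2) for m in ("precision", "recall", "f1")}
print(macro)  # {'precision': 0.97, 'recall': 0.97, 'f1': 0.97}

# ASSUMPTION: 50 test examples per class (150 total, balanced like the training data).
PER_CLASS_N = 50
correct = sum(round(s["recall"] * PER_CLASS_N) for s in per_class.values())
accuracy = correct / (PER_CLASS_N * len(per_class))
print(correct, f"{accuracy:.1%}")  # 145 96.7%
```

Under that assumption the per-class recalls correspond to 48, 50, and 47 correct predictions, which matches the reported accuracy.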

## Usage

```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub
# (the repo is gated: request access and log in first).
clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")

# Austrian example
clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]

# German example
clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]

# Swiss example
clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
```
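
To get scores for all three labels rather than only the top one, the pipeline can be called with `top_k=None`. The sketch below is an illustration, not part of the released code: `top_label`, `classify_batch`, and the 0.8 threshold are hypothetical names and values introduced here. The pipeline is created inside a function so that nothing is downloaded on import.

```python
def top_label(scores, threshold=0.8):
    """Return the best label, or "uncertain" if its score is below `threshold`.

    `scores` is a list of {"label": ..., "score": ...} dicts for one text.
    The 0.8 default is an arbitrary, hypothetical cut-off.
    """
    best = max(scores, key=lambda s: s["score"])
    return best["label"] if best["score"] >= threshold else "uncertain"


def classify_batch(texts, threshold=0.8):
    """Classify several texts, keeping every class score for each one."""
    # Imported lazily; running this needs access to the gated Hub repo.
    from transformers import pipeline

    clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")
    # top_k=None asks for the score of every label, not only the top one.
    all_scores = clf(list(texts), top_k=None)
    return [(text, top_label(scores, threshold)) for text, scores in zip(texts, all_scores)]
```

Flagging low-confidence predictions as "uncertain" is one way to handle mixed or near-standard text, where the three varieties are hard to tell apart.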

## Training data

1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical and grammatical markers of each variety:

**Austrian:** leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum

**German:** Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")

**Swiss:** isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
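
To gauge how much of the signal is plain vocabulary, a keyword lookup over the markers above makes a useful point of comparison. This is a hypothetical baseline for illustration only, not how the classifier works; `MARKERS` and `marker_counts` are names introduced here.

```python
import re

# Marker words copied from the lists above, lowercased. Grammatical cues
# (Perfekt vs. Praeteritum) are not covered by this simple word lookup.
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel", "erdapfel",
           "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni", "muesli",
           "ruebli", "haerdoepfel", "grueezi"},
}


def marker_counts(text):
    """Count how many tokens of `text` are listed markers of each variety."""
    tokens = re.findall(r"\w+", text.lower())
    return {label: sum(tok in words for tok in tokens) for label, words in MARKERS.items()}


print(marker_counts("Grueezi, das Velo isch neu."))
# {'AT': 0, 'DE': 0, 'CH': 3}
```

Texts on which this lookup returns all zeros are the ones where the fine-tuned model has to rely on something other than these marker words.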

## Limitations

The model was trained only on synthetic data. It picks up vocabulary differences well but may miss subtler dialectal features or code-switching. Adding real text from parliamentary protocols, news, and forums would improve it.

## License

Apache 2.0