language:
  - de
license: apache-2.0
tags:
  - austrian-german
  - dialect
  - classification
  - dach
  - text-classification
library_name: transformers
pipeline_tag: text-classification
base_model: bert-base-german-cased
extra_gated_prompt: >-
  This model is in validation phase. Access is granted to verified researchers
  and organizations. Please describe your intended use case.
extra_gated_fields:
  Full name: text
  Organization: text
  Intended use: text
  I agree to use this data for research purposes only:
    type: checkbox

DACH Dialect Classifier

Classifies German-language text by regional variety: Austrian (AT), German (DE), or Swiss (CH). The model is bert-base-german-cased fine-tuned on 1500 synthetic examples.

Results

Trained for 5 epochs on an RTX 3090 Ti. Results on the held-out test set (150 examples):

Class	Precision	Recall	F1
AT	0.96	0.96	0.96
DE	0.96	1.00	0.98
CH	0.98	0.94	0.96
Macro avg	0.97	0.97	0.97

Accuracy: 96.7%
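
As a quick consistency check, the macro averages can be recomputed from the per-class rows of the table. The sketch below also derives the accuracy from the recalls, which only works under the assumption that the test set is balanced at 50 examples per class; the card states only the total of 150, so treat that part as illustrative. The names `metrics` and `macro_average` are made up for this example.

```python
# Per-class metrics copied from the results table.
metrics = {
    "AT": {"precision": 0.96, "recall": 0.96, "f1": 0.96},
    "DE": {"precision": 0.96, "recall": 1.00, "f1": 0.98},
    "CH": {"precision": 0.98, "recall": 0.94, "f1": 0.96},
}


def macro_average(metric):
    """Unweighted mean of one metric over the three classes."""
    return sum(row[metric] for row in metrics.values()) / len(metrics)


macro = {name: round(macro_average(name), 2) for name in ("precision", "recall", "f1")}
print(macro)  # {'precision': 0.97, 'recall': 0.97, 'f1': 0.97}

# ASSUMPTION: a balanced test set with 50 examples per class
# (the card only states 150 examples in total).
per_class = 50
correct = sum(round(row["recall"] * per_class) for row in metrics.values())
accuracy = correct / (per_class * len(metrics))
print(correct, round(accuracy * 100, 1))  # 145 96.7
```

Under that balance assumption, the recalls correspond to 48, 50, and 47 correct predictions, which matches the reported accuracy.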

Usage

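from transformers import pipeline

# The repository is gated: request access on the Hub and authenticate
# (for example with `huggingface-cli login`) before loading the model.
clf = pipeline("text-classification", model="Laborator/dach-dialect-classifier")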

clf("I hob ma gestern a Semmerl und an Leberkas gholt.")
# [{'label': 'AT', 'score': 0.98}]

clf("Ich habe mir gestern ein Broetchen geholt.")
# [{'label': 'DE', 'score': 0.97}]

clf("Ich ha mir geschter es Broetli gholt.")
# [{'label': 'CH', 'score': 0.95}]
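
Since the model may be unsure on mixed or subtle inputs, it can help to discard low-confidence predictions. The following is a minimal sketch that post-processes the list-of-dicts output shown in the comments above; `confident_label` and the 0.8 threshold are hypothetical choices for illustration, not part of the model or of transformers.

```python
def confident_label(predictions, threshold=0.8):
    """Return the top label if its score clears the threshold, else None.

    `predictions` is a list of {'label': ..., 'score': ...} dicts,
    the shape shown in the usage comments above.
    """
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    # The threshold is an illustrative assumption; tune it on real data.
    return best["label"] if best["score"] >= threshold else None


print(confident_label([{"label": "AT", "score": 0.98}]))  # AT
print(confident_label([{"label": "CH", "score": 0.55}]))  # None
```

In practice the argument would be the result of a `clf(...)` call, and a `None` result could be routed to manual review.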

Training data

1500 synthetic examples, 500 each for AT, DE, and CH. The texts use real lexical markers for each variety:

Austrian: leiwand, Oida, Beisl, Sackerl, Bim, Semmel, Erdapfel, Topfen, Paradeiser, heuer, Perfekt instead of Praeteritum

German: Tuete, Broetchen, Kartoffel, Strassenbahn, Buergeramt, consistent Praeteritum ("ich ging", "ich sah")

Swiss: isch, haet, gmacht, Velo, Natel, Znueni, Muesli, Ruebli, Haerdoepfel, Grueezi
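
To make the role of these markers concrete, here is a naive keyword-count baseline built from the three lists above. It is not how the fine-tuned model works; it is only a sketch of the vocabulary signal the synthetic data carries, and could serve as a sanity-check baseline. `MARKERS` and `marker_baseline` are hypothetical names, and the grammatical markers (Perfekt versus Praeteritum) are left out because they cannot be matched by keyword.

```python
import re
from collections import Counter

# Lexical markers copied from the lists above (ASCII spellings as given).
MARKERS = {
    "AT": {"leiwand", "oida", "beisl", "sackerl", "bim", "semmel",
           "erdapfel", "topfen", "paradeiser", "heuer"},
    "DE": {"tuete", "broetchen", "kartoffel", "strassenbahn", "buergeramt"},
    "CH": {"isch", "haet", "gmacht", "velo", "natel", "znueni",
           "muesli", "ruebli", "haerdoepfel", "grueezi"},
}


def marker_baseline(text):
    """Return the variety with the most marker hits, or None on no hits or a tie."""
    tokens = re.findall(r"\w+", text.lower())
    hits = Counter()
    for label, words in MARKERS.items():
        hits[label] = sum(token in words for token in tokens)
    (best_label, best), (_, second) = hits.most_common(2)
    if best == 0 or best == second:
        return None
    return best_label


print(marker_baseline("Heuer gibt es Paradeiser und Topfen."))  # AT
print(marker_baseline("Ich fahre mit dem Velo."))               # CH
print(marker_baseline("Guten Morgen."))                         # None
```

A baseline like this fails on inflected or variant forms (for example "Semmerl" does not match "Semmel"), which is one reason to use a fine-tuned model instead.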

Limitations

The model was trained only on synthetic data. It catches vocabulary differences well but may miss subtler dialectal features and code-switching. Adding real text from parliamentary protocols, news, and forums would likely improve it.

License

Apache 2.0