flix-swiss-german-full

A fine-tuned version of openai/whisper-large-v3 for Swiss German (Schweizerdeutsch) automatic speech recognition. The model transcribes Swiss German dialect speech into grammatically correct Standard German text.

To our knowledge, this is the first publicly available, fully fine-tuned Whisper model for Swiss German.

Model Description

  • Base model: openai/whisper-large-v3 (1.55B parameters)
  • Fine-tuning: Full fine-tune (all parameters trainable)
  • Training data: 1,367 hours of Swiss German speech from broadcast subtitles, parliamentary proceedings, YouTube, and Swiss film
  • Task: Swiss German speech → Standard German text (dialect-to-standard translation + transcription)
  • Hardware: NVIDIA DGX Spark GB10 (128 GB unified memory), single desktop workstation

Performance

| Metric | Value | Notes |
| --- | --- | --- |
| WER (measured) | 25.60% | ASGDTS, 5,750 samples, honest evaluation |
| cWER (content errors only) | 13.8% | Excludes style/convention differences |
| sWER (style component) | 11.3% | Valid alternative translations penalized by WER |
| bWER (bias-corrected) | 8.5% | Estimated true error rate |
| Whisper large-v3 baseline | 28.56% | Zero-shot, no fine-tuning |

Important Context on WER

Our WER of 25.60% should be interpreted carefully:

  • ~64% of evaluation samples are semantically correct (categories KORREKT, "correct", and STIL, "style") but are still penalized by WER because of transcription convention differences (tense, reformulation style)
  • The genuine content error rate is 13.8% cWER; bias-corrected estimation yields 8.5% bWER
  • Lower published WER scores (Michaud 17.5%, ZHAW 17.1%) overstate accuracy because of benchmark contamination; see our paper for details
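
To make the convention penalty concrete, the sketch below computes plain word error rate as word-level edit distance divided by the number of reference words. The `word_error_rate` helper and the sentence pair are illustrative assumptions, not taken from ASGDTS or from the evaluation pipeline, and the cWER, sWER, and bWER decompositions defined in the paper are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from the hypothesis
            insertion = current[j - 1] + 1  # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical pair: both sentences mean "he went home", but the tense differs.
reference = "er ist nach hause gegangen"  # perfect tense
hypothesis = "er ging nach hause"         # simple past
print(word_error_rate(reference, hypothesis))  # 0.4 (one substitution and one deletion over 5 words)
```

In this made-up case a meaning-preserving change of tense alone costs 40% WER, which is the kind of penalty the style component is meant to separate from genuine content errors.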

Usage

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

model_id = "Flix-AI/flix-swiss-german-full"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Transcribe Swiss German audio
audio_array = ...  # placeholder: replace with a 1-D float NumPy array of 16 kHz mono audio
input_features = processor(
    audio_array, sampling_rate=16000, return_tensors="pt"
).input_features.to(model.device, dtype=torch.bfloat16)

predicted_ids = model.generate(input_features, language="de", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
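
The snippet above handles a single clip. Whisper's feature extractor works on 30-second windows and, by default, pads or truncates its input to that length, so longer recordings need to be split first. The following is a minimal sketch of fixed-length splitting; `split_into_segments` is a hypothetical helper, and the fixed 30-second cut without overlap is an assumption rather than the segmentation used for training or evaluation.

```python
import numpy as np


def split_into_segments(audio, sampling_rate=16000, segment_seconds=30.0):
    """Split a 1-D mono waveform into consecutive segments of at most segment_seconds."""
    audio = np.asarray(audio)
    if audio.ndim != 1:
        raise ValueError("expected a 1-D mono waveform")
    samples_per_segment = int(sampling_rate * segment_seconds)
    return [
        audio[start:start + samples_per_segment]
        for start in range(0, len(audio), samples_per_segment)
    ]


# Example: 70 seconds of silence at 16 kHz yields segments of 30 s, 30 s, and 10 s.
waveform = np.zeros(70 * 16000, dtype=np.float32)
segments = split_into_segments(waveform)
print([len(segment) / 16000 for segment in segments])  # [30.0, 30.0, 10.0]
```

Each segment can then be passed to `processor` and `model.generate` as in the snippet above. Cutting at fixed positions can split words, so cutting at pauses would likely be preferable in practice.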

Training Details

Data Sources

| Source | Hours | License | Content |
| --- | --- | --- | --- |
| SRF Mediathek | 848h | Research use (Art. 24d URG) | Broadcast subtitles (news, entertainment, documentary) |
| Swiss Parliament (SPC v2) | 202h | CC BY 4.0 | Parliamentary speeches (Grosser Rat BE) |
| YouTube | 151h | Research use (Art. 24d URG) | 25 institutional channels (cantons, police, podcasts) |
| PlaySuisse | 165h | Research use (Art. 24d URG) | Swiss films and series |
| Total | 1,367h | | |

No training data is redistributed with this model. The model was trained under the Swiss text and data mining research exception (Art. 24d URG).

Training Configuration

| Parameter | Value |
| --- | --- |
| Trainable parameters | 1,543,490,560 (100%) |
| Optimizer | AdamW |
| Learning rate | 1×10⁻⁵ (cosine decay) |
| Warmup steps | 500 |
| Effective batch size | 32 |
| Precision | bfloat16 |
| Gradient checkpointing | Enabled |
| SpecAugment | Enabled |
| Training time | ~73 hours (2 epochs) |
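
As an illustration of the schedule in the table, the sketch below combines a 500-step warmup with cosine decay from the peak rate of 1×10⁻⁵. The linear warmup shape, the decay to zero, the `learning_rate` helper, and the `total_steps` value of 10,000 are assumptions for illustration; the table does not state them.

```python
import math

PEAK_LR = 1e-5      # from the table
WARMUP_STEPS = 500  # from the table


def learning_rate(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps is a hypothetical value; the real step count is not given in the table.
total_steps = 10_000
for step in (0, 250, 500, 5_250, 10_000):
    print(step, learning_rate(step, total_steps))
```

The rate rises linearly over the first 500 steps, peaks at 1×10⁻⁵, reaches half the peak midway through the decay phase, and ends at zero.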

Dialect Coverage

The training data covers all major Swiss German dialect regions:

| Dialect | Primary Source |
| --- | --- |
| Züridütsch | SRF, YouTube |
| Berndeutsch | SPC v2 (dominant), SRF |
| Luzernerdeutsch | SRF, YouTube |
| Baseldeutsch | SRF, YouTube |
| St. Gallerdeutsch | SRF, YouTube |
| Walliserdeutsch | SRF, PlaySuisse |
| Bündnerdeutsch | YouTube |
| Appenzellerdeutsch | SRF |

Limitations

  1. Proper nouns: The model may misspell names and places it hasn't encountered during training
  2. Word order: Swiss German sentence structure sometimes differs from Standard German; the model may produce valid but differently ordered translations
  3. Convention mismatch: The model was trained on broadcast subtitles (editorial style), so its output may differ from what verbatim transcription expects
  4. No context: The model processes segments independently; it cannot use broader conversation context for disambiguation

Citation

@article{akeret2026whisper-swiss-german,
  title={Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6\% WER (13.8\% cWER)},
  author={Akeret, Felix},
  year={2026},
  url={https://huggingface.co/Flix-AI/flix-swiss-german-full}
}

Acknowledgments

  • OpenAI for the Whisper model
  • FHNW/i4ds for the Swiss Parliament Corpus (SPC v2) and ASGDTS benchmark
  • SRF for publicly accessible broadcast content
  • PlaySuisse for Swiss film and series content