---
license: mit
language:
- de
tags:
- automatic-speech-recognition
- moonshine
- german
- asr
- speech
datasets:
- facebook/multilingual_librispeech
metrics:
- wer
base_model: UsefulSensors/moonshine-tiny
model-index:
- name: moonshine-tiny-de
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: MLS German (test split)
      type: facebook/multilingual_librispeech
      args: german
    metrics:
    - name: WER
      type: wer
      value: 36.7
---

# Moonshine-Tiny-DE: Fine-tuned German Speech Recognition

Fine-tuned [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for German automatic speech recognition.

## Model Details

- **Base model:** UsefulSensors/moonshine-tiny (27M parameters)
- **Language:** German (de)
- **Training data:** MLS German — 469,942 samples (~1,967 hours of audiobook speech)
- **WER:** 36.7% on MLS German test set (3,394 samples)
- **Training:** 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
- **Hardware:** Single NVIDIA RTX 5090 (32 GB), ~9.7 hours

## Usage

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
result = transcriber("german_audio.wav")
print(result["text"])
```

### Batch processing

```python
from pathlib import Path

audio_files = Path("./audio").glob("*.wav")
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")
```

### With explicit model loading

```python
import librosa
import torch
from transformers import AutoProcessor, MoonshineForConditionalGeneration

model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
model.eval()

# Load audio as 16 kHz mono (librosa resamples and downmixes as needed)
audio_array, _ = librosa.load("german_audio.wav", sr=16000, mono=True)

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=80)
text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```

## Training Details

### Approach

This model is **not** trained from scratch. We fine-tuned the English-only moonshine-tiny model to understand German: the pre-trained model already handled audio feature extraction, attention patterns, and tokenization, and we adapted it to German phonetics and vocabulary.

### Configuration

| Setting | Value |
|---------|-------|
| Optimizer | schedule-free AdamW |
| Learning rate | 3e-4 (constant after 300-step warmup) |
| Precision | bf16 |
| Batch size | 16 per device × 4 accumulation = 64 effective |
| Audio duration | 4–20 seconds |
| Gradient checkpointing | Disabled (broken with Moonshine in transformers 4.49) |
| Curriculum learning | Disabled (simple first run) |

### Training curve

| Step | Loss | WER |
|------|------|-----|
| 500 | 2.37 | — |
| 1,000 | 2.04 | 46.5% |
| 5,000 | ~1.65 | ~39% |
| 10,000 | 1.61 | **36.7%** |

### Error patterns

- Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
- Compound-word splitting errors: "herzaubern" → "herr sauben"
- Longer sequences degrade more than shorter ones
- Audiobook speech only — no conversational speech exposure

## Limitations

- **Audiobook speech only** — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
- **First training run** — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
- **No Common Voice data** — Mozilla pulled it from HuggingFace in Oct 2025, so the training set lacks speaker diversity.
- **HuggingFace transformers only** — this repository ships safetensors, not the `.ort` format required by the native `moonshine-voice` CLI. ONNX conversion is a planned next step.
## Fine-tuning toolkit

Trained using a fork of [Pierre Chéneau's finetune-moonshine-asr](https://github.com/pierre-cheneau/finetune-moonshine-asr) with German-specific adaptations:

- [Training config](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/configs/mls_cv_german_no_curriculum.yaml)
- [Data preparation script](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/scripts/prepare_german_dataset.py)
- [Full context & gotchas](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/contexts/moonshine_de_context.md)

## Acknowledgments

- [Moonshine AI / Useful Sensors](https://github.com/moonshine-ai/moonshine) for the base model
- [Pierre Chéneau](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the fine-tuning toolkit and [moonshine-tiny-fr](https://huggingface.co/Cornebidouil/moonshine-tiny-fr) (21.8% WER French reference)
- [German language support community (issue #141)](https://github.com/moonshine-ai/moonshine/issues/141)

## Citation

```bibtex
@misc{datta2026moonshine-tiny-de,
  author    = {Saurabh Datta},
  title     = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
}
```