---
license: mit
language:
- de
tags:
- automatic-speech-recognition
- moonshine
- german
- asr
- speech
datasets:
- facebook/multilingual_librispeech
metrics:
- wer
base_model: UsefulSensors/moonshine-tiny
model-index:
- name: moonshine-tiny-de
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: MLS German (test split)
      type: facebook/multilingual_librispeech
      args: german
    metrics:
    - name: WER
      type: wer
      value: 36.7
---
# Moonshine-Tiny-DE: Fine-tuned German Speech Recognition
Fine-tuned [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for German automatic speech recognition.
## Model Details
- **Base model:** UsefulSensors/moonshine-tiny (27M parameters)
- **Language:** German (de)
- **Training data:** MLS German — 469,942 samples (~1,967 hours of audiobook speech)
- **WER:** 36.7% on MLS German test set (3,394 samples)
- **Training:** 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
- **Hardware:** Single NVIDIA RTX 5090 (32 GB), ~9.7 hours
## Usage
```python
from transformers import pipeline
transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
result = transcriber("german_audio.wav")
print(result["text"])
```
### Batch processing
```python
from pathlib import Path
# Reuses the `transcriber` pipeline created in the Usage example
audio_files = sorted(Path("./audio").glob("*.wav"))
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")
```
### With explicit model loading
```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import torch
model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
model.eval()
# audio_array: 1-D float waveform you supply (mono, sampled at 16 kHz)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=80)
text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```
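The explicit-loading snippet expects `audio_array` to already hold a 16 kHz mono waveform. One possible way to produce it with only the standard library is sketched below. The helper name `load_wav_16k_mono` is hypothetical, and the sketch assumes the file is uncompressed 16-bit PCM. It rejects anything else instead of resampling or downmixing.

```python
import sys
import wave
from array import array


def load_wav_16k_mono(path):
    """Read a 16-bit PCM mono WAV at 16 kHz into floats in [-1.0, 1.0)."""
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getframerate() != 16000:
            raise ValueError("expected a 16 kHz sampling rate")
        raw = wav.readframes(wav.getnframes())

    samples = array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV stores samples little-endian
    # Scale integer samples to floating point
    return [sample / 32768.0 for sample in samples]
```

With this helper, `audio_array = load_wav_16k_mono("german_audio.wav")` can be passed to the processor call above. Audio at other sampling rates or with multiple channels would need resampling or downmixing first, which this sketch does not attempt.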
## Training Details
### Approach
This model is **not** trained from scratch. We fine-tuned the English-only moonshine-tiny checkpoint on German speech. The pre-trained model already provided audio feature extraction, attention patterns, and a tokenizer, so training only had to adapt it to German phonetics and vocabulary.
### Configuration
| Setting | Value |
|---------|-------|
| Optimizer | schedule-free AdamW |
| Learning rate | 3e-4 (constant after 300-step warmup) |
| Precision | bf16 |
| Batch size | 16 per device × 4 accumulation = 64 effective |
| Audio duration | 4–20 seconds |
| Gradient checkpointing | Disabled (broken with Moonshine in transformers 4.49) |
| Curriculum learning | Disabled (simple first run) |
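The numbers in the configuration table can be sanity-checked with a little arithmetic. The sketch below assumes that each of the 10,000 steps is one optimizer update after gradient accumulation, that all 469,942 MLS samples were used, and that the warmup ramps linearly. None of these is confirmed by the table itself, so treat the output as an estimate.

```python
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 4
TOTAL_STEPS = 10_000      # assumed to count optimizer updates
TRAIN_SAMPLES = 469_942   # assumed to be the full training pool
PEAK_LR = 3e-4
WARMUP_STEPS = 300

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
samples_seen = effective_batch * TOTAL_STEPS
approx_epochs = samples_seen / TRAIN_SAMPLES


def lr_at(step):
    """Assumed schedule: linear ramp to PEAK_LR over WARMUP_STEPS, then constant."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


print(effective_batch)           # 64
print(samples_seen)              # 640000
print(round(approx_epochs, 2))   # 1.36
```

Under these assumptions the run covers roughly 1.4 passes over the data, which is consistent with the note that more training steps may still lower the WER.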
### Training curve
| Step | Loss | WER |
|------|------|-----|
| 500 | 2.37 | — |
| 1,000 | 2.04 | 46.5% |
| 5,000 | ~1.65 | ~39% |
| 10,000 | 1.61 | **36.7%** |
### Error patterns
- Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
- Compound word splitting errors: "herzaubern" → "herr sauben"
- Longer sequences degrade more than shorter ones
- Audiobook speech only — no conversational speech exposure
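For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function below is a minimal, dependency-free sketch of that definition, not the evaluation script used for this model. It splits on whitespace only and applies no text normalization, so its scores may differ from the reported figures.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# The compound-splitting error from the list above: one substitution plus
# one insertion against a single reference word.
print(word_error_rate("herzaubern", "herr sauben"))          # 2.0
print(word_error_rate("das ist ein test", "das ist test"))   # 0.25
```

The first example shows why compound splitting is costly: a single misrecognized compound can count as two errors, so WER on one utterance can exceed 100%.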
## Limitations
- **Audiobook speech only** — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
- **First training run** — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
- **No Common Voice data** — Mozilla pulled it from HuggingFace in Oct 2025, so we lack speaker diversity.
- **Hugging Face transformers only:** the checkpoint is published in safetensors format, not the `.ort` format used by the native `moonshine-voice` CLI. ONNX conversion is a planned next step.
## Fine-tuning toolkit
Trained using a fork of [Pierre Chéneau's finetune-moonshine-asr](https://github.com/pierre-cheneau/finetune-moonshine-asr) with German-specific adaptations:
- [Training config](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/configs/mls_cv_german_no_curriculum.yaml)
- [Data preparation script](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/scripts/prepare_german_dataset.py)
- [Full context & gotchas](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/contexts/moonshine_de_context.md)
## Acknowledgments
- [Moonshine AI / Useful Sensors](https://github.com/moonshine-ai/moonshine) for the base model
- [Pierre Chéneau](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the fine-tuning toolkit and [moonshine-tiny-fr](https://huggingface.co/Cornebidouil/moonshine-tiny-fr) (21.8% WER French reference)
- [German language support community (issue #141)](https://github.com/moonshine-ai/moonshine/issues/141)
## Citation
```bibtex
@misc{datta2026moonshine-tiny-de,
  author = {Saurabh Datta},
  title = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
}
```