	---
	license: mit
	language:
	- de
	tags:
	- automatic-speech-recognition
	- moonshine
	- german
	- asr
	- speech
	datasets:
	- facebook/multilingual_librispeech
	metrics:
	- wer
	base_model: UsefulSensors/moonshine-tiny
	model-index:
	- name: moonshine-tiny-de
	  results:
	  - task:
	      type: automatic-speech-recognition
	    dataset:
	      name: MLS German (test split)
	      type: facebook/multilingual_librispeech
	      args: german
	    metrics:
	    - name: WER
	      type: wer
	      value: 36.7
	---

	# Moonshine-Tiny-DE: Fine-tuned German Speech Recognition

	Fine-tuned [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for German automatic speech recognition.

	## Model Details

	- Base model: UsefulSensors/moonshine-tiny (27M parameters)
	- Language: German (de)
	- Training data: MLS German — 469,942 samples (~1,967 hours of audiobook speech)
	- WER: 36.7% on MLS German test set (3,394 samples)
	- Training: 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
	- Hardware: Single NVIDIA RTX 5090 (32 GB), ~9.7 hours

	## Usage

	```python
	from transformers import pipeline

	transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
	result = transcriber("german_audio.wav")
	print(result["text"])
	```

	### Batch processing

	```python
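	from pathlib import Path

	# Reuses the `transcriber` pipeline created in the Usage example above.
	audio_files = sorted(Path("./audio").glob("*.wav"))  # sorted for a stable order
	for audio in audio_files:
	    result = transcriber(str(audio))
	    print(f"{audio.name}: {result['text']}")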
	```

	### With explicit model loading

	```python
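	import librosa  # assumption: used only to load audio; any loader that yields a 16 kHz mono float array works
	import torch
	from transformers import AutoProcessor, MoonshineForConditionalGeneration

	model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
	processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
	model.eval()

	# Load the file as a 16 kHz mono waveform (librosa resamples and downmixes by default)
	audio_array, _ = librosa.load("german_audio.wav", sr=16000)

	inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
	with torch.no_grad():
	    # max_new_tokens=80 caps the transcript length; raise it for long clips
	    generated_ids = model.generate(**inputs, max_new_tokens=80)
	text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
	print(text)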
	```
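A fixed `max_new_tokens=80` can cut off long clips and leaves room for repetition loops on short ones. The upstream Moonshine examples for the English checkpoints scale the budget with audio length, at roughly 6.5 tokens per second. The sketch below assumes that heuristic carries over to this German fine-tune, which has not been verified here. `token_budget` is a hypothetical helper, not part of transformers.

```python
import math


def token_budget(num_samples: int, sampling_rate: int = 16000,
                 tokens_per_second: float = 6.5, minimum: int = 1) -> int:
    """Return a max_new_tokens value proportional to the clip duration.

    tokens_per_second=6.5 is an assumed heuristic borrowed from the
    English Moonshine examples; German text may need a different rate.
    """
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    duration_s = num_samples / sampling_rate
    # Round up so short clips still get at least `minimum` tokens.
    return max(minimum, math.ceil(duration_s * tokens_per_second))


# A 10-second clip sampled at 16 kHz
budget = token_budget(160_000)
```

The result can be passed to `model.generate` as `max_new_tokens`, for example `token_budget(len(audio_array))`.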

	## Training Details

	### Approach

	This model was not trained from scratch. We fine-tuned the English-only moonshine-tiny checkpoint on German speech. The pre-trained model already provided audio feature extraction, learned attention patterns, and a tokenizer; fine-tuning adapted them to German phonetics and vocabulary.

	### Configuration

	| Setting | Value |
	|---------|-------|
	| Optimizer | schedule-free AdamW |
	| Learning rate | 3e-4 (constant after 300-step warmup) |
	| Precision | bf16 |
	| Batch size | 16 per device × 4 accumulation = 64 effective |
	| Audio duration | 4–20 seconds |
	| Gradient checkpointing | Disabled (broken with Moonshine in transformers 4.49) |
	| Curriculum learning | Disabled (simple first run) |
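The batch-size, schedule, and duration rows can be checked with a little arithmetic. The sketch below assumes a linear warmup (the table gives only the warmup length, not its shape) and inclusive duration bounds. It combines the step count and sample count from Model Details to estimate how many passes over the data the run made. All names are illustrative and are not taken from the training config.

```python
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 4
TOTAL_STEPS = 10_000
TRAIN_SAMPLES = 469_942
PEAK_LR = 3e-4
WARMUP_STEPS = 300

# One optimizer step consumes 16 x 4 samples on the single GPU.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Total samples consumed over the whole run.
samples_seen = effective_batch * TOTAL_STEPS
# Approximate number of passes over the training set.
epochs = samples_seen / TRAIN_SAMPLES


def learning_rate(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumes a linear ramp to PEAK_LR over WARMUP_STEPS, then constant.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


def keep_clip(duration_s: float, min_s: float = 4.0, max_s: float = 20.0) -> bool:
    """Duration filter matching the 4-20 second row (inclusive bounds assumed)."""
    return min_s <= duration_s <= max_s


print(f"effective batch: {effective_batch}")
print(f"samples seen:    {samples_seen:,}")
print(f"epochs:          {epochs:.2f}")
```

Under these assumptions the run covers the training set a little more than once, which is consistent with the card's description of this as a first run with room for more training steps.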

	### Training curve

	| Step | Loss | WER |
	|------|------|-----|
	| 500 | 2.37 | — |
	| 1,000 | 2.04 | 46.5% |
	| 5,000 | ~1.65 | ~39% |
	| 10,000 | 1.61 | 36.7% |
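The WER column is the word error rate: substitutions, deletions, and insertions divided by the number of reference words. The sketch below is a minimal word-level edit-distance implementation for illustration only. It is not the evaluation code behind the numbers above, and it applies no text normalization (casing, punctuation), which real evaluations usually do. The example sentence is made up around the "herzaubern" error quoted in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus one insertion against four reference words
wer = word_error_rate("er wollte es herzaubern", "er wollte es herr sauben")
print(f"WER: {wer:.1%}")
```

A single split compound costs two errors here (one substitution and one insertion), which is why compound-splitting mistakes weigh heavily on German WER.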

	### Error patterns

	- Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
	- Compound word splitting errors: "herzaubern" → "herr sauben"
	- Longer utterances produce more errors than shorter ones
	- Training data is audiobook speech only, so the model has seen no conversational speech
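Because training clips were 4 to 20 seconds long and errors grow with length, one possible workaround for long recordings is to transcribe them in windows that stay within that range. The sketch below only computes sample boundaries. It is an assumption about how the model might be used, not something evaluated in this card. Fixed-length cuts can split words, so a silence-based or VAD-based splitter would likely do better. `chunk_bounds` is a hypothetical helper.

```python
def chunk_bounds(num_samples: int, sampling_rate: int = 16000,
                 max_seconds: float = 20.0) -> list[tuple[int, int]]:
    """Split a waveform of num_samples into consecutive (start, end) windows.

    Each window is at most max_seconds long; the last one may be shorter,
    possibly shorter than the 4-second lower bound used in training.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if sampling_rate <= 0 or max_seconds <= 0:
        raise ValueError("sampling_rate and max_seconds must be positive")
    window = int(max_seconds * sampling_rate)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 50-second recording at 16 kHz yields two 20 s windows and one 10 s window.
bounds = chunk_bounds(50 * 16000)
```

Each slice `audio_array[start:end]` can then be transcribed separately and the texts joined with spaces.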

	## Limitations

	- Audiobook speech only — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
	- First training run — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
	- No Common Voice data — Mozilla pulled it from HuggingFace in Oct 2025, so we lack speaker diversity.
	- HuggingFace transformers only — produces safetensors format, not the `.ort` format for the native `moonshine-voice` CLI. ONNX conversion is a planned next step.

	## Fine-tuning toolkit

	Trained using a fork of [Pierre Chéneau's finetune-moonshine-asr](https://github.com/pierre-cheneau/finetune-moonshine-asr) with German-specific adaptations:

	- [Training config](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/configs/mls_cv_german_no_curriculum.yaml)
	- [Data preparation script](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/scripts/prepare_german_dataset.py)
	- [Full context & gotchas](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/contexts/moonshine_de_context.md)

	## Acknowledgments

	- [Moonshine AI / Useful Sensors](https://github.com/moonshine-ai/moonshine) for the base model
	- [Pierre Chéneau](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the fine-tuning toolkit and [moonshine-tiny-fr](https://huggingface.co/Cornebidouil/moonshine-tiny-fr) (21.8% WER French reference)
	- [German language support community (issue #141)](https://github.com/moonshine-ai/moonshine/issues/141)

	## Citation

	```bibtex
	@misc{datta2026moonshine-tiny-de,
	author = {Saurabh Datta},
	title = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
	year = {2026},
	publisher = {HuggingFace},
	url = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
	}
	```