Upload folder using huggingface_hub

Files changed (9)

README.md +147 -0
config.json +34 -0
generation_config.json +14 -0
model.safetensors +3 -0
preprocessor_config.json +10 -0
special_tokens_map.json +23 -0
tokenizer.json +0 -0
tokenizer_config.json +0 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,147 @@

+---
+license: mit
+language:
+- de
+tags:
+- automatic-speech-recognition
+- moonshine
+- german
+- asr
+- speech
+datasets:
+- facebook/multilingual_librispeech
+metrics:
+- wer
+base_model: UsefulSensors/moonshine-tiny
+model-index:
+- name: moonshine-tiny-de
+  results:
+  - task:
+      type: automatic-speech-recognition
+    dataset:
+      name: MLS German (test split)
+      type: facebook/multilingual_librispeech
+      args: german
+    metrics:
+    - name: WER
+      type: wer
+      value: 36.7
+---
+# Moonshine-Tiny-DE: Fine-tuned German Speech Recognition
+Fine-tuned [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for German automatic speech recognition.
+## Model Details
+- **Base model:** UsefulSensors/moonshine-tiny (27M parameters)
+- **Language:** German (de)
+- **Training data:** MLS German — 469,942 samples (~1,967 hours of audiobook speech)
+- **WER:** 36.7% on MLS German test set (3,394 samples)
+- **Training:** 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
+- **Hardware:** Single NVIDIA RTX 5090 (32 GB), ~9.7 hours
+## Usage
+```python
+from transformers import pipeline
+transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
+result = transcriber("german_audio.wav")
+print(result["text"])
+```
+### Batch processing
+```python
+from pathlib import Path
+# Reuses the `transcriber` pipeline created in the Usage snippet above
+audio_files = sorted(Path("./audio").glob("*.wav"))
+for audio in audio_files:
+    result = transcriber(str(audio))
+    print(f"{audio.name}: {result['text']}")
+```
+### With explicit model loading
+```python
+import soundfile as sf  # assumption: any loader that returns a float waveform works here
+import torch
+from transformers import AutoProcessor, MoonshineForConditionalGeneration
+model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
+processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
+model.eval()
+# Load the audio; the model expects 16 kHz mono input
+audio_array, sampling_rate = sf.read("german_audio.wav", dtype="float32")
+inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
+with torch.no_grad():
+    generated_ids = model.generate(**inputs, max_new_tokens=80)
+text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
+print(text)
+```
+## Training Details
+### Approach
+This model is **not** trained from scratch. We fine-tuned the English-only moonshine-tiny model to transcribe German. The pre-trained model already provided audio feature extraction, attention patterns, and a tokenizer; fine-tuning adapted it to German phonetics and vocabulary.
+### Configuration
+| Setting | Value |
+|---------|-------|
+| Optimizer | schedule-free AdamW |
+| Learning rate | 3e-4 (constant after 300-step warmup) |
+| Precision | bf16 |
+| Batch size | 16 per device × 4 accumulation = 64 effective |
+| Audio duration | 4–20 seconds |
+| Gradient checkpointing | Disabled (broken with Moonshine in transformers 4.49) |
+| Curriculum learning | Disabled (simple first run) |
+### Training curve
+| Step | Loss | WER |
+|------|------|-----|
+| 500 | 2.37 | — |
+| 1,000 | 2.04 | 46.5% |
+| 5,000 | ~1.65 | ~39% |
+| 10,000 | 1.61 | **36.7%** |
+### Error patterns
+- Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
+- Compound word splitting errors: "herzaubern" → "herr sauben"
+- Longer sequences degrade more than shorter ones
+- Audiobook speech only — no conversational speech exposure
+## Limitations
+- **Audiobook speech only** — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
+- **First training run** — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
+- **No Common Voice data** — Mozilla pulled it from HuggingFace in Oct 2025, so we lack speaker diversity.
+- **HuggingFace transformers only** — produces safetensors format, not the `.ort` format for the native `moonshine-voice` CLI. ONNX conversion is a planned next step.
+## Fine-tuning toolkit
+Trained using a fork of [Pierre Chéneau's finetune-moonshine-asr](https://github.com/pierre-cheneau/finetune-moonshine-asr) with German-specific adaptations:
+- [Training config](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/configs/mls_cv_german_no_curriculum.yaml)
+- [Data preparation script](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/scripts/prepare_german_dataset.py)
+- [Full context & gotchas](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/contexts/moonshine_de_context.md)
+## Acknowledgments
+- [Moonshine AI / Useful Sensors](https://github.com/moonshine-ai/moonshine) for the base model
+- [Pierre Chéneau](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the fine-tuning toolkit and [moonshine-tiny-fr](https://huggingface.co/Cornebidouil/moonshine-tiny-fr) (21.8% WER French reference)
+- [German language support community (issue #141)](https://github.com/moonshine-ai/moonshine/issues/141)
+## Citation
+```bibtex
+@misc{datta2026moonshine-tiny-de,
+  author = {Saurabh Datta},
+  title = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
+  year = {2026},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
+}
+```
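
The README reports WER but does not show how the metric is computed. The following is a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words). The function name `word_error_rate` is hypothetical, and this is not the evaluation script behind the reported 36.7%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# The compound-splitting example from the README: one reference word
# turns into two wrong words (one substitution plus one insertion).
print(word_error_rate("herzaubern", "herr sauben"))  # 2.0
```

Under this definition a single split compound costs two errors against one reference word, which is one reason compound splitting weighs heavily on German WER.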

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "_name_or_path": "UsefulSensors/moonshine-tiny",
+  "architectures": [
+    "MoonshineForConditionalGeneration"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "decoder_hidden_act": "silu",
+  "decoder_num_attention_heads": 8,
+  "decoder_num_hidden_layers": 6,
+  "decoder_num_key_value_heads": 8,
+  "decoder_start_token_id": 1,
+  "encoder_hidden_act": "gelu",
+  "encoder_num_attention_heads": 8,
+  "encoder_num_hidden_layers": 6,
+  "encoder_num_key_value_heads": 8,
+  "eos_token_id": 2,
+  "hidden_size": 288,
+  "initializer_range": 0.02,
+  "intermediate_size": 1152,
+  "is_encoder_decoder": true,
+  "max_position_embeddings": 194,
+  "model_type": "moonshine",
+  "pad_head_dim_to_multiple_of": 8,
+  "pad_token_id": 2,
+  "partial_rotary_factor": 0.9,
+  "rope_scaling": null,
+  "rope_theta": 10000.0,
+  "torch_dtype": "float32",
+  "transformers_version": "4.49.0",
+  "use_cache": false,
+  "vocab_size": 32768
+}
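
A few shapes follow directly from the values in this config. The sketch below derives them with plain arithmetic. It assumes that `pad_head_dim_to_multiple_of` rounds the per-head dimension up to the next multiple of the given value. The helper name `derived_shapes` is hypothetical.

```python
def derived_shapes(hidden_size: int, num_heads: int,
                   intermediate_size: int, pad_multiple: int) -> dict:
    """Derive per-head and MLP sizes from config.json values."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by the number of heads")
    head_dim = hidden_size // num_heads
    # Round up to the next multiple of pad_multiple (ceiling division).
    padded_head_dim = -(-head_dim // pad_multiple) * pad_multiple
    return {
        "head_dim": head_dim,
        "padded_head_dim": padded_head_dim,
        "mlp_expansion": intermediate_size / hidden_size,
    }


# Values copied from config.json above.
print(derived_shapes(hidden_size=288, num_heads=8,
                     intermediate_size=1152, pad_multiple=8))
```

With these values each attention head has 36 dimensions (40 after padding), and the feed-forward layers expand the hidden size by a factor of 4.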

generation_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "decoder_start_token_id": 1,
+  "early_stopping": true,
+  "eos_token_id": 2,
+  "length_penalty": 1.2,
+  "max_length": 194,
+  "no_repeat_ngram_size": 2,
+  "num_beams": 5,
+  "pad_token_id": 2,
+  "repetition_penalty": 1.2,
+  "transformers_version": "4.49.0"
+}
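
The special-token IDs appear in both `config.json` and `generation_config.json`, so the two files can drift apart after manual edits. The sketch below checks that they agree. It uses inline copies of the relevant values rather than reading the files, and it treats `max_length <= max_position_embeddings` as a reasonable sanity check rather than a hard requirement of the library. The helper name `find_mismatches` is hypothetical.

```python
# Relevant values copied from the two JSON files in this commit.
model_config = {
    "bos_token_id": 1,
    "decoder_start_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "max_position_embeddings": 194,
}
generation_config = {
    "bos_token_id": 1,
    "decoder_start_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 2,
    "max_length": 194,
}

SHARED_KEYS = ("bos_token_id", "decoder_start_token_id",
               "eos_token_id", "pad_token_id")


def find_mismatches(model_cfg: dict, gen_cfg: dict) -> list:
    """Return human-readable descriptions of inconsistencies (empty if none)."""
    problems = [
        f"{key}: model={model_cfg[key]} generation={gen_cfg[key]}"
        for key in SHARED_KEYS
        if model_cfg[key] != gen_cfg[key]
    ]
    # Assumed sanity check: do not generate past the configured position range.
    if gen_cfg["max_length"] > model_cfg["max_position_embeddings"]:
        problems.append("max_length exceeds max_position_embeddings")
    return problems


print(find_mismatches(model_config, generation_config))  # []
```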

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4d6b8b2b6000bc3cb9ced7a3a5341de62e8689d5b50d2d7d17e6bfce93ea39a5
+size 108389192
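
The LFS pointer size can be cross-checked against the README's "27M parameters" figure. The sketch below assumes every tensor is stored as float32 (as `torch_dtype` in `config.json` suggests) and ignores the small JSON header that a safetensors file carries, so the result is a slight overestimate. The helper name `approx_parameter_count` is hypothetical.

```python
SAFETENSORS_BYTES = 108_389_192  # "size" field of the LFS pointer above
BYTES_PER_FLOAT32 = 4            # config.json lists torch_dtype float32


def approx_parameter_count(file_bytes: int, bytes_per_param: int) -> int:
    """Estimate the parameter count from file size, ignoring the file header."""
    return file_bytes // bytes_per_param


params = approx_parameter_count(SAFETENSORS_BYTES, BYTES_PER_FLOAT32)
print(f"{params / 1e6:.1f}M parameters")  # 27.1M parameters
```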

preprocessor_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "do_normalize": false,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "Wav2Vec2Processor",
+  "return_attention_mask": true,
+  "sampling_rate": 16000
+}
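
The preprocessor expects 16 kHz input, and the README states that training clips were 4 to 20 seconds long. The sketch below checks a WAV file against those values before transcription, using only the standard library. The helper names `clip_checks` and `describe_wav` are hypothetical, and treating clips outside the training range as suspect is an assumption, not a documented limit of the model.

```python
import wave

EXPECTED_RATE = 16_000                # "sampling_rate" in preprocessor_config.json
MIN_SECONDS, MAX_SECONDS = 4.0, 20.0  # training clip range stated in the README


def clip_checks(rate: int, channels: int, frames: int) -> dict:
    """Compare basic audio properties with what the model saw during training."""
    seconds = frames / rate
    return {
        "rate_ok": rate == EXPECTED_RATE,
        "mono": channels == 1,
        "in_training_range": MIN_SECONDS <= seconds <= MAX_SECONDS,
        "seconds": seconds,
    }


def describe_wav(path: str) -> dict:
    """Read the WAV header and run the checks above."""
    with wave.open(path, "rb") as wav:
        return clip_checks(wav.getframerate(), wav.getnchannels(), wav.getnframes())
```

Calling `describe_wav("german_audio.wav")` on a file that fails `rate_ok` or `mono` indicates that resampling or downmixing is probably needed first.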

special_tokens_map.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

The diff for this file is too large to render. See raw diff

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:19d2560fe6bf2bee833189dd8686745cbe25f3f0ef0bc843715b5bcdd94c5bf4
+size 5905
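
The README's configuration table gives enough numbers to estimate how much data the run actually covered. The sketch below works through that arithmetic. It assumes the "10,000 steps" are optimizer steps (one step per effective batch of 64) and that training used a single device, as the hardware line suggests. The helper name `training_budget` is hypothetical.

```python
def training_budget(per_device_batch: int, grad_accum: int, devices: int,
                    steps: int, train_samples: int, train_hours: float) -> dict:
    """Derive batch size, data coverage, and mean clip length from README values."""
    effective_batch = per_device_batch * grad_accum * devices
    samples_seen = effective_batch * steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / train_samples,
        "avg_clip_seconds": train_hours * 3600 / train_samples,
    }


# Values taken from the README's Model Details and Configuration sections.
budget = training_budget(per_device_batch=16, grad_accum=4, devices=1,
                         steps=10_000, train_samples=469_942, train_hours=1_967)
print(budget)
```

Under these assumptions the run saw about 640,000 samples, roughly 1.4 passes over the training set, with an average clip length of about 15 seconds.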