dattazigzag committed on
Commit 669f0bf · verified · 1 Parent(s): 2e5b260

Upload folder using huggingface_hub
README.md ADDED
---
license: mit
language:
- de
tags:
- automatic-speech-recognition
- moonshine
- german
- asr
- speech
datasets:
- facebook/multilingual_librispeech
metrics:
- wer
base_model: UsefulSensors/moonshine-tiny
model-index:
- name: moonshine-tiny-de
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: MLS German (test split)
      type: facebook/multilingual_librispeech
      args: german
    metrics:
    - name: WER
      type: wer
      value: 36.7
---

# Moonshine-Tiny-DE: Fine-tuned German Speech Recognition

Fine-tuned [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) for German automatic speech recognition.

## Model Details

- **Base model:** UsefulSensors/moonshine-tiny (27M parameters)
- **Language:** German (de)
- **Training data:** MLS German — 469,942 samples (~1,967 hours of audiobook speech)
- **WER:** 36.7% on the MLS German test set (3,394 samples)
- **Training:** 10,000 steps, schedule-free AdamW, bf16, effective batch size 64
- **Hardware:** single NVIDIA RTX 5090 (32 GB), ~9.7 hours

## Usage

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")
result = transcriber("german_audio.wav")
print(result["text"])
```

### Batch processing

```python
from pathlib import Path

# Reuses the `transcriber` pipeline created in the snippet above
audio_files = Path("./audio").glob("*.wav")
for audio in audio_files:
    result = transcriber(str(audio))
    print(f"{audio.name}: {result['text']}")
```

### With explicit model loading

```python
from transformers import AutoProcessor, MoonshineForConditionalGeneration
import soundfile as sf
import torch

model = MoonshineForConditionalGeneration.from_pretrained("dattazigzag/moonshine-tiny-de")
processor = AutoProcessor.from_pretrained("dattazigzag/moonshine-tiny-de")
model.eval()

# Load audio as a float array (expects 16 kHz mono WAV);
# `soundfile` is one common way to do this
audio_array, sample_rate = sf.read("german_audio.wav")
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=80)
text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```

## Training Details

### Approach

This is **not** trained from scratch. We fine-tuned the English-only moonshine-tiny model to understand German: the pre-trained model already provides audio feature extraction, attention patterns, and tokenization, and we adapted it to German phonetics and vocabulary.

### Configuration

| Setting | Value |
|---------|-------|
| Optimizer | schedule-free AdamW |
| Learning rate | 3e-4 (constant after 300-step warmup) |
| Precision | bf16 |
| Batch size | 16 per device × 4 accumulation = 64 effective |
| Audio duration | 4–20 seconds |
| Gradient checkpointing | Disabled (broken with Moonshine in transformers 4.49) |
| Curriculum learning | Disabled (simple first run) |

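As a rough sketch, the settings in this table map onto `transformers`' `Seq2SeqTrainingArguments` as follows. This is an illustration, not the project's actual training script (that lives in the fine-tuning toolkit repo linked below); in particular, the `optim` value and scheduler choice are assumptions about how transformers 4.49 exposes schedule-free AdamW.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the table above; names/values are assumptions, not the real config.
args = Seq2SeqTrainingArguments(
    output_dir="moonshine-tiny-de",
    per_device_train_batch_size=16,            # x4 accumulation = 64 effective
    gradient_accumulation_steps=4,
    learning_rate=3e-4,
    warmup_steps=300,
    lr_scheduler_type="constant_with_warmup",  # constant after 300-step warmup
    max_steps=10_000,
    bf16=True,
    optim="schedule_free_adamw",               # assumption: needs the `schedulefree` package
    gradient_checkpointing=False,              # broken with Moonshine in transformers 4.49
)
```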
### Training curve

| Step | Loss | WER |
|------|------|-----|
| 500 | 2.37 | — |
| 1,000 | 2.04 | 46.5% |
| 5,000 | ~1.65 | ~39% |
| 10,000 | 1.61 | **36.7%** |

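For reference, WER here is the standard word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The project presumably computes it with a library such as `jiwer` or `evaluate`; this minimal implementation just pins down the metric being reported:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,               # deletion of r
                d[j - 1] + 1,           # insertion of h
                prev_diag + (r != h),   # substitution (or match when r == h)
            )
    return d[-1] / len(ref)
```

For example, one substituted word in a three-word reference gives a WER of 1/3.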
### Error patterns

- Phonetically similar confusions: b/p, d/t, ck/x (classic German ASR challenges)
- Compound-word splitting errors: "herzaubern" → "herr sauben"
- Longer sequences degrade more than shorter ones
- Audiobook speech only — no conversational speech exposure

## Limitations

- **Audiobook speech only** — trained on MLS (read speech). May underperform on conversational, noisy, or accented German.
- **First training run** — WER can likely be improved with curriculum learning, more training steps, or additional data sources (SWC, VoxPopuli, Bundestag).
- **No Common Voice data** — Mozilla pulled it from HuggingFace in Oct 2025, so we lack speaker diversity.
- **HuggingFace transformers only** — produces safetensors format, not the `.ort` format for the native `moonshine-voice` CLI. ONNX conversion is a planned next step.

## Fine-tuning toolkit

Trained using a fork of [Pierre Chéneau's finetune-moonshine-asr](https://github.com/pierre-cheneau/finetune-moonshine-asr) with German-specific adaptations:

- [Training config](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/configs/mls_cv_german_no_curriculum.yaml)
- [Data preparation script](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/scripts/prepare_german_dataset.py)
- [Full context & gotchas](https://github.com/zigzagGmbH/finetune-moonshine-asr/blob/main/contexts/moonshine_de_context.md)

## Acknowledgments

- [Moonshine AI / Useful Sensors](https://github.com/moonshine-ai/moonshine) for the base model
- [Pierre Chéneau](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the fine-tuning toolkit and [moonshine-tiny-fr](https://huggingface.co/Cornebidouil/moonshine-tiny-fr) (21.8% WER French reference)
- [German language support community (issue #141)](https://github.com/moonshine-ai/moonshine/issues/141)

## Citation

```bibtex
@misc{datta2026moonshine-tiny-de,
  author    = {Saurabh Datta},
  title     = {Moonshine-Tiny-DE: Fine-tuned German Speech Recognition},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/dattazigzag/moonshine-tiny-de}
}
```
config.json ADDED
{
  "_name_or_path": "UsefulSensors/moonshine-tiny",
  "architectures": [
    "MoonshineForConditionalGeneration"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "decoder_hidden_act": "silu",
  "decoder_num_attention_heads": 8,
  "decoder_num_hidden_layers": 6,
  "decoder_num_key_value_heads": 8,
  "decoder_start_token_id": 1,
  "encoder_hidden_act": "gelu",
  "encoder_num_attention_heads": 8,
  "encoder_num_hidden_layers": 6,
  "encoder_num_key_value_heads": 8,
  "eos_token_id": 2,
  "hidden_size": 288,
  "initializer_range": 0.02,
  "intermediate_size": 1152,
  "is_encoder_decoder": true,
  "max_position_embeddings": 194,
  "model_type": "moonshine",
  "pad_head_dim_to_multiple_of": 8,
  "pad_token_id": 2,
  "partial_rotary_factor": 0.9,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "use_cache": false,
  "vocab_size": 32768
}
generation_config.json ADDED
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "decoder_start_token_id": 1,
  "early_stopping": true,
  "eos_token_id": 2,
  "length_penalty": 1.2,
  "max_length": 194,
  "no_repeat_ngram_size": 2,
  "num_beams": 5,
  "pad_token_id": 2,
  "repetition_penalty": 1.2,
  "transformers_version": "4.49.0"
}
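These defaults make the checkpoint decode with 5-beam search plus repetition penalties, which favors accuracy over latency. They can be overridden per call without editing this file; a minimal sketch (assuming the standard `pipeline` API, not benchmarked against this checkpoint):

```python
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="dattazigzag/moonshine-tiny-de")

# Override the checkpoint's generation defaults for this call only:
# greedy decoding (num_beams=1) trades some accuracy for lower latency.
result = transcriber(
    "german_audio.wav",
    generate_kwargs={"num_beams": 1, "repetition_penalty": 1.0},
)
print(result["text"])
```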
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4d6b8b2b6000bc3cb9ced7a3a5341de62e8689d5b50d2d7d17e6bfce93ea39a5
size 108389192
preprocessor_config.json ADDED
{
  "do_normalize": false,
  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "processor_class": "Wav2Vec2Processor",
  "return_attention_mask": true,
  "sampling_rate": 16000
}
special_tokens_map.json ADDED
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:19d2560fe6bf2bee833189dd8686745cbe25f3f0ef0bc843715b5bcdd94c5bf4
size 5905