---
license: mit
language:
- fr
metrics:
- wer
- cer
base_model:
- UsefulSensors/moonshine-tiny
pipeline_tag: automatic-speech-recognition
library_name: transformers
arxiv: https://arxiv.org/abs/2410.15608
datasets:
- facebook/multilingual_librispeech
tags:
- audio
- automatic-speech-recognition
- speech-to-text
- speech
- french
- moonshine
- asr
---

# Moonshine-Tiny-FR: French Speech Recognition Model

**Fine-tuned Moonshine ASR model for French language**

This is a fine-tuned version of [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) specifically optimized for French speech recognition. The model achieves state-of-the-art performance for its size (27M parameters) on French ASR tasks.

**Links:**
- [[Original Moonshine Blog]](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/)
- [[Original Paper]](https://arxiv.org/abs/2410.15608)
- [[Fine-Tuning Guide]](https://github.com/pierre-cheneau/finetune-moonshine-asr)

## Usage

### Installation
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio]
```

### Basic Usage

```python
from transformers import MoonshineForConditionalGeneration, AutoProcessor
import torch
import torchaudio

# Load model and processor
model = MoonshineForConditionalGeneration.from_pretrained('Cornebidouil/moonshine-tiny-fr')
processor = AutoProcessor.from_pretrained('Cornebidouil/moonshine-tiny-fr')

# Load and resample audio to 16kHz
audio, sr = torchaudio.load("french_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.mean(dim=0).numpy()  # Downmix to mono

# Prepare inputs
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate transcription
# Calculate max_new_tokens to avoid truncation (5 tokens per second is optimal for French)
audio_duration = len(audio) / 16000
max_new_tokens = int(audio_duration * 5)

generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
print(transcription)
```

### Advanced Usage

For production deployments, see the included [`inference.py`](https://github.com/pierre-cheneau/finetune-moonshine-asr/blob/main/scripts/inference.py) script in the [fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr), which provides:
- **Live transcription** with Voice Activity Detection
- **ONNX optimization** (20-30% faster)
- **Batch processing** scripts
- **A complete inference pipeline**
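For batch processing, long recordings are typically split into fixed-length chunks before transcription. A minimal chunking sketch (the `chunk_audio` helper below is hypothetical, not part of the repo's scripts):

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000, chunk_s: float = 30.0):
    """Split a mono waveform into fixed-length chunks (the last may be shorter)."""
    step = int(sr * chunk_s)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

# Example: a 45-second clip at 16 kHz yields a 30 s chunk and a 15 s chunk.
audio = np.zeros(45 * 16000, dtype=np.float32)
chunks = chunk_audio(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 15.0]
```

Each chunk can then be passed through the processor and `model.generate` as in the basic example above; a VAD-based splitter (as in `inference.py`) avoids cutting words at chunk boundaries.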

## Model Details

### Model Description

- **Base Model:** [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny)
- **Language:** French (fr)
- **Model Size:** 27M parameters
- **Fine-tuned on:** Multilingual LibriSpeech (MLS) French, re-segmented to match Moonshine's input requirements
- **Training Duration:** 8,000 steps
- **Optimizer:** Schedule-free AdamW
- **License:** MIT

### Model Architecture

Moonshine is a compact sequence-to-sequence ASR model designed for efficient on-device inference:
- **Encoder:** Convolutional feature extraction + Transformer blocks
- **Decoder:** Autoregressive Transformer decoder
- **Parameters:** 27M (tiny variant)
- **Input:** 16kHz mono audio
- **Output:** French text transcription

## Performance

### Evaluation Metrics

Evaluated on Multilingual LibriSpeech (MLS) French test set:

| Metric | Score |
|--------|-------|
| **Word Error Rate (WER)** | 21.8% |
| **Character Error Rate (CER)** | ~10% |
| **Real-Time Factor (RTF)** | 0.11x (CPU) |

**Inference Speed:** ~9x faster than real-time on CPU, enabling live transcription.
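WER counts word-level substitutions, deletions, and insertions, divided by the number of reference words. A minimal sketch of the metric (a plain word-level Levenshtein distance, not the evaluation code behind the scores above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over word sequences.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

print(wer("le chat dort", "le chien dort"))  # 1 substitution / 3 words ≈ 0.333
```

CER is computed the same way over characters instead of words. In practice, libraries such as `jiwer` apply text normalization before scoring, which can change the result noticeably.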

### Comparison

| Model | Size | Language | WER (MLS-FR) |
|-------|------|----------|--------------|
| Whisper-tiny | 39M | Multilingual | ~25% |
| **Moonshine-tiny-fr** | 27M | French | **21.8%** |
| Whisper-base | 74M | Multilingual | ~18% |

*Moonshine-tiny-fr achieves competitive performance with 30% fewer parameters than Whisper-tiny. It remains a proof of concept, however; building a larger and more robust training dataset should improve results further.*

## Training Details

The full training procedure is documented in the [fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr).

## Use Cases

### Primary Applications

✅ **French Speech Recognition**
- Real-time transcription
- Audio file transcription
- Voice commands
- Accessibility tools

✅ **Resource-Constrained Environments**
- On-device transcription (mobile, edge devices)
- Low-latency applications
- Offline transcription

✅ **Hogwarts Legacy SpellCaster**
- Ultra-lightweight and low latency spell speech recognition
- https://github.com/pierre-cheneau/HogwartsLegacy-SpellCaster

## Limitations and Biases

### Known Limitations for this tiny model

- **Hallucination:** Like all seq2seq models, may generate text not present in audio
- **Repetition:** May repeat phrases, especially with greedy decoding (use beam search)
- **Short Segments:** Performance may degrade on very short audio clips (<0.5s)
- **Domain Specificity:** Trained primarily on audiobooks (read speech)
- **Accents:** Best performance on metropolitan French; regional accents may have higher WER
- **Background Noise:** Performance degrades with significant background noise
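The repetition issue noted above can often be mitigated through `model.generate()` keyword arguments, switching from greedy decoding to beam search and penalizing repeated n-grams. The specific values below are illustrative defaults, not tuned settings from this model:

```python
# Illustrative decoding settings; pass them as model.generate(**inputs, **gen_kwargs).
gen_kwargs = {
    "num_beams": 5,              # beam search instead of greedy decoding
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram in the output
    "repetition_penalty": 1.2,   # discourage re-emitting recently generated tokens
}
```

Beam search is slower than greedy decoding, so for real-time use a smaller `num_beams` (or greedy decoding with only the n-gram constraint) may be a better trade-off.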

## Model Card Author

**Pierre Chéneau (Cornebidouil)**

Geologist; developer and maintainer of this fine-tuned French model.

**Links:**
- 🌐 [Personal Website](https://pcheneau.fr)
- 💼 [GitHub](https://github.com/pierre-cheneau)
- 📚 [Fine-tuning Guide](https://github.com/pierre-cheneau/finetune-moonshine-asr)

## Citations

### This Model

```bibtex
@misc{cheneau2026moonshine-tiny-fr,
  author = {Pierre Chéneau (Cornebidouil)},
  title = {Moonshine-Tiny-FR: Fine-tuned French Speech Recognition},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Cornebidouil/moonshine-tiny-fr}
}
```

### Fine tuning Guide

```bibtex
@misc{cheneau2026moonshine-finetune,
  author = {Pierre Chéneau (Cornebidouil)},
  title = {Moonshine ASR Fine-Tuning Guide},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/pierre-cheneau/finetune-moonshine-asr}
}
```

### Original Moonshine Model

```bibtex
@misc{jeffries2024moonshinespeechrecognitionlive,
      title={Moonshine: Speech Recognition for Live Transcription and Voice Commands},
      author={Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
      year={2024},
      eprint={2410.15608},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2410.15608},
}
```

### Multilingual LibriSpeech Dataset

```bibtex
@inproceedings{pratap2020mls,
  title={MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author={Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and Synnaeve, Gabriel and Collobert, Ronan},
  booktitle={Interspeech},
  year={2020}
}
```

## Additional Resources

- **Fine-Tuning Guide:** [Complete tutorial](https://github.com/pierre-cheneau/finetune-moonshine-asr)
- **Original Moonshine:** [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny)
- **Dataset:** [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- **Issues/Support:** [GitHub Issues](https://github.com/pierre-cheneau/finetune-moonshine-asr/issues)

## License

This model is released under the MIT License, consistent with the base Moonshine model.

```
MIT License

Copyright (c) 2026 Pierre Chéneau (Cornebidouil)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
```

## Acknowledgments

- **Useful Sensors** for the original Moonshine architecture and pre-trained model
- **Meta AI** for the Multilingual LibriSpeech dataset
- **HuggingFace** for the transformers library and model hosting
- **Schedule-Free Learning** for the optimizer implementation

---

**Questions?** Open an issue on the [fine-tuning guide repository](https://github.com/pierre-cheneau/finetune-moonshine-asr) or check the documentation.

**Want to fine-tune for your language?** See the [complete fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr).