---
license: mit
language:
- fr
metrics:
- wer
- cer
base_model:
- UsefulSensors/moonshine-tiny
pipeline_tag: automatic-speech-recognition
library_name: transformers
arxiv: https://arxiv.org/abs/2410.15608
datasets:
- facebook/multilingual_librispeech
tags:
- audio
- automatic-speech-recognition
- speech-to-text
- speech
- french
- moonshine
- asr
---

# Moonshine-Tiny-FR: French Speech Recognition Model

**Fine-tuned Moonshine ASR model for French**

This is a fine-tuned version of [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny) optimized specifically for French speech recognition. The model achieves state-of-the-art performance for its size (27M parameters) on French ASR tasks.

**Links:**
- [Original Moonshine Blog](https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/)
- [Original Paper](https://arxiv.org/abs/2410.15608)
- [Fine-Tuning Guide](https://github.com/pierre-cheneau/finetune-moonshine-asr)

## Usage

### Installation

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio]
```

### Basic Usage

```python
from transformers import MoonshineForConditionalGeneration, AutoProcessor
import torch
import torchaudio

# Load model and processor
model = MoonshineForConditionalGeneration.from_pretrained('Cornebidouil/moonshine-tiny-fr')
processor = AutoProcessor.from_pretrained('Cornebidouil/moonshine-tiny-fr')

# Load and resample audio to 16 kHz
audio, sr = torchaudio.load("french_audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio[0].numpy()  # Keep the first channel (mono)

# Prepare inputs
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate transcription.
# Derive max_new_tokens from the audio duration to avoid truncation
# (about 5 tokens per second works well for French).
audio_duration = len(audio) / 16000
max_new_tokens = int(audio_duration * 5)
generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
transcription = processor.decode(generated_ids[0], skip_special_tokens=True)
print(transcription)
```

### Advanced Usage

For production deployments with:

- **Live transcription** with voice activity detection
- **ONNX optimization** (20-30% faster)
- **Batch processing** scripts
- **A complete inference pipeline**

see the [`inference.py`](https://github.com/pierre-cheneau/finetune-moonshine-asr/blob/main/scripts/inference.py) script in the [fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr).

## Model Details

### Model Description

- **Base Model:** [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny)
- **Language:** French (fr)
- **Model Size:** 27M parameters
- **Fine-tuned on:** the Multilingual LibriSpeech (MLS) French dataset, re-segmented to fit the input requirements of the Moonshine model
- **Training Duration:** 8,000 steps
- **Optimizer:** schedule-free AdamW
- **License:** MIT

### Model Architecture

Moonshine is a compact sequence-to-sequence ASR model designed for efficient on-device inference:

- **Encoder:** convolutional feature extraction followed by Transformer blocks
- **Decoder:** autoregressive Transformer decoder
- **Parameters:** 27M (tiny variant)
- **Input:** 16 kHz mono audio
- **Output:** French text transcription

## Performance

### Evaluation Metrics

Evaluated on the Multilingual LibriSpeech (MLS) French test set:

| Metric | Score |
|--------|-------|
| **Word Error Rate (WER)** | 21.8% |
| **Character Error Rate (CER)** | ~10% |
| **Real-Time Factor (RTF)** | 0.11 (CPU) |

**Inference Speed:** ~9x faster than real-time on CPU, enabling live transcription.
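The WER reported above is the word-level edit distance between the reference and hypothesis transcripts, normalized by the number of reference words (CER is the same computation over characters). A minimal, dependency-free illustration of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the first i reference words
    # and the first j hypothesis words (rolling 1-D dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,              # deletion
                d[j - 1] + 1,          # insertion
                prev_diag + (r != h),  # substitution (or match)
            )
            prev_diag, d[j] = d[j], cur
    return d[-1] / max(len(ref), 1)

# One substitution ("dort" -> "dors") and one deletion ("le"):
# 2 errors over 6 reference words
print(wer("le chat dort sur le tapis", "le chat dors sur tapis"))
```

In practice, a library such as `jiwer` is typically used for scoring, with text normalization (casing, punctuation) applied to both transcripts beforehand.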
### Comparison

| Model | Size | Language | WER (MLS-FR) |
|-------|------|----------|--------------|
| Whisper-tiny | 39M | Multilingual | ~25% |
| **Moonshine-tiny-fr** | 27M | French | **21.8%** |
| Whisper-base | 74M | Multilingual | ~18% |

*Moonshine-tiny-fr achieves competitive performance with 30% fewer parameters than Whisper-tiny. This model is a proof of concept; further work on building a larger, more robust training dataset should improve results.*

## Training Details / Fine-Tuning

Please refer to the [fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr) for the full training procedure.

## Use Cases

### Primary Applications

✅ **French Speech Recognition**
- Real-time transcription
- Audio file transcription
- Voice commands
- Accessibility tools

✅ **Resource-Constrained Environments**
- On-device transcription (mobile, edge devices)
- Low-latency applications
- Offline transcription

✅ **Hogwarts Legacy SpellCaster**
- Ultra-lightweight, low-latency spell speech recognition
- https://github.com/pierre-cheneau/HogwartsLegacy-SpellCaster

## Limitations and Biases

### Known Limitations of This Tiny Model

- **Hallucination:** like all seq2seq models, it may generate text not present in the audio
- **Repetition:** may repeat phrases, especially with greedy decoding (use beam search)
- **Short Segments:** performance may degrade on very short audio clips (<0.5 s)
- **Domain Specificity:** trained primarily on audiobooks (read speech)
- **Accents:** best performance on metropolitan French; regional accents may have higher WER
- **Background Noise:** performance degrades with significant background noise

## Model Card Author

**Pierre Chéneau (Cornebidouil)**

Geologist, developer, and maintainer of this fine-tuned French model.
**Links:**
- 🌐 [Personal Website](https://pcheneau.fr)
- 💼 [GitHub](https://github.com/pierre-cheneau)
- 📚 [Fine-tuning Guide](https://github.com/pierre-cheneau/finetune-moonshine-asr)

## Citations

### This Model

```bibtex
@misc{cheneau2026moonshine-tiny-fr,
  author    = {Pierre Chéneau (Cornebidouil)},
  title     = {Moonshine-Tiny-FR: Fine-tuned French Speech Recognition},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Cornebidouil/moonshine-tiny-fr}
}
```

### Fine-Tuning Guide

```bibtex
@misc{cheneau2026moonshine-finetune,
  author    = {Pierre Chéneau (Cornebidouil)},
  title     = {Moonshine ASR Fine-Tuning Guide},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/pierre-cheneau/finetune-moonshine-asr}
}
```

### Original Moonshine Model

```bibtex
@misc{jeffries2024moonshinespeechrecognitionlive,
  title         = {Moonshine: Speech Recognition for Live Transcription and Voice Commands},
  author        = {Nat Jeffries and Evan King and Manjunath Kudlur and Guy Nicholson and James Wang and Pete Warden},
  year          = {2024},
  eprint        = {2410.15608},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2410.15608}
}
```

### Multilingual LibriSpeech Dataset

```bibtex
@inproceedings{pratap2020mls,
  title     = {MLS: A Large-Scale Multilingual Dataset for Speech Research},
  author    = {Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and Synnaeve, Gabriel and Collobert, Ronan},
  booktitle = {Interspeech},
  year      = {2020}
}
```

## Additional Resources

- **Fine-Tuning Guide:** [Complete tutorial](https://github.com/pierre-cheneau/finetune-moonshine-asr)
- **Original Moonshine:** [UsefulSensors/moonshine-tiny](https://huggingface.co/UsefulSensors/moonshine-tiny)
- **Dataset:** [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- **Issues/Support:** [GitHub Issues](https://github.com/pierre-cheneau/finetune-moonshine-asr/issues)

## License

This model is released under the MIT License, consistent with the base Moonshine model.

```
MIT License

Copyright (c) 2026 Pierre Chéneau (Cornebidouil)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
```

## Acknowledgments

- **Useful Sensors** for the original Moonshine architecture and pre-trained model
- **Meta AI** for the Multilingual LibriSpeech dataset
- **HuggingFace** for the transformers library and model hosting
- **Schedule-Free Learning** for the optimizer implementation

---

**Questions?** Open an issue on the [fine-tuning guide repository](https://github.com/pierre-cheneau/finetune-moonshine-asr) or check the documentation.

**Want to fine-tune for your language?** See the [complete fine-tuning guide](https://github.com/pierre-cheneau/finetune-moonshine-asr).